Dissertations / Theses: 'Graph Mining'

1

Feng, Jing. "Information-theoretic graph mining." Diss., Ludwig-Maximilians-Universität München, 2015. http://nbn-resolving.de/urn:nbn:de:bvb:19-183384.

Full text

Abstract:

Real world data from various application domains can be modeled as a graph, e.g. social networks and biomedical networks like protein interaction networks or co-activation networks of brain regions. A graph is a powerful concept to model arbitrary (structural) relationships among objects. In recent years, the prevalence of social networks has made graph mining an important center of attention in the data mining field. There are many important tasks in graph mining, such as graph clustering, outlier detection, and link prediction. Many algorithms have been proposed in the literature to solve these tasks. However, normally these issues are solved separately, although they are closely related. Detecting and exploiting the relationship among them is a new challenge in graph mining. Moreover, with data explosion, more information has already been integrated into graph structure. For example, bipartite graphs contain two types of node and graphs with node attributes offer additional non-structural information. Therefore, more challenges arise from the increasing graph complexity. This thesis aims to solve these challenges in order to gain new knowledge from graph data. An important paradigm of data mining used in this thesis is the principle of Minimum Description Length (MDL). It follows the assumption: the more knowledge we have learned from the data, the better we are able to compress the data. The MDL principle balances the complexity of the selected model and the goodness of fit between model and data. Thus, it naturally avoids over-fitting. This thesis proposes several algorithms based on the MDL principle to acquire knowledge from various types of graphs: Info-spot (Automatically Spotting Information-rich Nodes in Graphs) proposes a parameter-free and efficient algorithm for the fully automatic detection of interesting nodes which is a novel outlier notion in graph. Then in contrast to traditional graph mining approaches that focus on discovering dense subgraphs, a novel graph mining technique CXprime (Compression-based eXploiting Primitives) is proposed. It models the transitivity and the hubness of a graph using structure primitives (all possible three-node substructures). Under the coding scheme of CXprime, clusters with structural information can be discovered, dominating substructures of a graph can be distinguished, and a new link prediction score based on substructures is proposed. The next algorithm SCMiner (Summarization-Compression Miner) integrates tasks such as graph summarization, graph clustering, link prediction, and the discovery of the hidden structure of a bipartite graph on the basis of data compression. Finally, a method for non-redundant graph clustering called IROC (Information-theoretic non-Redundant Overlapping Clustering) is proposed to smartly combine structural information with non-structural information based on MDL. IROC is able to detect overlapping communities within subspaces of the attributes. To sum up, algorithms to unify different learning tasks for various types of graphs are proposed. Additionally, these algorithms are based on the MDL principle, which facilitates the unification of different graph learning tasks, the integration of different graph types, and the automatic selection of input parameters that are otherwise difficult to estimate.

APA, Harvard, Vancouver, ISO, and other styles

2

Kumar, Rohit 1986. "Temporal graph mining and distributed processing." Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/620623.

Full text

Abstract:

With the recent growth of social media platforms and the human desire to interact with the digital world a lot of human-human and human-device interaction data is getting generated every second. With the boom of the Internet of Things (IoT) devices, a lot of device-device interactions are also now on the rise. All these interactions are nothing but a representation of how the underlying network is connecting different entities over time. These interactions when modeled as an interaction network presents a lot of unique opportunities to uncover interesting patterns and to understand the dynamics of the network. Understanding the dynamics of the network is very important because it encapsulates the way we communicate, socialize, consume information and get influenced. To this end, in this PhD thesis, we focus on analyzing an interaction network to understand how the underlying network is being used. We define interaction network as a sequence of time-stamped interactions E over edges of a static graph G=(V, E). Interaction networks can be used to model many real-world networks for example, in a social network or a communication network, each interaction over an edge represents an interaction between two users, e.g., emailing, making a call, re-tweeting, or in case of the financial network an interaction between two accounts to represent a transaction. We analyze interaction network under two settings. In the first setting, we study interaction network under a sliding window model. We assume a node could pass information to other nodes if they are connected to them using edges present in a time window. In this model, we study how the importance or centrality of a node evolves over time. In the second setting, we put additional constraints on how information flows between nodes. We assume a node could pass information to other nodes only if there is a temporal path between them. To restrict the length of the temporal paths we consider a time window in this approach as well. We apply this model to solve the time-constrained influence maximization problem. By analyzing the interaction network data under our model we find the top-k most influential nodes. We test our model both on human-human interaction using social network data as well as on location-location interaction using location-based social network(LBSNs) data. In the same setting, we also mine temporal cyclic paths to understand the communication patterns in a network. Temporal cycles have many applications and appear naturally in communication networks where one person posts a message and after a while reacts to a thread of reactions from peers on the post. In financial networks, on the other hand, the presence of a temporal cycle could be indicative of certain types of fraud. We provide efficient algorithms for all our analysis and test their efficiency and effectiveness on real-world data. Finally, given that many of the algorithms we study have huge computational demands, we also studied distributed graph processing algorithms. An important aspect of distributed graph processing is to correctly partition the graph data between different machine. A lot of research has been done on efficient graph partitioning strategies but there is no one good partitioning strategy for all kind of graphs and algorithms. Choosing the best partitioning strategy is nontrivial and is mostly a trial and error exercise. To address this problem we provide a cost model based approach to give a better understanding of how a given partitioning strategy is performing for a given graph and algorithm.
Con el reciente crecimiento de las redes sociales y el deseo humano de interactuar con el mundo digital, una gran cantidad de datos de interacción humano-a-humano o humano-a-dispositivo se generan cada segundo. Con el auge de los dispositivos IoT, las interacciones dispositivo-a-dispositivo también están en alza. Todas estas interacciones no son más que una representación de como la red subyacente conecta distintas entidades en el tiempo. Modelar estas interacciones en forma de red de interacciones presenta una gran cantidad de oportunidades únicas para descubrir patrones interesantes y entender la dinamicidad de la red. Entender la dinamicidad de la red es clave ya que encapsula la forma en la que nos comunicamos, socializamos, consumimos información y somos influenciados. Para ello, en esta tesis doctoral, nos centramos en analizar una red de interacciones para entender como la red subyacente es usada. Definimos una red de interacciones como una sequencia de interacciones grabadas en el tiempo E sobre aristas de un grafo estático G=(V, E). Las redes de interacción se pueden usar para modelar gran cantidad de aplicaciones reales, por ejemplo en una red social o de comunicaciones cada interacción sobre una arista representa una interacción entre dos usuarios (correo electrónico, llamada, retweet), o en el caso de una red financiera una interacción entre dos cuentas para representar una transacción. Analizamos las redes de interacción bajo múltiples escenarios. En el primero, estudiamos las redes de interacción bajo un modelo de ventana deslizante. Asumimos que un nodo puede mandar información a otros nodos si estan conectados utilizando aristas presentes en una ventana temporal. En este modelo, estudiamos como la importancia o centralidad de un nodo evoluciona en el tiempo. En el segundo escenario añadimos restricciones adicionales respecto como la información fluye entre nodos. Asumimos que un nodo puede mandar información a otros nodos solo si existe un camino temporal entre ellos. Para restringir la longitud de los caminos temporales también asumimos una ventana temporal. Aplicamos este modelo para resolver este problema de maximización de influencia restringido temporalmente. Analizando los datos de la red de interacción bajo nuestro modelo intentamos descubrir los k nodos más influyentes. Examinamos nuestro modelo en interacciones humano-a-humano, usando datos de redes sociales, como en ubicación-a-ubicación usando datos de redes sociales basades en localización (LBSNs). En el mismo escenario también minamos camínos cíclicos temporales para entender los patrones de comunicación en una red. Existen múltiples aplicaciones para cíclos temporales y aparecen naturalmente en redes de comunicación donde una persona envía un mensaje y después de un tiempo reacciona a una cadena de reacciones de compañeros en el mensaje. En redes financieras, por otro lado, la presencia de un ciclo temporal puede indicar ciertos tipos de fraude. Proponemos algoritmos eficientes para todos nuestros análisis y evaluamos su eficiencia y efectividad en datos reales. Finalmente, dado que muchos de los algoritmos estudiados tienen una gran demanda computacional, también estudiamos los algoritmos de procesado distribuido de grafos. Un aspecto importante de procesado distribuido de grafos es el de correctamente particionar los datos del grafo entre distintas máquinas. Gran cantidad de investigación se ha realizado en estrategias para particionar eficientemente un grafo, pero no existe un particionamento bueno para todos los tipos de grafos y algoritmos. Escoger la mejor estrategia de partición no es trivial y es mayoritariamente un ejercicio de prueba y error. Con tal de abordar este problema, proporcionamos un modelo de costes para dar un mejor entendimiento en como una estrategia de particionamiento actúa dado un grafo y un algoritmo.

APA, Harvard, Vancouver, ISO, and other styles

3

Kumar, Rohit. "Temporal Graph Mining and Distributed Processing." Doctoral thesis, Universite Libre de Bruxelles, 2018. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/271527.

Full text

Abstract:

With the recent growth of social media platforms and the human desire to interact with the digital world a lot of human-human and human-device interaction data is getting generated every second. With the boom of the Internet of Things (IoT) devices, a lot of device-device interactions are also now on the rise. All these interactions are nothing but a representation of how the underlying network is connecting different entities over time. These interactions when modeled as an interaction network presents a lot of unique opportunities to uncover interesting patterns and to understand the dynamics of the network. Understanding the dynamics of the network is very important because it encapsulates the way we communicate, socialize, consume information and get influenced. To this end, in this PhD thesis, we focus on analyzing an interaction network to understand how the underlying network is being used. We define interaction network as a sequence of time-stamped interactions E over edges of a static graph G=(V, E). Interaction networks can be used to model many real-world networks for example, in a social network or a communication network, each interaction over an edge represents an interaction between two users, e.g. emailing, making a call, re-tweeting, or in case of the financial network an interaction between two accounts to represent a transaction.We analyze interaction network under two settings. In the first setting, we study interaction network under a sliding window model. We assume a node could pass information to other nodes if they are connected to them using edges present in a time window. In this model, we study how the importance or centrality of a node evolves over time. In the second setting, we put additional constraints on how information flows between nodes. We assume a node could pass information to other nodes only if there is a temporal path between them. To restrict the length of the temporal paths we consider a time window in this approach as well. We apply this model to solve the time-constrained influence maximization problem. By analyzing the interaction network data under our model we find the top-k most influential nodes. We test our model both on human-human interaction using social network data as well as on location-location interaction using location-based social network(LBSNs) data. In the same setting, we also mine temporal cyclic paths to understand the communication patterns in a network. Temporal cycles have many applications and appear naturally in communication networks where one person posts a message and after a while reacts to a thread of reactions from peers on the post. In financial networks, on the other hand, the presence of a temporal cycle could be indicative of certain types of fraud. We provide efficient algorithms for all our analysis and test their efficiency and effectiveness on real-world data.Finally, given that many of the algorithms we study have huge computational demands, we also studied distributed graph processing algorithms. An important aspect of these algorithms is to correctly partition the graph data between different machines. A lot of research has been done on efficient graph partitioning strategies but there is no one good partitioning strategy for all kind of graphs and algorithms. Choosing the best partitioning strategy is nontrivial and is mostly a trial and error exercise. To address this problem we provide a cost model based approach to give a better understanding of how a given partitioning strategy is performing for a given graph and algorithm.
Doctorat en Sciences de l'ingénieur et technologie
info:eu-repo/semantics/nonPublished

APA, Harvard, Vancouver, ISO, and other styles

4

Niggemann, Oliver. "Visual data mining of graph based data." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=962400505.

Full text

APA, Harvard, Vancouver, ISO, and other styles

5

Runelöv, Martin. "Finding seminal scientific publications with graph mining." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-172382.

Full text

Abstract:

We investigate the applicability of network analysis to the problem of finding seminal publications in scientific publishing. In particular, we focus on the network measures betweenness centrality, the so-called backbone graph, and the burstiness of citations. The metrics are evaluated using precision-related scores with respect to gold standards based on fellow programmes and manual annotation. Citation counts, PageRank, and random selection are used as baselines. We find that the backbone graph provides us with a way to possibly discover seminal publications with low citation count, and combining betweenness and burstiness gives results on par with citation count.
I detta examensarbete undersöks det huruvida analys av citeringsgrafer kan användas för att finna betydelsefulla vetenskapliga publikationer. Framför allt studeras ”betweenness”-centralitet, den så kallade ”backbone”-grafen samt ”burstiness” av citeringar. Dessa mått utvärderas med hjälp av precisionsmått med avseende på guldstandarder baserade på ’fellow’-program samt via manuell annotering. Antal citeringar, PageRank, och slumpmässigt urval används som jämförelse. Resultaten visar att ”backbone”-grafen kan bidra till att eventuellt upptäcka betydelsefulla publikationer med ett lågt antal citeringar samt att en kombination av ”betweenness” och ”burstiness” ger resultat i nivå med de man får av att räkna antal citeringar.

APA, Harvard, Vancouver, ISO, and other styles

6

Diot, Fabien. "Graph mining for object tracking in videos." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4009/document.

Full text

Abstract:

Détecter et suivre les objets principaux d’une vidéo est une étape nécessaire en vue d’en décrire le contenu pour, par exemple, permettre une indexation judicieuse des données multimédia par les moteurs de recherche. Les techniques de suivi d’objets actuelles souffrent de défauts majeurs. En effet, soit elles nécessitent que l’utilisateur désigne la cible a suivre, soit il est nécessaire d’utiliser un classifieur pré-entraîné à reconnaitre une classe spécifique d’objets, comme des humains ou des voitures. Puisque ces méthodes requièrent l’intervention de l’utilisateur ou une connaissance a priori du contenu traité, elles ne sont pas suffisamment génériques pour être appliquées aux vidéos amateurs telles qu’on peut en trouver sur YouTube. Pour résoudre ce problème, nous partons de l’hypothèse que, dans le cas de vidéos dont l’arrière-plan n’est pas fixe, celui-ci apparait moins souvent que les objets intéressants. De plus, dans une vidéo, la topologie des différents éléments visuels composant un objet est supposée consistante d’une image a l’autre. Nous représentons chaque image par un graphe plan modélisant sa topologie. Ensuite, nous recherchons des motifs apparaissant fréquemment dans la base de données de graphes plans ainsi créée pour représenter chaque vidéo. Cette approche nous permet de détecter et suivre les objets principaux d’une vidéo de manière non supervisée en nous basant uniquement sur la fréquence des motifs. Nos contributions sont donc réparties entre les domaines de la fouille de graphes et du suivi d’objets. Dans le premier domaine, notre première contribution est de présenter un algorithme de fouille de graphes plans efficace, appelé PLAGRAM. Cet algorithme exploite la planarité des graphes et une nouvelle stratégie d’extension des motifs. Nous introduisons ensuite des contraintes spatio-temporelles au processus de fouille afin d’exploiter le fait que, dans une vidéo, les objets se déplacent peu d’une image a l’autre. Ainsi, nous contraignons les occurrences d’un même motif a être proches dans l’espace et dans le temps en limitant le nombre d’images et la distance spatiale les séparant. Nous présentons deux nouveaux algorithmes, DYPLAGRAM qui utilise la contrainte temporelle pour limiter le nombre de motifs extraits, et DYPLAGRAM_ST qui extrait efficacement des motifs spatio-temporels fréquents depuis les bases de données représentant les vidéos. Dans le domaine du suivi d’objets, nos contributions consistent en deux approches utilisant les motifs spatio-temporels pour suivre les objets principaux dans les vidéos. La première est basée sur une recherche du chemin de poids minimum dans un graphe connectant les motifs spatio-temporels tandis que l’autre est basée sur une méthode de clustering permettant de regrouper les motifs pour suivre les objets plus longtemps. Nous présentons aussi deux applications industrielles de notre méthode
Detecting and following the main objects of a video is necessary to describe its content in order to, for example, allow for a relevant indexation of the multimedia content by the search engines. Current object tracking approaches either require the user to select the targets to follow, or rely on pre-trained classifiers to detect particular classes of objects such as pedestrians or car for example. Since those methods rely on user intervention or prior knowledge of the content to process, they cannot be applied automatically on amateur videos such as the ones found on YouTube. To solve this problem, we build upon the hypothesis that, in videos with a moving background, the main objects should appear more frequently than the background. Moreover, in a video, the topology of the visual elements composing an object is supposed consistent from one frame to another. We represent each image of the videos with plane graphs modeling their topology. Then, we search for substructures appearing frequently in the database of plane graphs thus created to represent each video. Our contributions cover both fields of graph mining and object tracking. In the first field, our first contribution is to present an efficient plane graph mining algorithm, named PLAGRAM. This algorithm exploits the planarity of the graphs and a new strategy to extend the patterns. The next contributions consist in the introduction of spatio-temporal constraints into the mining process to exploit the fact that, in a video, the motion of objects is small from on frame to another. Thus, we constrain the occurrences of a same pattern to be close in space and time by limiting the number of frames and the spatial distance separating them. We present two new algorithms, DYPLAGRAM which makes use of the temporal constraint to limit the number of extracted patterns, and DYPLAGRAM_ST which efficiently mines frequent spatio-temporal patterns from the datasets representing the videos. In the field of object tracking, our contributions consist in two approaches using the spatio-temporal patterns to track the main objects in videos. The first one is based on a search of the shortest path in a graph connecting the spatio-temporal patterns, while the second one uses a clustering approach to regroup them in order to follow the objects for a longer period of time. We also present two industrial applications of our method

APA, Harvard, Vancouver, ISO, and other styles

7

Wang, Guan. "Graph-Based Approach on Social Data Mining." Thesis, University of Illinois at Chicago, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3668648.

Full text

Abstract:

Powered by big data infrastructures, social network platforms are gathering data on many aspects of our daily lives. The online social world is reflecting our physical world in an increasingly detailed way by collecting people's individual biographies and their various of relationships with other people. Although massive amount of social data has been gathered, an urgent challenge remain unsolved, which is to discover meaningful knowledge that can empower the social platforms to really understand their users from different perspectives.

Motivated by this trend, my research addresses the reasoning and mathematical modeling behind interesting phenomena on social networks. Proposing graph based data mining framework regarding to heterogeneous data sources is the major goal of my research. The algorithms, by design, utilize graph structure with heterogeneous link and node features to creatively represent social networks' basic structures and phenomena on top of them.

The graph based heterogeneous mining methodology is proved to be effective on a series of knowledge discovery topics, including network structure and macro social pattern mining such as magnet community detection (87), social influence propagation and social similarity mining (85), and spam detection (86). The future work is to consider dynamic relation on social data mining and how graph based approaches adapt from the new situations.

APA, Harvard, Vancouver, ISO, and other styles

8

Giatsidis, Christos. "Graph Mining and Community Evaluation with Degeneracy." Palaiseau, Ecole polytechnique, 2013. http://pastel.archives-ouvertes.fr/docs/00/95/96/15/PDF/thesisA.pdf.

Full text

Abstract:

L'étude et l'analyse des réseaux sociaux attirent l'attention d'une variété de sciences (psychologie, statistiques, sociologie). Parmi elles, le domaine de la fouille de données offre des outils pour extraire automatiquement des informations utiles sur les propriétés de ces réseaux. Plus précisément, la fouille de graphes répond au besoin de modéliser et d'étudier les réseaux sociaux en particulier dans le cas des grandes communautés que l'on trouve habituellement dans les médias en ligne oú la taille des réseaux sociaux est trop grande pour les méthodes manuelles. La modélisation générale d'un réseau social est basée sur des structures de graphes. Les sommets du graphe représentent les individus et les arêtes des actions différentes ou des types de liens sociaux entre les individus. Une communauté est définie comme un sous-graphe (d'un réseau social) et se caractérise par des liens denses. Plusieurs mesures ont été précédemment proposées pour l'évaluation des divers aspects de la qualité de ces communautés mais la plupart d'entre elles ignorent diverses propriétés des interactions entre individus (par exemple l'orientation de ces liens). Dans la recherche présentée ici, le concept de "k-core" est utilisé comme un moyen d'évaluer les communautés et d'en extraire des informations. La structure de "k-core" mesure la robustesse d'un réseau non orienté en utilisant la dégénérescence du graphe. En outre, des extensions du principe de dégénérescence sont introduites pour des réseaux dont les arêtes possèdent plus d'informations que celles non orientées. Le point de départ est l'exploration des attributs qui peuvent être extraits des graphes non orientés (réseaux sociaux). Sur ce point, la dégénérescence est utilisée pour évaluer les caractéristiques d'une collaboration entre individus et sur l'ensemble de la communauté - une propriété non capturée par les métriques sur les sommets individuels ou par les métriques d'évaluation communautaires traditionnelles. Ensuite, cette méthode est étendue aux graphes pondérés, orientés et signés afin d'offrir de nouvelles mesures d'évaluation pour les réseaux sociaux. Ces nouvelles fonctionnalités apportent des outils de mesure de la collaboration dans les réseaux sociaux oú l'on peut attribuer un poids ou un orientation à une interaction et fournir des moyens alternatifs pour capturer l'importance des individus au sein d'une communauté. Pour les graphes signés, l'extension de la dégénérescence permet de proposer des métriques supplémentaires qui peuvent être utilisées pour modéliser la confiance. De plus, nous introduisons une approche de partitionnement basée sur le traitement du graphe de manière hiérarchique, hiérarchie fournie par le principe de "core expansion sequence" qui partitionne le graphe en différents niveaux ordonnés conformément à la décomposition "k-core". Les modèles théoriques de graphes sont ensuite appliqués sur des graphes du monde réel pour examiner les tendances et les comportements. Les jeux de données explorés incluent des graphes de collaborations scientifiques et des graphes de citations (DBLP et ARXIV), une instance de graphe interne de Wikipédia et des réseaux basés sur la confiance entre les individus (par exemple Epinions et Slashdot). Les conclusions sur ces ensembles de données sont significatives et les modèles proposés offrent des résultats intuitifs
The study and analysis of social networks attract attention from a variety of Sciences (psychology, statistics, sociology). Among them, the field of Data Mining offers tools to automatically extract useful information on properties of those networks. More specifically, Graph Mining serves the need to model and investigate social networks especially in the case of large communities - usually found in online media - where social networks are prohibitively large for non-automated methodologies. The general modeling of a social network is based on graph structures. Nodes of the graph represent individuals and edges signify different actions or types of social connections between them. A community is defined as a subgraph (of a social network) and is characterized by dense connections. Various measures have been proposed to evaluate different quality aspects of such communities - in most cases ignoring various properties of the connections (e. G. Directionality). In the work presented here, the k-core concept is used as a means to evaluate communities and extract information. The k-core structure essentially measures the robustness of an undirected network through degeneracy. Further more extensions of degeneracy are introduced to networks that their edges offer more information than the undirected type. Starting point is the exploration of properties that can be extracted from undirected graphs (of social networks). On this, degeneracy is used to evaluate collaboration features - a property not captured by the single node metrics or by the established community evaluation metrics - of both individuals and the entire community. Next, this process is extended for weighted, directed and signed graphs offering a plethora of novel evaluation metrics for social networks. These new features offer measurement tools for collaboration in social networks where we can assign a weight or a direction to a connection and provide alternative ways to signify the importance of individuals within a community. For signed graphs the extension of degeneracy offers additional metrics that can be used for trust management. Moreover, a clustering approach is introduced which capitalizes on processing the graph in a hierarchical manner provided by its core expansion sequence, an ordered partition of the graph into different levels according to the k-core decomposition The graph theoretical models are then applied in real world graphs to investigate trends and behaviors. The datasets explored include scientific collaboration and citation graphs (DBLP and ARXIV), a snapshot of Wikipedia's inner graph and trust networks (e. G. Epinions and Slashdot). The findings on these datasets are interesting and the proposed models offer intuitive results

APA, Harvard, Vancouver, ISO, and other styles

9

Schenker, Adam. "Graph-Theoretic Techniques for Web Content Mining." [Tampa, Fla.] : University of South Florida, 2003. http://purl.fcla.edu/fcla/etd/SFE0000143.

Full text

APA, Harvard, Vancouver, ISO, and other styles

10

De, Lara Nathan. "Algorithmic and software contributions to graph mining." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT029.

Full text

Abstract:

Depuis l'invention du PageRank par Google pour les requêtes Web à la fin des années 1990, les algorithmes de graphe font partie de notre quotidien. Au milieu des années 2000, l'arrivée des réseaux sociaux a amplifié ce phénomène, élargissant toujours plus les cas d'usage de ces algorithmes. Les relations entre entités peuvent être de multiples sortes : relations symétriques utilisateur-utilisateur pour Facebook ou LinkedIn, relations asymétriques follower-followee pour Twitter, ou encore, relations bipartites utilisateur-contenu pour Netflix ou Amazon. Toutes soulèvent des problèmes spécifiques et les applications sont nombreuses : calcul de centralité pour la mesure d'influence, le partitionnement de nœuds pour la fouille de données, la classification de nœuds pour les recommandations ou l'embedding pour la prédiction de liens en sont quelques exemples.En parallèle, les conditions d'utilisation des algorithmes de graphe sont devenues plus contraignantes. D'une part, les jeux de données toujours plus gros avec des millions d'entités et parfois des milliards de relations limite la complexité asymptotique des algorithmes pour les applications industrielles. D'autre part, dans la mesure où ces algorithmes influencent nos vies, les exigences d'explicabilité et d'équité dans le domaine de l'intelligence artificielle augmentent. Les algorithmes de graphe ne font pas exception à la règle. L'Union européenne a par exemple publié un guide de conduite pour une IA fiable. Ceci implique de pousser encore plus loin l'analyse des modèles actuels, voire d'en proposer de nouveaux.Cette thèse propose des réponses ciblées via l'analyse d'algorithmes classiques, mais aussi de leurs extensions et variantes, voire d'algorithmes originaux. La capacité à passer à l'échelle restant un critère clé. Dans le sillage de ce que le projet Scikit-learn propose pour l'apprentissage automatique sur données vectorielles, nous estimons qu'il est important de rendre ces algorithmes accessibles au plus grand nombre et de démocratiser la manipulation de graphes. Nous avons donc développé un logiciel libre, Scikit-network, qui implémente et documente ces algorithmes de façon simple et efficace. Grâce à cet outil, nous pouvons explorer plusieurs tâches classiques telles que l'embedding de graphe, le partitionnement, ou encore la classification semi-supervisée
Since the introduction of Google's PageRank method for Web searches in the late 1990s, graph algorithms have been part of our daily lives. In the mid 2000s, the arrival of social networks has amplified this phenomenon, creating new use-cases for these algorithms. Relationships between entities can be of multiple types: user-user symmetric relationships for Facebook or LinkedIn, follower-followee asymmetric ones for Twitter or even user-content bipartite ones for Netflix or Amazon. They all come with their own challenges and the applications are numerous: centrality calculus for influence measurement, node clustering for knowledge discovery, node classification for recommendation or embedding for link prediction, to name a few.In the meantime, the context in which graph algorithms are applied has rapidly become more constrained. On the one hand, the increasing size of the datasets with millions of entities, and sometimes billions of relationships, bounds the asymptotic complexity of the algorithms for industrial applications. On the other hand, as these algorithms affect our daily lives, there is a growing demand for explanability and fairness in the domain of artificial intelligence in general. Graph mining is no exception. For example, the European Union has published a set of ethics guidelines for trustworthy AI. This calls for further analysis of the current models and even new ones.This thesis provides specific answers via a novel analysis of not only standard, but also extensions, variants, and original graph algorithms. Scalability is taken into account every step of the way. Following what the Scikit-learn project does for standard machine learning, we deem important to make these algorithms available to as many people as possible and participate in graph mining popularization. Therefore, we have developed an open-source software, Scikit-network, which implements and documents the algorithms in a simple and efficient way. With this tool, we cover several areas of graph mining such as graph embedding, clustering, and semi-supervised node classification

APA, Harvard, Vancouver, ISO, and other styles

11

Casas, Roma Jordi. "Privacy-preserving and data utility in graph mining." Doctoral thesis, Universitat Autònoma de Barcelona, 2014. http://hdl.handle.net/10803/285566.

Full text

Abstract:

En los últimos años, ha sido puesto a disposición del público una gran cantidad de los datos con formato de grafo. Incrustado en estos datos hay información privada acerca de los usuarios que aparecen en ella. Por lo tanto, los propietarios de datos deben respetar la privacidad de los usuarios antes de liberar los conjuntos de datos a terceros. En este escenario, los procesos de anonimización se convierten en un proceso muy importante. Sin embargo, los procesos de anonimización introducen, generalmente, algún tipo de ruido en los datos anónimos y también en sus resultados en procesos de minería de datos. Generalmente, cuanto mayor la privacidad, mayor será el ruido. Por lo tanto, la utilidad de los datos es un factor importante a tener en cuenta en los procesos de anonimización. El equilibrio necesario entre la privacidad de datos y utilidad de éstos puede mejorar mediante el uso de medidas y métricas para guiar el proceso de anonimización, de tal forma que se minimice la pérdida de información. En esta tesis hemos trabajo los campos de la preservación de la privacidad del usuario en las redes sociales y la utilidad y calidad de los datos publicados. Un compromiso entre ambos campos es un punto crítico para lograr buenos métodos de anonimato, que permitan mejorar los posteriores procesos de minería de datos. Parte de esta tesis se ha centrado en la utilidad de los datos y la pérdida de información. En primer lugar, se ha estudiado la relación entre las medidas de pérdida de información genéricas y las específicas basadas en clustering, con el fin de evaluar si las medidas genéricas de pérdida de información son indicativas de la utilidad de los datos para los procesos de minería de datos posteriores. Hemos encontrado una fuerte correlación entre algunas medidas genéricas de pérdida de información (average distance, betweenness centrality, closeness centrality, edge intersection, clustering coefficient y transitivity) y el índice de precisión en los resultados de varios algoritmos de clustering, lo que demuestra que estas medidas son capaces de predecir el perturbación introducida en los datos anónimos. En segundo lugar, se han presentado dos medidas para reducir la pérdida de información en los procesos de modificación de grafos. La primera, Edge neighbourhood centrality, se basa en el flujo de información de a través de la vecindad a distancia 1 de una arista específica. El segundo se basa en el core number sequence y permite conservar mejor la estructura subyacente, mejorando la utilidad de los datos. Hemos demostrado que ambos métodos son capaces de preservar las aristas más importantes del grafo, manteniendo mejor las propiedades básicas estructurales y espectrales. El otro tema importante de esta tesis ha sido los métodos de preservación de la privacidad. Hemos presentado nuestro algoritmo de base aleatoria, que utiliza el concepto de Edge neighbourhood centrality para guiar el proceso de modificación preservando los bordes más importantes del grafo, logrando una menor pérdida de información y una mayor utilidad de los datos. Por último, se han desarrollado dos algoritmos diferentes para el k-anonimato en los grafos. En primer lugar, se ha presentado un algoritmo basado en la computación evolutiva. Aunque este método nos permite cumplir el nivel de privacidad deseado, presenta dos inconvenientes: la pérdida de información es bastante grande en algunas propiedades estructurales del grafo y no es lo suficientemente rápido para trabajar con grandes redes. Por lo tanto, un segundo algoritmo se ha presentado, que utiliza el micro-agregación univariante para anonimizar la secuencia de grados. Este método es cuasi-óptimo y se traduce en una menor pérdida de información y una mejor utilidad de los datos.
In recent years, an explosive increase of graph-formatted data has been made publicly available. Embedded within this data there is private information about users who appear in it. Therefore, data owners must respect the privacy of users before releasing datasets to third parties. In this scenario, anonymization processes become an important concern. However, anonymization processes usually introduce some kind of noise in the anonymous data, altering the data and also their results on graph mining processes. Generally, the higher the privacy, the larger the noise. Thus, data utility is an important factor to consider in anonymization processes. The necessary trade-off between data privacy and data utility can be reached by using measures and metrics to lead the anonymization process to minimize the information loss, and therefore, to maximize the data utility. In this thesis we have covered the fields of user's privacy-preserving in social networks and the utility and quality of the released data. A trade-off between both fields is a critical point to achieve good anonymization methods for the subsequent graph mining processes. Part of this thesis has focused on data utility and information loss. Firstly, we have studied the relation between the generic information loss measures and the clustering-specific ones, in order to evaluate whether the generic information loss measures are indicative of the usefulness of the data for subsequent data mining processes. We have found strong correlation between some generic information loss measures (average distance, betweenness centrality, closeness centrality, edge intersection, clustering coefficient and transitivity) and the precision index over the results of several clustering algorithms, demonstrating that these measures are able to predict the perturbation introduced in anonymous data. Secondly, two measures to reduce the information loss on graph modification processes have been presented. The first one, Edge neighbourhood centrality, is based on information flow throw 1-neighbourhood of a specific edge in the graph. The second one is based on the core number sequence and it preserves better the underlying graph structure, retaining more data utility. By an extensive experimental set up, we have demonstrated that both methods are able to preserve the most important edges in the network, keeping the basic structural and spectral properties close to the original ones. The other important topic of this thesis has been privacy-preserving methods. We have presented our random-based algorithm, which utilizes the concept of Edge neighbourhood centrality to drive the edge modification process to better preserve the most important edges in the graph, achieving lower information loss and higher data utility on the released data. Our method obtains a better trade-off between data utility and data privacy than other methods. Finally, two different approaches for k-degree anonymity on graphs have been developed. First, an algorithm based on evolutionary computing has been presented and tested on different small and medium real networks. Although this method allows us to fulfil the desired privacy level, it presents two main drawbacks: the information loss is quite large in some graph structural properties and it is not fast enough to work with large networks. Therefore, a second algorithm has been presented, which uses the univariate micro-aggregation to anonymize the degree sequence and reduce the distance from the original one. This method is quasi-optimal and it results in lower information loss and better data utility.

APA, Harvard, Vancouver, ISO, and other styles

12

Alkan, Sertan. "A Distributed Graph Mining Framework Based On Mapreduce." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12611588/index.pdf.

Full text

Abstract:

The frequent patterns hidden in a graph can reveal crucial information about the network the graph represents. Existing techniques to mine the frequent subgraphs in a graph database generally rely on the premise that the data can be fit into main memory of the device that the computation takes place. Even though there are some algorithms that are designed using highly optimized methods to some extent, many lack the solution to the problem of scalability. In this thesis work, our aim is to find and enumerate the subgraphs that are at least as frequent as the designated threshold in a given graph. Here, we propose a new distributed algorithm for frequent subgraph mining problem that can scale horizontally as the computing cluster size increases. The method described here, uses a partitioning method and Map/Reduce programming model to distribute the computation of frequent subgraphs. In the core of this algorithm, we make use of an existing graph partitioning method to split the given data in the distributed file system and to merge and join the computed subgraphs without losing information. The frequent subgraph computation in each split is done using another known method which can enumerate the frequent patterns. Although current algorithms can efficiently find frequent patterns, they are not parallel or distributed algorithms in that even when they partition the data, they are designed to work on a single machine. Furthermore, these algorithms are computationally expensive but not fault tolerant and are not designed to work on a distributed file system. Using the Map/Reduce paradigm, we distribute the computation of frequent patterns to every machine in a cluster. Our algorithm, first bi-partitions the data via successive Map/Reduce jobs, then invokes another Map/Reduce job to compute the subgraphs in partitions using CloseGraph, recovers the whole set by invoking a series of Map/Reduce jobs to merge-join the previously found patterns. The implementation uses an open source Map/Reduce environment, Hadoop. In our experiments, our method can scale up to large graphs, as the graph data size gets bigger, this method performs better than the existing algorithms.

APA, Harvard, Vancouver, ISO, and other styles

13

Okuno, Shingo. "Parallelization of Graph Mining using Backtrack Search Algorithm." 京都大学 (Kyoto University), 2017. http://hdl.handle.net/2433/225743.

Full text

APA, Harvard, Vancouver, ISO, and other styles

14

Di, Jinchao. "Gene Co-expression Network Mining Using Graph Sparsification." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1367583964.

Full text

APA, Harvard, Vancouver, ISO, and other styles

15

Faisal, S. M. "Towards Energy Efficient Data Mining & Graph Processing." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440364739.

Full text

APA, Harvard, Vancouver, ISO, and other styles

16

Liu, Haishan, and Haishan Liu. "A Graph-based Approach for Semantic Data Mining." Thesis, University of Oregon, 2012. http://hdl.handle.net/1794/12567.

Full text

Abstract:

Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. It is widely acknowledged that the role of domain knowledge in the discovery process is essential. However, the synergy between domain knowledge and data mining is still at a rudimentary level. This motivates me to develop a framework for explicit incorporation of domain knowledge in a data mining system so that insights can be drawn from both data and domain knowledge. I call such technology "semantic data mining." Recent research in knowledge representation has led to mature standards such as the Web Ontology Language (OWL) by the W3C's Semantic Web initiative. Semantic Web ontologies have become a key technology for knowledge representation and processing. The OWL ontology language is built on the W3C's Resource Description Framework (RDF) that provides a simple model to describe information resources as a graph. On the other hand, there has been a surge of interest in tackling data mining problems where objects of interest can be best described as a graph of interrelated nodes. I notice that the interface between domain knowledge and data mining can be achieved by using graph representations. Therefore I explore a graph-based approach for modeling both knowledge and data and for analyzing the combined information source from which insight can be drawn systematically. In summary, I make three main contributions in this dissertation to achieve semantic data mining. First, I develop an information integration solution based on metaheuristic optimization when data mining task require accessing heterogeneous data sources. Second, I describe how a graph interface for both domain knowledge and data can be structured by employing the RDF model and its graph representations. Finally, I describe several graph theoretic analysis approaches for mining the combined information source. I showcase the utility of the proposed methods on finding semantically associated itemsets, a particular case of the frequent pattern mining. I believe these contributions in semantic data mining can provide a novel and useful way to incorporate domain knowledge. This dissertation includes published and unpublished coauthored material.

APA, Harvard, Vancouver, ISO, and other styles

17

Rossi, Maria. "Graph Mining for Influence Maximization in Social Networks." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLX083/document.

Full text

Abstract:

La science moderne des graphes est apparue ces dernières années comme un domaine d'intérêt et a apporté des progrès significatifs à notre connaissance des réseaux. Jusqu'à récemment, les algorithmes d'exploration de données existants étaient destinés à des données structurées / relationnelles, alors que de nombreux ensembles de données nécessitent une représentation graphique, comme les réseaux sociaux, les réseaux générés par des données textuelles, les structures protéiques 3D ou encore les composés chimiques. Il est donc crucial de pouvoir extraire des informations pertinantes à partir de ce type de données et, pour ce faire, les méthodes d'extraction et d'analyse des graphiques ont été prouvées essentielles.L'objectif de cette thèse est d'étudier les problèmes dans le domaine de la fouille de graphes axés en particulier sur la conception de nouveaux algorithmes et d'outils liés à la diffusion d'informations et plus spécifiquement sur la façon de localiser des entités influentes dans des réseaux réels. Cette tâche est cruciale dans de nombreuses applications telles que la diffusion de l'information, les contrôles épidémiologiques et le marketing viral.Dans la première partie de la thèse, nous avons étudié les processus de diffusion dans les réseaux sociaux ciblant la recherche de caractéristiques topologiques classant les entités du réseau en fonction de leurs capacités influentes. Nous nous sommes spécifiquement concentrés sur la décomposition K-truss qui est une extension de la décomposition k-core. On a montré que les noeuds qui appartiennent au sous-graphe induit par le maximal K-truss présenteront de meilleurs proprietés de propagation par rapport aux critères de référence. De tels épandeurs ont la capacité non seulement d'influencer une plus grande partie du réseau au cours des premières étapes d'un processus d'étalement, mais aussi de contaminer une plus grande partie des noeuds.Dans la deuxième partie de la thèse, nous nous sommes concentrés sur l'identification d'un groupe de noeuds qui, en agissant ensemble, maximisent le nombre attendu de nœuds influencés à la fin du processus de propagation, formellement appelé Influence Maximization (IM). Le problème IM étant NP-hard, il existe des algorithmes efficaces garantissant l’approximation de ses solutions. Comme ces garanties proposent une approximation gloutonne qui est coûteuse en termes de temps de calcul, nous avons proposé l'algorithme MATI qui réussit à localiser le groupe d'utilisateurs qui maximise l'influence, tout en étant évolutif. L'algorithme profite des chemins possibles créés dans le voisinage de chaque nœud et précalcule l'influence potentielle de chaque nœud permettant ainsi de produire des résultats concurrentiels, comparés à ceux des algorithmes classiques.Finallement, nous étudions le point de vue de la confidentialité quant au partage de ces bons indicateurs d’influence dans un réseau social. Nous nous sommes concentrés sur la conception d'un algorithme efficace, correct, sécurisé et de protection de la vie privée, qui résout le problème du calcul de la métrique k-core qui mesure l'influence de chaque noeud du réseau. Nous avons spécifiquement adopté une approche de décentralisation dans laquelle le réseau social est considéré comme un système Peer-to-peer (P2P). L'algorithme est construit de telle sorte qu'il ne devrait pas être possible pour un nœud de reconstituer partiellement ou entièrement le graphe en utilisant les informations obtiennues lors de son exécution. Notre contribution est un algorithme incrémental qui résout efficacement le problème de maintenance de core en P2P tout en limitant le nombre de messages échangés et les calculs. Nous fournissons également une étude de sécurité et de confidentialité de la solution concernant la désanonymisation des réseaux, nous montrons ainsi la rélation avec les strategies d’attaque précédemment definies tout en discutant les contres-mesures adaptés
Modern science of graphs has emerged the last few years as a field of interest and has been bringing significant advances to our knowledge about networks. Until recently the existing data mining algorithms were destined for structured/relational data while many datasets exist that require graph representation such as social networks, networks generated by textual data, 3D protein structures and chemical compounds. It has become therefore of crucial importance to be able to extract meaningful information from that kind of data and towards this end graph mining and analysis methods have been proven essential. The goal of this thesis is to study problems in the area of graph mining focusing especially on designing new algorithms and tools related to information spreading and specifically on how to locate influential entities in real-world networks. This task is crucial in many applications such as information diffusion, epidemic control and viral marketing. In the first part of the thesis, we have studied spreading processes in social networks focusing on finding topological characteristics that rank entities in the network based on their influential capabilities. We have specifically focused on the K-truss decomposition which is an extension of the core decomposition of the graph. Extensive experimental analysis showed that the nodes that belong to the maximal K-truss subgraph show a better spreading behavior when compared to baseline criteria. Such spreaders can influence a greater part of the network during the first steps of a spreading process but also the total fraction of the influenced nodes at the end of the epidemic is greater. We have also observed that node members of such dense subgraphs are those achieving the optimal spreading in the network.In the second part of the thesis, we focused on identifying a group of nodes that by acting all together maximize the expected number of influenced nodes at the end of the spreading process, formally called Influence Maximization (IM). The IM problem is actually NP-hard though there exist approximation guarantees for efficient algorithms that can solve the problem while obtaining a solution within the 63% of optimal classes of models. As those guarantees propose a greedy approximation which is computationally expensive especially for large graphs, we proposed the MATI algorithm which succeeds in locating the group of users that maximize the influence while also being scalable. The algorithm takes advantage the possible paths created in each node’s neighborhood to precalculate each node’s potential influence and produces competitive results in quality compared to those of baseline algorithms such as the Greedy, LDAG and SimPath. In the last part of the thesis, we study the privacy point of view of sharing such metrics that are good influential indicators in a social network. We have focused on designing an algorithm that addresses the problem of computing through an efficient, correct, secure, and privacy-preserving algorithm the k-core metric which measures the influence of each node of the network. We have specifically adopted a decentralization approach where the social network is considered as a Peer-to-peer (P2P) system. The algorithm is built based on the constraint that it should not be possible for a node to reconstruct partially or entirely the graph using the information they obtain during its execution. While a distributed algorithm that computes the nodes’ coreness is already proposed, dynamic networks are not taken into account. Our main contribution is an incremental algorithm that efficiently solves the core maintenance problem in P2P while limiting the number of messages exchanged and computations. We provide a security and privacy analysis of the solution regarding network de-anonimization and show how it relates to previously defined attacks models and discuss countermeasures

APA, Harvard, Vancouver, ISO, and other styles

18

Berlingerio, Michele. "Graph and network data: mining the temporal dimension." Thesis, IMT Alti Studi Lucca, 2009. http://e-theses.imtlucca.it/21/1/Berlingerio_phdthesis.pdf.

Full text

Abstract:

In the last years, there have been many studies on analyzing network and graph data. A wide range of problems, such as studying the global and local properties of a graph, finding interesting structures, modeling particular characteristics, assessing the properties of some particular networks such as the Web or a co-authorship networks, have increased the attention of the scientific community, involved in finding efficient and powerful techniques to enable the achievement of the desired results. For example, with the aim of finding interesting and frequent substructures in graphs, algorithms such as AGM, FSG, gSpan, Gaston, FFSMY, ADI-Mine, HSIGRAM and VSIGRAM have been presented for improving scalability on mining subgraphs one after one. However, only in the last few years the attention has moved to a particular aspect of graphs and networks: the temporal dimension. Thanks also to the larger availability of online social network services, the amount of data that allows for the analysis of the dynamics of complex networks has increased very fast in the last five years. This kind of data contains rich information about what happens to a network during time, and enables the analysts to model and discover interesting properties related to the temporal dimension, which are both meaningless and impossible in the static setting. The temporal dimension can play a double role for a network. First, the underlying structure, namely the graph, can evolve over time, showing new users joining a community, new connections created among users, change of properties of a particular group of people, and so on. Second, given an established network, users may perform actions during time, leading to flows of information circulating among the connections, sequences of tasks performed by a sequence of users, spread of influence among the network, and so on. Despite the clear richness of the above setting, the current graph mining techniques are somehow too generic, and they do not explicitly take into consideration the time during their stages. In order to overcome to this problem, in this thesis we study the current graph mining algorithms, we study the possibility of pushing constraints during the computation that would allow us to efficiently analyze the temporal dimension at mining stage, and we develop new techniques that can help in this kind of analysis. In order to prove the effectiveness of our approach, we apply a pre-existent graph miner, a modified version of it specialized to deal with the temporal dimension, and another pre-existent tool of analysis, namely the Temporally Annotated Sequences framework, to real data, to show how we can deal with the above setting, with particular focus on problems such as mining the information propagation in a network, mining graph evolution rules, and mining the temporal dimension of process logs to derive the actual workflow diagram in a process. Our results justify the need for this approach, and show that specialized techniques help in modeling and analyzing temporal graph and network data.

APA, Harvard, Vancouver, ISO, and other styles

19

Kang, U. "Mining Tera-Scale Graphs: Theory, Engineering and Discoveries." Research Showcase @ CMU, 2012. http://repository.cmu.edu/dissertations/160.

Full text

Abstract:

How do we find patterns and anomalies, on graphs with billions of nodes and edges, which do not fit in memory? How to use parallelism for such Tera- or Peta-scale graphs? In this thesis, we propose PEGASUS, a large scale graph mining system implemented on the top of the HADOOP platform, the open source version of MAPREDUCE. PEGASUS includes algorithms which help us spot patterns and anomalous behaviors in large graphs. PEGASUS enables the structure analysis on large graphs. We unify many different structure analysis algorithms, including the analysis on connected components, PageRank, and radius/diameter, into a general primitive called GIM-V. GIM-V is highly optimized, achieving good scale-up on the number of edges and available machines. We discover surprising patterns using GIM-V, including the 7-degrees of separation in one of the largest publicly available Web graphs, with 7 billion edges. PEGASUS also enables the inference and the spectral analysis on large graphs. We design an efficient distributed belief propagation algorithm which infer the states of unlabeled nodes given a set of labeled nodes. We also develop an eigensolver for computing top k eigenvalues and eigenvectors of the adjacency matrices of very large graphs. We use the eigensolver to discover anomalous adult advertisers in the who-follows-whom Twitter graph with 3 billion edges. In addition, we develop an efficient tensor decomposition algorithm and use it to analyze a large knowledge base tensor. Finally, PEGASUS allows the management of large graphs. We propose efficient graph storage and indexing methods to answer graph mining queries quickly. We also develop an edge layout algorithm for better compressing graphs.

APA, Harvard, Vancouver, ISO, and other styles

20

Ingalalli, Vijay. "Querying and Mining Multigraphs." Thesis, Montpellier, 2017. http://www.theses.fr/2017MONTS080/document.

Full text

Abstract:

Avec des volumes de données et d’informations de plus en plus importants, des données de plus en plus complexes et fortement inter-reliées, l’extraction de connaissances reste un véritable défi. Les graphes offrent actuellement un support de représentation efficace pour représenter ces données. Parmi les approches existantes, les multi-graphes ont montré que leur pouvoir d’expression était particulièrement adapté pour manipuler des données complexes possédant de nombreux types de relations entre elles. Cette thèse aborde deux aspects principaux liés aux multigraphes : la recherche de sous graphes et la fouille de sous graphes fréquents dans des multigraphes.Elle propose trois propositions dans le domaines du requêtage et de la fouille de données.La première contribution s’inscrit dans la recherche de sous graphes et concerne l’isomorphisme de sous graphes dans des multigraphes. Cette approche peut, par exemple, être appliquée dans de nombreux domaines d’applications comme l’analyse d’images satellites ou de réseaux sociaux. Dans la seconde, nous nous intéressons aux graphes de connaissances et abordons la problématique de l’homorphisme de graphes dans des multigraphes RDF. Dans les deux contributions, nous proposons de nouvelles techniques d’indexations pour représenter efficacement les informations contenues dans les multigraphes. La recherche des sous graphes tire avantage de ces nouveaux index et différentes heuristiques et optimisations sont également proposées pour garantir de bonnes performances lors de l’exécution des requêtes. La seconde contribution s’inscrit dans le domaine de la fouille de données et nous proposons un algorithme efficace pour extraire les multigraphes fréquents. Etant donné l’espace de recherche à considérer, la recherche de motifs fréquents dans des graphes est un problème difficile en fouille de données. Pour parcourir efficacement l’espace de recherche encore plus volumineux pour les multigraphes, nous proposons de nouvelles techniques et méthodes pour le traverser efficacement notamment en éliminant des candidats où détectant à l’avance les motifs non fréquents. Pour chacune de ces propositions de nombreuses expérimentations sont réalisées pour valider à la fois leurs performances et exactitudes en les comparant avec les approches existantes. Finalement, nous proposons une étude de cas sur des jeux de données issues d’images satellites modélisées sous la forme de multigraphe et montrons que l’application de nos propositions permet de mettre en évidence de nouvelles connaissances utiles
With the ever-increasing growth of data and information, extracting the right knowledge has become a real challenge.Further, the advanced applications demand the analysis of complex, interrelated data which cannot be adequately described using a propositional representation. The graph representation is of great interest for the knowledge extraction community, since graphs are versatile data structures and are one of the most general forms of data representation. Among several classes of graphs, textit{multigraphs} have been captivating the attention in the recent times, thanks to their inherent property of succinctly representing the entities by allowing the rich and complex relations among them.The focus of this thesis is streamlined into two themes of knowledge extraction; one being textit{knowledge retrieval}, where we focus on the subgraph query matching aspects in multigraphs, and the other being textit{knowledge discovery}, where we focus on the problem of frequent pattern mining in multigraphs.This thesis makes three main contributions in the field of query matching and data mining.The first contribution, which is very generic, addresses querying subgraphs in multigraphs that yields isomorphic matches, and this problem finds potential applications in the domains of remote sensing, social networks, bioinformatics, chemical informatics. The second contribution, which is focussed on knowledge graphs, addresses querying subgraphs in RDF multigraphs that yield homomorphic matches. In both the contributions, we introduce efficient indexing structures that capture the multiedge information. The query matching processes introduced have been carefully optimized, w.r.t. the time performance and the heuristics employed assure robust performance.The third contribution is in the field of data mining, where we propose an efficient frequent pattern mining algorithm for multigraphs. We observe that multigraphs pose challenges while exploring the search space, and hence we introduce novel optimization techniques and heuristic search methods to swiftly traverse the search space.For each proposed approach, we perform extensive experimental analysis by comparing with the existing state-of-the-art approaches in order to validate the performance and correctness of our approaches.In the end, we perform a case study analysis on a remote sensing dataset. Remote sensing dataset is modelled as a multigraph, and the mining and query matching processes are employed to discover some useful knowledge

APA, Harvard, Vancouver, ISO, and other styles

21

Wang, Changliang. "Continuous subgraph pattern search over graph streams /." View abstract or full-text, 2009. http://library.ust.hk/cgi/db/thesis.pl?CSED%202009%20WANG.

Full text

APA, Harvard, Vancouver, ISO, and other styles

22

Feng, Jing [Verfasser], and Christian [Akademischer Betreuer] Böhm. "Information-theoretic graph mining / Jing Feng. Betreuer: Christian Böhm." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2015. http://d-nb.info/1073825892/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

23

Hossain, Md Shakhawat. "Context Specific Module Mining from Multiple Co-Expression Graph." Thesis, North Dakota State University, 2017. https://hdl.handle.net/10365/28664.

Full text

Abstract:

Gene co-expression networks can be used to associate genes of unknown function with biological processes or to find genes in a specific context, environment responsible for a disease. We provide an overview of methods and tools used to identify such recurrent patterns across multiple networks, can be used to discover biological modules in co-expression networks constructed from gene expression data and we explain how this can be used to identify genes with a regulatory role in disease. However, existing algorithms are very much costly in terms of time and space. As network size or number increases, mining such modules get much more complex. We have developed an efficient approach to mine such recurrent context specific modules from 35 gene networks. This computationally very difficult problem due to the exponential number of patterns was solved non-exponentially.

APA, Harvard, Vancouver, ISO, and other styles

24

Le, Thanh Nam. "Graph representation and mining applied in comic images retrieval." Thesis, La Rochelle, 2019. http://www.theses.fr/2019LAROS008.

Full text

Abstract:

Nous abordons d’abord le problème de la représentation sous forme d’un graphe de l’image et de ses applications en reconnaissance des formes, en mettant l’accent sur les applications de recherche d’images basées sur le contenu (CBIR). Les images utilisées dans cette thèse sont des images de bandes dessinées, qui possèdent des spécificités qui sont des freins pour les méthodes de recherche d’information par le contenu utilisées dans la littérature. Le contenu des bandes dessinées, tel que les objets et les personnages, est complexe. La représentation des personnages, par exemple, peut, d’une case à l’autre, varier énormément en taille et avec différents effets de perspective, selon la situation que l’auteur souhaite retranscrire. Nous proposons ainsi une représentation qui permet d’obtenir des graphes stables et qui conserve des informations structurelles de haut niveau pour les objets d’intérêt dans les images de bandes dessinées. Ensuite, nous étendons le problème d’indexation et d’appariement aux structures de graphes représentant les images d’une bande dessinée et nous l’appliquons au problème de la recherche d’information. Un album de bandes dessinées est ainsi transformé en une base de graphes, chaque graphe correspondant à la description d’une seule case. La stratégie utilisée pour retrouver un objet ou un personnage donné, consiste donc à rechercher des motifs fréquents (ou des sous-structures fréquentes) dans cette base de graphes. Cette étape nécessite de surmonter le problème de non-répétabilité provoqué par les erreurs introduites dans la structure du graphe pendant la phase de construction dues notamment à la variabilité des dessins. Il apparait donc un écart sémantique entre le graphe et le contenu de l’image de bande dessinée
In information retrieval tasks from image databases where content representation is based on graphs, the evaluation of similarity is based both on the appearance of spatial entities and on their mutual relationships. In this thesis we present a novel scheme of Attributed Relational Adjacency Graphs representation and mining, which has been applied in content-based retrieval of comic images. The images used in this thesis are comics images, which have their inherent difficulties in applying content-based retrieval, such as their abstractness, partial occlusion, scale change and shape deformation due to viewpoint changes. We propose a graph representation that yields stable graphs and allow to retain high-level and structural information of objects of interest in comic images. Next, we extend the indexing and matching problem to graph structures representing the comic image, and apply it to the problem of retrieval. The graphs in the graph database representing the whole comic volume are then mined for frequent patterns (or frequent substructures). This step is to overcome the non-repeatability problem caused by the unavoidable errors introduced into the graph structure during the graph construction stage, which ultimately create a semantic gap between the graph and the content of the comic image. Finally, we demonstrate the effectiveness of the system with a database of annotated comic images. Experiments of performance measures is addressed to evaluate the performance of this CBIR system

APA, Harvard, Vancouver, ISO, and other styles

25

Kolli, Lakshmi Priya. "Mining for Frequent Community Structures using Approximate Graph Matching." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1623166375110273.

Full text

APA, Harvard, Vancouver, ISO, and other styles

26

Nabhan, Ahmed Ragab. "Graph Pattern Mining Techniques to Identify Potential Model Organisms." ScholarWorks @ UVM, 2014. http://scholarworks.uvm.edu/graddis/4.

Full text

Abstract:

Recent advances in high throughput technologies have led to an increasing amount of rich and diverse biological data and related literature. Model organisms are classically selected as subjects for studying human disease based on their genotypic and phenotypic features. A significant problem with model organism identification is the determination of characteristic features related to biological processes that can provide insights into the mechanisms underlying diseases. These insights could have a positive impact on the diagnosis and management of diseases and the development of therapeutic drugs. The increased availability of biological data presents an opportunity to develop data mining methods that can address these challenges and help scientists formulate and test data-driven hypotheses. In this dissertation, data mining methods were developed to provide a quantitative approach for the identification of potential model organisms based on underlying features that may be correlated with disease manifestation in humans. The work encompassed three major types of contributions that aimed to address challenges related to inferring information from biological data available from a range of sources. First, new statistical models and algorithms for graph pattern mining were developed and tested on diverse genres of data (biological networks, drug chemical compounds, and text documents). Second, data mining techniques were developed and shown to identify characteristic disease patterns (disease fingerprints), predict potentially new genetic pathways, and facilitate the assessment of organisms as potential disease models. Third, a methodology was developed that combined the application of graph-based models with information derived from natural language processing methods to identify statistically significant patterns in biomedical text. Together, the approaches developed for this dissertation show promise for summarizing the information about biological processes and phenomena associated with organisms broadly and for the potential assessment of their suitability to study human diseases.

APA, Harvard, Vancouver, ISO, and other styles

27

Hu, Chenhui. "Graph-Based Data Mining in Neuroimaging of Neurological Diseases." Thesis, Harvard University, 2016. http://nrs.harvard.edu/urn-3:HUL.InstRepos:33493468.

Full text

Abstract:

The human brain is one of the most complex network systems. Understanding the organization of the brain is crucial for the diagnosis and treatment of brain disorders. Recent advances in neuroimaging enable a non-invasive in vivo observation of the structural and functional behaviors of the human brain. However, it is challenging to find meaningful structures from the high-dimensional imaging data and develop robust frameworks for the detection of brain diseases, especially Alzheimer's disease (AD). To tackle these challenges, we investigate brain imaging data from a network perspective in this dissertation. We first analyze a high-resolution brain network targeting the hippocampal area based on amyloid depositions. We improve the quality of the positron emission tomography (PET) images with a deblurring algorithm and segment the hippocampal area into subregions. We find that the network features clearly indicate the stages of AD. Then, we propose a spectral graph regression model (GRM) to learn brain network structures. The proposed GRM regards the imaging data as smooth signals on an unknown graph. By solving inverse problems, a more physiologically meaningful brain network is estimated in comparison with the state-of-the-art method. We also study the sample complexity of learning graphical models with block structures. Next, we develop a matched signal detection (MSD) theory for signals with intrinsic structures described by weighted graphs. To model different real data, we present hypothesis tests by considering low-pass, variation bounded, and random signals on graphs. Real data evaluations demonstrate the effectiveness of the method. Moreover, we introduce an AD detection framework by exploiting deep learning and brain network structures. The classification accuracy of a targeted autoencoder network combing with network information is much higher than that of traditional approaches. Finally, we present a network diffusion model with sources to localize the origins of AD. By imposing a sparsity constraint on the number of sources, we solve the inverse problem efficiently. In addition, we precisely predict the changes of brain atrophy patterns through this model.
Engineering and Applied Sciences - Engineering Sciences

APA, Harvard, Vancouver, ISO, and other styles

28

Karunaratne, Thashmee M. "Learning predictive models from graph data using pattern mining." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-100713.

Full text

Abstract:

Learning from graphs has become a popular research area due to the ubiquity of graph data representing web pages, molecules, social networks, protein interaction networks etc. However, standard graph learning approaches are often challenged by the computational cost involved in the learning process, due to the richness of the representation. Attempts made to improve their efficiency are often associated with the risk of degrading the performance of the predictive models, creating tradeoffs between the efficiency and effectiveness of the learning. Such a situation is analogous to an optimization problem with two objectives, efficiency and effectiveness, where improving one objective without the other objective being worse off is a better solution, called a Pareto improvement. In this thesis, it is investigated how to improve the efficiency and effectiveness of learning from graph data using pattern mining methods. Two objectives are set where one concerns how to improve the efficiency of pattern mining without reducing the predictive performance of the learning models, and the other objective concerns how to improve predictive performance without increasing the complexity of pattern mining. The employed research method mainly follows a design science approach, including the development and evaluation of artifacts. The contributions of this thesis include a data representation language that can be characterized as a form in between sequences and itemsets, where the graph information is embedded within items. Several studies, each of which look for Pareto improvements in efficiency and effectiveness are conducted using sets of small graphs. Summarizing the findings, some of the proposed methods, namely maximal frequent itemset mining and constraint based itemset mining, result in a dramatically increased efficiency of learning, without decreasing the predictive performance of the resulting models. It is also shown that additional background knowledge can be used to enhance the performance of the predictive models, without increasing the complexity of the graphs.

APA, Harvard, Vancouver, ISO, and other styles

29

Huang, Zan, Wingyan Chung, and Hsinchun Chen. "A Graph Model for E-Commerce Recommender Systems." Wiley Periodicals, Inc, 2004. http://hdl.handle.net/10150/105683.

Full text

Abstract:

Artificial Intelligence Lab, Department of MIS, University of Arizona
Information overload on the Web has created enormous challenges to customers selecting products for online purchases and to online businesses attempting to identify customersâ preferences efficiently. Various recommender systems employing different data representations and recommendation methods are currently used to address these challenges. In this research, we developed a graph model that provides a generic data representation and can support different recommendation methods. To demonstrate its usefulness and flexibility, we developed three recommendation methods: direct retrieval, association mining, and high-degree association retrieval. We used a data set from an online bookstore as our research test-bed. Evaluation results showed that combining product content information and historical customer transaction information achieved more accurate predictions and relevant recommendations than using only collaborative information. However, comparisons among different methods showed that high-degree association retrieval did not perform significantly better than the association mining method or the direct retrieval method in our test-bed.

APA, Harvard, Vancouver, ISO, and other styles

30

AlAbed-AlHaq, Abrar Fawwaz. "APPLYING GRAPH MINING TECHNIQUES TO SOLVE COMPLEX SOFTWARE ENGINEERING PROBLEMS." Kent State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=kent1442986844.

Full text

APA, Harvard, Vancouver, ISO, and other styles

31

Pech, Palacio Manuel Alfredo. "Spatial data modeling and mining using a graph-based representation." Lyon, INSA, 2005. http://theses.insa-lyon.fr/publication/2005ISAL0118/these.pdf.

Full text

Abstract:

Est proposé un unique modèle basé sur des graphes pour représenter des données spatiales, les données non-spatiales et les relations entre les objets spatiaux. Ainsi un graphe est généré à partir de ces trois éléments. On considère que l'outil de fouille de données basé sur les graphes peut découvrir des patterns incluant ces trois éléments, selon trois types de relation spatiale (topologique, cardinale et de distance). Dans notre modèle, les données spatiales, non-spatiales (attributs non-spatiaux), et les relations spatiales représentent une collections d'un ou plusieurs graphes orientés. Les sommets représentent soit les objets spatiaux, soit les relations spatiales entre deux objets spatiaux, ou les attributs non-spatiaux. De plus, un sommet peut représenter soit un attribut, soit le nom d'une relation spatiale. Les noms des attributs peuvent référencer des objets spatiaux ou non-spatiaux. Les arcs orientés sont utilisés pour représenter des informations directionnelles sur les relations entre les éléments, et pour décrire les attributs des objets. On a adopté SUBDUE comme un outil de fouille de graphes. Une caractéristique particulière dite de recouvrement joue un rôle important dans la découverte de patterns. Cependant, elle peut-être implémentée pour recouvrir la totalité du graphe, ou bien ne considérer aucun sommet. En conséquence, nous proposons une troisième piste nommée recouvrement limité, laquelle donne à l'utilisateur la capacité de choisir le recouvrement. On analyse directement trois caractéristiques de l'algorithme proposé, la réduction de l'espace de recherche, la réduction du temps de calcul, et la découverte de patterns grâce à ce type de recouvrement
We propose a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model three relationship types (topological, orientation, and distance relations). In the model the spatial data (i. E. Spatial objects), non-spatial data (i. E. Non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relations between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i. E. Object x touches object y) and to describe attributes about objects (i. E. Object x has attribute z). We propose to adopt the Subdue system, a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. Therefore, we propose a third approach: limited overlap, which gives the user the capability to set over which vertices the overlap will be allowed. We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and specialized overlapping pattern oriented search

APA, Harvard, Vancouver, ISO, and other styles

32

Pech, Palacio Manuel Alfredo Laurini Robert Tchounikine Anne Sol Martínez David. "Spatial data modeling and mining using a graph-based representation." Villeurbanne : Doc'INSA, 2006. http://docinsa.insa-lyon.fr/these/pont.php?id=pech_palacio.

Full text

Abstract:

Thèse doctorat : Informatique : Villeurbanne, INSA : 2005. Thèse doctorat : Informatique : Universidad de las Américas - Puebla : 2005.
Thèse soutenue en co-tutelle. Thèse rédigée en français, en anglais et en espagnol. Titre provenant de l'écran-titre. Bibliogr. p. 174-182.

APA, Harvard, Vancouver, ISO, and other styles

33

Abdlwafa, Alan, and Henrik Edman. "Distributed Graph Mining : A study of performance advantages in distributed data mining paradigms when processing graphs using PageRank on a single node cluster." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166449.

Full text

Abstract:

Distributed data mining is a relatively new area within computer science that is steadily growing, emerging from the demands of being able to gather and process various distributed data by utilising clusters. This report presents the properties of graph structured data and what paradigms to use for efficiently processing the data type, based on comprehensive theoretical studies applied on practical tests performed on a single node cluster. The results in the study showcase the various performance aspects of processing graph data, using different open source paradigm frameworks and amount of shards used on input. A conclusion to be drawn from this study is that there are no real performance advantages to using distributed data mining paradigms specifically developed for graph data on single machines.

APA, Harvard, Vancouver, ISO, and other styles

34

Fang, Chunsheng. "Novel Frameworks for Mining Heterogeneous and Dynamic Networks." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1321369978.

Full text

APA, Harvard, Vancouver, ISO, and other styles

35

Maxwell, Evan Kyle. "Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering." Thesis, Virginia Tech, 2010. http://hdl.handle.net/10919/34008.

Full text

Abstract:

Large graph-based datasets are common to many applications because of the additional structure provided to data by graphs. Patterns extracted from graphs must adhere to these structural properties, making them a more complex class of patterns to identify. The role of graph mining is to efficiently extract these patterns and quantify their significance. In this thesis, we focus on two application domains and demonstrate the design of graph mining algorithms in these domains. First, we investigate the use of graph grammar mining as a tool for diagnosing potential memory leaks from Java heap dumps. Memory leaks occur when memory that is no longer in use fails to be reclaimed, resulting in significant slowdowns, exhaustion of available storage, and eventually application crashes. Analyzing the heap dump of a program is a common strategy used in memory leak diagnosis, but our work is the first to employ a graph mining approach to the problem. Memory leaks accumulate in the heap as classes of subgraphs and the allocation paths from which they emanate can be explored to contextualize the leak source. We show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as Eclipse's MAT. Second, we study the problem of multipartite graph clustering as an approach to database summarization on an integrated biological database. Construction of such databases has become a common theme in biological research, where heterogeneous data is consolidated into a single, centralized repository that provides a structured forum for data analysis. We present an efficient approximation algorithm for identifying clusters that form multipartite cliques spanning multiple database tables. We show that our algorithm computes a lossless compression of the database by summarizing it into a reduced set of biologically meaningful clusters. Our algorithm is applied to data from C. elegans, but we note its applicability to general relational databases.
Master of Science

APA, Harvard, Vancouver, ISO, and other styles

36

Aridhi, Sabeur. "Distributed frequent subgraph mining in the cloud." Phd thesis, Université Blaise Pascal - Clermont-Ferrand II, 2013. http://tel.archives-ouvertes.fr/tel-00951350.

Full text

Abstract:

Recently, graph mining approaches have become very popular, especially in certain domains such as bioinformatics, chemoinformatics and social networks. One of the most challenging tasks in this setting is frequent subgraph discovery. This task has been highly motivated by the tremendously increasing size of existing graph databases. Due to this fact, there is urgent need of efficient and scaling approaches for frequent subgraph discovery especially with the high availability of cloud computing environments. This thesis deals with distributed frequent subgraph mining in the cloud. First, we provide the required material to understand the basic notions of our two research fields, namely graph mining and cloud computing. Then, we present the contributions of this thesis. In the first axis, we propose a novel approach for large-scale subgraph mining, using the MapReduce framework. The proposed approach provides a data partitioning technique that consider data characteristics. It uses the densities of graphs in order to partition the input data. Such a partitioning technique allows a balanced computational loads over the distributed collection of machines and replace the default arbitrary partitioning technique of MapReduce. We experimentally show that our approach decreases significantly the execution time and scales the subgraph discovery process to large graph databases. In the second axis, we address the multi-criteria optimization problem of tuning thresholds related to distributed frequent subgraph mining in cloud computing environments while optimizing the global monetary cost of storing and querying data in the cloud. We define cost models for managing and mining data with a large scale subgraph mining framework over a cloud architecture. We present an experimental validation of the proposed cost models in the case of distributed subgraph mining in the cloud.

APA, Harvard, Vancouver, ISO, and other styles

37

Kumar, Lalit. "Scalable Map-Reduce Algorithms for Mining Formal Concepts and Graph Substructures." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1543996580926452.

Full text

APA, Harvard, Vancouver, ISO, and other styles

38

Wegner, Jörg Kurt. "Data mining und graph mining auf molekularen Graphen - Cheminformatik und molekulare Kodierungen für ADME/Tox-QSAR-Analysen." Berlin Logos, 2006. http://deposit.d-nb.de/cgi-bin/dokserv?id=2865580&prov=M&dok_var=1&dok_ext=htm.

Full text

APA, Harvard, Vancouver, ISO, and other styles

39

Chen, Zhiqian. "Graph Neural Networks: Techniques and Applications." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99848.

Full text

Abstract:

Effective information analysis generally boils down to the geometry of the data represented by a graph. Typical applications include social networks, transportation networks, the spread of epidemic disease, brain's neuronal networks, gene data on biological regulatory networks, telecommunication networks, knowledge graph, which are lying on the non-Euclidean graph domain. To describe the geometric structures, graph matrices such as adjacency matrix or graph Laplacian can be employed to reveal latent patterns. This thesis focuses on the theoretical analysis of graph neural networks and the development of methods for specific applications using graph representation. Four methods are proposed, including rational neural networks for jump graph signal estimation, RemezNet for robust attribute prediction in the graph, ICNet for integrated circuit security, and CNF-Net for dynamic circuit deobfuscation. For the first method, a recent important state-of-art method is the graph convolutional networks (GCN) nicely integrate local vertex features and graph topology in the spectral domain. However, current studies suffer from drawbacks: graph CNNs rely on Chebyshev polynomial approximation which results in oscillatory approximation at jump discontinuities since Chebyshev polynomials require degree $Omega$(poly(1/$epsilon$)) to approximate a jump signal such as $|x|$. To reduce complexity, RatioanlNet is proposed to integrate rational function and neural networks for graph node level embeddings. For the second method, we propose a method for function approximation which suffers from several drawbacks: non-robustness and infeasibility issue; neural networks are incapable of extracting analytical representation; there is no study reported to integrate the superiorities of neural network and Remez. This work proposes a novel neural network model to address the above issues. Specifically, our method utilizes the characterizations of Remez to design objective functions. To avoid the infeasibility issue and deal with the non-robustness, a set of constraints are imposed inspired by the equioscillation theorem of best rational approximation. The third method proposes an approach for circuit security. Circuit obfuscation is a recently proposed defense mechanism to protect digital integrated circuits (ICs) from reverse engineering. Estimating the deobfuscation runtime is a challenging task due to the complexity and heterogeneity of graph-structured circuit, and the unknown and sophisticated mechanisms of the attackers for deobfuscation. To address the above-mentioned challenges, this work proposes the first graph-based approach that predicts the deobfuscation runtime based on graph neural networks. The fourth method proposes a representation for dynamic size of circuit graph. By analyzing SAT attack method, a conjunctive normal form (CNF) bipartite graph is utilized to characterize the complexity of this SAT problem. To overcome the difficulty in capturing the dynamic size of the CNF graph, an energy-based kernel is proposed to aggregate dynamic features.
Doctor of Philosophy
Graph data is pervasive throughout most fields, including pandemic spread network, social network, transportation roads, internet, and chemical structure. Therefore, the applications modeled by graph benefit people's everyday life, and graph mining derives insightful opinions from this complex topology. This paper investigates an emerging technique called graph neural newton (GNNs), which is designed for graph data mining. There are two primary goals of this thesis paper: (1) understanding the GNNs in theory, and (2) apply GNNs for unexplored and values real-world scenarios. For the first goal, we investigate spectral theory and approximation theory, and a unified framework is proposed to summarize most GNNs. This direction provides a possibility that existing or newly proposed works can be compared, and the actual process can be measured. Specifically, this result demonstrates that most GNNs are either an approximation for a function of graph adjacency matrix or a function of eigenvalues. Different types of approximations are analyzed in terms of physical meaning, and the advantages and disadvantages are offered. Beyond that, we proposed a new optimization for a highly accurate but low efficient approximation. Evaluation of synthetic data proves its theoretical power, and the tests on two transportation networks show its potentials in real-world graphs. For the second goal, the circuit is selected as a novel application since it is crucial, but there are few works. Specifically, we focus on a security problem, a high-value real-world problem in industry companies such as Nvidia, Apple, AMD, etc. This problem is defined as a circuit graph as apply GNN to learn the representation regarding the prediction target such as attach runtime. Experiment on several benchmark circuits shows its superiority on effectiveness and efficacy compared with competitive baselines. This paper provides exploration in theory and application with GNNs, which shows a promising direction for graph mining tasks. Its potentials also provide a wide range of innovations in graph-based problems.

APA, Harvard, Vancouver, ISO, and other styles

40

Boothe, Peter Mattison 1978. "Measuring the Internet AS graph and its evolution." Thesis, University of Oregon, 2009. http://hdl.handle.net/1794/10328.

Full text

Abstract:

xiv, 183 p. : ill. A print copy of this thesis is available through the UO Libraries. Search the library catalog for the location and call number.
As the Internet has evolved over time, the interconnection patterns of the members of this "network of networks" have changed. Can we characterize those changes? Have those changes been good or bad? What does "good" mean in this context? Has market power been centralizing or decentralizing? How certain can we be of our answer? What are the limitations of our data? These are the questions which motivate this dissertation. In this dissertation, we answer these questions and more by carefully taking a long-term quantitative study of the evolution of the topology of the Internet's AS graph. In order to do this study, we spend most of the dissertation developing methods of data processing and data analysis all informed by ideas from networking, data mining, graph theory, and statistics. The contributions are both theoretical and practical. The theoretical contributions include an in-depth analysis of the complexity of AS graph measurement as well as of the difficulty of reconstructing the AS graph from available data. The practical contributions include the design of graph metrics to capture properties of interest, usable approximation algorithms for several AS graph analysis methods, and an analysis of the evolution of the AS graph over time. It is our hope that these methods may prove useful in other domains, and that the conclusions about the evolution of the Internet topology prove useful for Internet operators, network researchers, policy makers, and others.
Committee in charge: Andrzej Proskurowski, Chairperson, Computer & Information Science; Arthur Farley, Member, Computer & Information Science; Jun Li, Member, Computer & Information Science; Anne van den Nouweland, Outside Member, Economics

APA, Harvard, Vancouver, ISO, and other styles

41

Hassanzadeh, Reza. "Anomaly detection in online social networks : using data-mining techniques and fuzzy logic." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/78679/1/Reza_Hassanzadeh_Thesis.pdf.

Full text

Abstract:

This research is a step forward in improving the accuracy of detecting anomaly in a data graph representing connectivity between people in an online social network. The proposed hybrid methods are based on fuzzy machine learning techniques utilising different types of structural input features. The methods are presented within a multi-layered framework which provides the full requirements needed for finding anomalies in data graphs generated from online social networks, including data modelling and analysis, labelling, and evaluation.

APA, Harvard, Vancouver, ISO, and other styles

42

Hohman, Elizabeth Leeds. "A dynamic graph model for representing streaming text documents." Fairfax, VA : George Mason University, 2008. http://hdl.handle.net/1920/3062.

Full text

Abstract:

Thesis (Ph.D.)--George Mason University, 2008.
Vita: p. 141. Thesis director: Edward J. Wegman. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Sciences and Informatics. Title from PDF t.p. (viewed July 3, 2008). Includes bibliographical references (p. 105-110). Also issued in print.

APA, Harvard, Vancouver, ISO, and other styles

43

Cakmak, Ali. "Mining Metabolic Networks and Biomedical Literature." Case Western Reserve University School of Graduate Studies / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=case1223490345.

Full text

APA, Harvard, Vancouver, ISO, and other styles

44

Hirao, Eiji, Takeshi Furuhashi, Tomohiro Yoshikawa, and Daisuke Kobayashi. "A Study of Visualization Method with HK Graph Using Concept Words." 日本知能情報ファジィ学会, 2010. http://hdl.handle.net/2237/20687.

Full text

Abstract:

Session ID: TH-B1-3
SCIS & ISIS 2010, Joint 5th International Conference on Soft Computing and Intelligent Systems and 11th International Symposium on Advanced Intelligent Systems. December 8-12, 2010, Okayama Convention Center, Okayama, Japan

APA, Harvard, Vancouver, ISO, and other styles

45

Gawronski, Alexander. "RiboFSM: Frequent Subgraph Mining for the Discovery of RNA Structures and Interactions." Thèse, Université d'Ottawa / University of Ottawa, 2013. http://hdl.handle.net/10393/26296.

Full text

Abstract:

Frequent subgraph mining is a useful method for extracting biologically relevant patterns from a set of graphs or a single large graph. Here, the graph represents all possible RNA structures and interactions. Patterns that are significantly more frequent in this graph over a random graph are extracted. We hypothesize that these patterns are most likely to represent a biological mechanisms. The graph representation used is a directed dual graph, extended to handle intermolecular interactions. The graph is sampled for subgraphs, which are labeled using a canonical labeling method and counted. The resulting patterns are compared to those created from a randomized dataset and scored. The algorithm was applied to the mitochondrial genome of the kinetoplastid species Trypanosoma brucei. This species has a unique RNA editing mechanism that has been well studied, making it a good model organism to test RiboFSM. The most significant patterns contain two stem-loops, indicative of gRNA, and represent interactions of these structures with target mRNA.

APA, Harvard, Vancouver, ISO, and other styles

46

Frisoni, Giacomo. "A New Unsupervised Methodology of Descriptive Text Mining for Knowledge Graph Learning." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20471/.

Full text

Abstract:

Rare diseases pose particular challenges to patients, families, caregivers, clinicians and researchers. Currently, more than 6000 rare diseases are described (but up to 7000 are estimated) and more than 350 million people live with them (5\% of the world population). Due to the scarce availability of information and their disintegration, in recent years we are witnessing strong growth of patient communities on social platforms such as Facebook. The work presented in this thesis is intended to extract knowledge from the large availability of unstructured text generated by the users over time, in order to represent it in an organized way and to make logical reasoning above. Starting from the awareness of the need to integrate different methodologies in complex domains, the research shows a combined use of Text Mining and Semantic Web techniques, taking Esophageal Achalasia as a case study. In particular, an ontology is created to extend ORDO and introduce a patient-centered vision into the world of linked data. The significance of this development is that it potentially constitutes the basis of a project that can allow rapid access to many high-value information (in topics such as symptomatology, epidemiology, diagnosis, treatments, drugs, nutrition, lifestyle), responding to patients' questions and providing them with an additional tool for decision making, minimizing costs through the automatic retrieval of these data and increasing the productivity of investigators.

APA, Harvard, Vancouver, ISO, and other styles

47

Viaño, Iglesias Adrian. "Graph representation of documents content and its suitability for text mining tasks." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2011. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-14133.

Full text

Abstract:

Association rules mining is one of the the most relevant techniques of data mining. It has been also applied in the domain of text mining, but the results are hard to interpret. In this matter, an Association Network is an structure to represent as a graph the relationships mined as association rules. The goal of this project was to provide a methodology to build association networks from concepts extracted from a collection of documents, as well as the study of the mathematical properties of the association networks to prove that they are not random graphs and that they exhibit small-world properties.

APA, Harvard, Vancouver, ISO, and other styles

48

Shaban, Khaled. "A Semantic Graph Model for Text Representation and Matching in Document Mining." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2860.

Full text

Abstract:

The explosive growth in the number of documents produced daily necessitates the development of effective alternatives to explore, analyze, and discover knowledge from documents. Document mining research work has emerged to devise automated means to discover and analyze useful information from documents. This work has been mainly concerned with constructing text representation models, developing distance measures to estimate similarities between documents, and utilizing that in mining processes such as document clustering, document classification, information retrieval, information filtering, and information extraction.

Conventional text representation methodologies consider documents as bags of words and ignore the meanings and ideas their authors want to convey. It is this deficiency that causes similarity measures to fail to perceive contextual similarity of text passages due to the variation of the words the passages contain, or at least perceive contextually dissimilar text passages as being similar because of the resemblance of words the passages have.

This thesis presents a new paradigm for mining documents by exploiting semantic information of their texts. A formal semantic representation of linguistic inputs is introduced and utilized to build a semantic representation scheme for documents. The representation scheme is constructed through accumulation of syntactic and semantic analysis outputs. A new distance measure is developed to determine the similarities between contents of documents. The measure is based on inexact matching of attributed trees. It involves the computation of all distinct similarity common sub-trees, and can be computed efficiently. It is believed that the proposed representation scheme along with the proposed similarity measure will enable more effective document mining processes.

The proposed techniques to mine documents were implemented as vital components in a mining system. A case study of semantic document clustering is presented to demonstrate the working and the efficacy of the framework. Experimental work is reported, and its results are presented and analyzed.

APA, Harvard, Vancouver, ISO, and other styles

49

Zulfiqar, Omer. "Detecting Public Transit Service Disruptions Using Social Media Mining and Graph Convolution." Thesis, Virginia Tech, 2021. http://hdl.handle.net/10919/103745.

Full text

Abstract:

In recent years we have seen an increase in the number of public transit service disruptions due to aging infrastructure, system failures and the regular need for maintenance. With the fleeting growth in the usage of these transit networks there has been an increase in the need for the timely detection of such disruptions. Any types of disruptions in these transit networks can lead to delays which can have major implications on the daily passengers. Most current disruption detection systems either do not operate in real-time or lack transit network coverage. The theme of this thesis was to leverage Twitter data to help in earlier detection of service disruptions. This work involves developing a pure Data Mining approach and a couple different approaches that use Graph Neural Networks to identify transit disruption related information in Tweets from a live Twitter stream related to the Washington Metropolitan Area Transit Authority (WMATA) metro system. After developing three different models, a Dynamic Query Expansion model, a Tweet-GCN and a Tweet-Level GCN to represent the data corpus we performed various experiments and benchmark evaluations against other existing baseline models, to justify the efficacy of our approaches. After seeing astounding results across both the Tweet-GCN and Tweet-Level GCN, with an average accuracy of approximately 87.3% and 89.9% we can conclude that not only are these two graph neural models superior for basic NLP text classification, but they also outperform other models in identifying transit disruptions.
Master of Science
Millions of people worldwide rely on public transit networks for their daily commutes and day to day movements. With the growth in the number of people using the service, there has been an increase in the number of daily passengers affected by service disruptions. This thesis and research involves proposing and developing three different approaches to help aid in the timely detection of these disruptions. In this work we have developed a pure data mining approach along with two deep learning models using neural networks and live data from Twitter to identify these disruptions. The data mining approach uses a set of dirsuption related input keywords to identify similar keywords within the live Twitter data. By collecting historical data we were able to create deep learning models that represent the vocabulary from the disruptions related Tweets in the form of a graph. A graph is a collection of data values where the data points are connected to one another based on their relationships. A longer chain of connection between two words defines a weak relationship, a shorter chain defines a stronger relationship. In our graph, words with similar contextual meanings are connected to each other over shorter distances, compared to words with different meanings. At the end we use a neural network as a classifier to scan this graph to learn the semantic relationships within our data. Afterwards, this learned information can be used to accurately classify the disruption related Tweets within a pool of random Tweets. Once all the proposed approaches have been developed, a benchmark evaluation is performed against other existing text classification techniques, to justify the effectiveness of the approaches. The final results indicate that the proposed graph based models achieved a higher accuracy, compared to the data mining model, and also outperformed all the other baseline models. Our Tweet-Level GCN had the highest accuracy of 89.9%.

APA, Harvard, Vancouver, ISO, and other styles

50

Cederquist, Aaron. "Frequent Pattern Mining among Weighted and Directed Graphs." Case Western Reserve University School of Graduate Studies / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=case1228328123.

Full text

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Graph Mining'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles