
Graph embedding techniques, applications, and performance: A survey - Article Reading Notes

Graph embedding

An embedding maps each node to a low-dimensional feature vector while trying to preserve the connection strengths between vertices; the idea is to keep connected nodes close to each other in the vector space.

Challenges

1) Features to be preserved: a "good" vector representation of the nodes should preserve the structure of the graph and the connections between individual nodes.
2) Scalability: embedding methods should scale to real networks, which often have millions of nodes and edges.
3) Dimensionality: the number of embedding dimensions must be chosen, trading reconstruction quality against space and downstream task performance.

Edge weights $s_{ij}$ are also called first-order proximities between nodes $v_i$ and $v_j$, since they are the first and foremost measures of similarity between two nodes.

Let $s_i = [s_{i1}, \dots, s_{in}]$ denote the vector of first-order proximities between $v_i$ and the other nodes. The second-order proximity between $v_i$ and $v_j$ is then determined by the similarity of $s_i$ and $s_j$: it compares the neighborhoods of the two nodes and treats them as similar if their neighborhoods are similar.
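As a concrete illustration, here is a minimal Python sketch on a hypothetical 4-node graph: first-order proximities are read straight off the adjacency matrix, and second-order proximity is computed as the cosine similarity of the neighborhood vectors (cosine is one common choice; the definition above only requires some similarity measure over $s_i$ and $s_j$):

```python
import numpy as np

# Weighted adjacency matrix of a toy 4-node graph; entry s_ij is the
# first-order proximity between nodes v_i and v_j.
S = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

def second_order_proximity(S, i, j):
    """Similarity of the neighborhood vectors s_i and s_j (cosine here)."""
    si, sj = S[i], S[j]
    denom = np.linalg.norm(si) * np.linalg.norm(sj)
    return float(si @ sj) / denom if denom > 0 else 0.0

# Nodes 0 and 1 share neighbor 2, so their second-order proximity is
# non-zero regardless of whether they are directly connected.
print(second_order_proximity(S, 0, 1))  # 0.5
```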

Categories of embeddings generation approaches

(1) Factorization based, (2) Random Walk based, and (3) Deep Learning based. 

1. Factorization-based algorithms represent the connections between nodes in the form of a matrix and factorize this matrix to obtain the embedding.

Graph Factorization was the first method to obtain a graph embedding in O(|E|) time.
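A minimal sketch of the factorization idea, using a truncated SVD of the proximity matrix as the simplest stand-in. Graph Factorization itself instead minimizes a regularized squared loss over the observed edges by gradient descent, touching only existing edges, which is what yields the O(|E|) cost:

```python
import numpy as np

def svd_embedding(S, d):
    """Embed nodes by factorizing S ~ U.Sigma.V^T and keeping the top-d
    singular directions. Graph Factorization instead minimizes
    sum over (i,j) in E of (s_ij - <y_i, y_j>)^2 plus a regularizer."""
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    return U[:, :d] * np.sqrt(sigma[:d])   # one row per node

S = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
Y = svd_embedding(S, d=2)
print(Y.shape)  # (4, 2): a 2-dimensional vector per node
```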

Several other methods, originally developed for dimensionality reduction of high-dimensional data, are also capable of performing graph embedding.

2. Random walks have been used to approximate many properties in the graph including node centrality and similarity. They are especially useful when one can either only partially observe the graph, or the graph is too large to measure in its entirety. Embedding techniques using random walks on graphs to obtain node representations have been proposed: DeepWalk and node2vec are two examples.

DeepWalk preserves higher-order proximity between nodes by maximizing the probability of observing the last $k$ nodes and the next $k$ nodes in the random walk centered at $v_i$. Similar to DeepWalk, node2vec preserves higher-order proximity by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks. The crucial difference from DeepWalk is that node2vec employs biased random walks that provide a trade-off between breadth-first (BFS) and depth-first (DFS) graph searches, and hence produces higher-quality and more informative embeddings than DeepWalk.
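A DeepWalk-style sketch under hypothetical hyperparameters (number of walks, walk length, window size), using gensim's skip-gram Word2Vec for the probability-maximization step; node2vec would only change the step distribution inside the walk via its p/q bias parameters:

```python
import random
import networkx as nx
from gensim.models import Word2Vec  # gensim 4.x API

def random_walks(G, num_walks=10, walk_length=20, seed=0):
    """Truncated unbiased random walks, as in DeepWalk."""
    rng = random.Random(seed)
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])  # Word2Vec expects tokens
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Skip-gram (sg=1) with a window of k nodes on each side, mirroring
# DeepWalk's objective of predicting the 2k surrounding nodes.
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1)
print(model.wv[str(0)][:5])  # first components of node 0's embedding
```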

3. The growing research on deep learning has led to a deluge of deep neural network based methods applied to graphs. Deep autoencoders have been used for dimensionality reduction due to their ability to model the non-linear structure in the data.
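A minimal autoencoder sketch in PyTorch (hypothetical sizes, with a random stand-in adjacency matrix) that treats each node's neighborhood row as the input and its bottleneck activation as the embedding; methods such as SDNE add further terms (e.g., a first-order Laplacian penalty) on top of this plain reconstruction loss:

```python
import torch
import torch.nn as nn

# Each node's input is its adjacency (neighborhood) row; the bottleneck
# activation is the node embedding.
n, d = 34, 16
S = torch.rand(n, n).round()            # stand-in adjacency matrix

encoder = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, d))
decoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(200):
    Y = encoder(S)                      # embeddings, one row per node
    loss = ((decoder(Y) - S) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(encoder(S).shape)                 # (34, 16)
```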

Factorization-based methods are not capable of learning an arbitrary function, e.g., to explain network connectivity; thus, unless explicitly included in their objective function, they cannot learn structural equivalence. In random walk based methods, the mixture of equivalences can be controlled to a certain extent by varying the random walk parameters. Deep learning methods can model a wide range of functions, following the universal approximation theorem: given enough parameters, they can learn the mix of community and structural equivalence and embed the nodes so that the reconstruction error is minimized.

Graph simplification

Network compression (a.k.a. graph simplification): for a graph G, a compression G* is defined with a smaller number of edges. The goal is to store the network more efficiently and to run graph analysis algorithms faster.

Similar to these representations, graph embedding can also be interpreted as a summarization of the graph ... a low-dimensional representation for each node (on the order of 100s of dimensions) suffices to reconstruct the graph with high precision.

As an embedding represents a graph in a vector space, dimensionality reduction techniques like Principal Component Analysis (PCA) can be applied to the node vectors, for example to project them to two dimensions for visualization.
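A sketch of that projection step, on a hypothetical embedding matrix (t-SNE is the other common choice for visualization):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical embedding matrix: one 128-dimensional vector per node.
Y = np.random.rand(100, 128)

# Project to 2-D for plotting.
Y2d = PCA(n_components=2).fit_transform(Y)
print(Y2d.shape)  # (100, 2)
```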

Graph analytic tasks

(a) node classification: methods which use random walks to propagate the labels and methods which extract features from nodes and apply classifiers on them,
(b) link prediction: similarity based methods, maximum likelihood models, and probabilistic models,
(c) clustering: attribute based models and methods which directly maximize (resp., minimize) the inter-cluster (resp., intra-cluster) distances, and
(d) visualization.

Graph clustering (a.k.a. network partitioning) can be of two types: (a) structure based and (b) attribute based clustering. The former can be further divided into two categories, namely community based and structurally equivalent clustering. Community-based methods aim to find dense subgraphs with a high number of intra-cluster edges and a low number of inter-cluster edges. Structural equivalence clustering, on the contrary, is designed to identify nodes with similar roles (like bridges and outliers). Attribute-based methods utilize node labels, in addition to observed links, to cluster nodes.
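In the embedding setting, structure-based clustering often reduces to running an off-the-shelf clustering algorithm on the node vectors. A minimal sketch with k-means on a hypothetical embedding matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical embeddings; community-style clusters emerge from running
# a generic clustering algorithm in the embedding space.
Y = np.random.rand(100, 32)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(Y)
print(np.bincount(labels))  # cluster sizes
```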

Link prediction refers to the task of predicting either missing interactions or links that may appear in the future in an evolving network. Link prediction is pervasive in biological network analysis, where verifying the existence of links between nodes requires costly experimental tests. Limiting the experiments to links ordered by presence likelihood has been shown to be very cost-effective.
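A minimal sketch of embedding-based link prediction, scoring candidate node pairs by the inner product of their vectors (one common scoring choice; the data here is hypothetical):

```python
import numpy as np
from itertools import combinations

# Hypothetical embeddings: one vector per node (e.g., from DeepWalk).
Y = np.random.rand(50, 32)
existing = {(0, 1), (1, 2)}  # edges already observed

# Score every non-edge by the dot product of its endpoint embeddings;
# the top-ranked pairs are the most likely missing/future links.
candidates = [(i, j) for i, j in combinations(range(len(Y)), 2)
              if (i, j) not in existing]
ranked = sorted(candidates, key=lambda e: -float(Y[e[0]] @ Y[e[1]]))
print(ranked[:5])
```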

Often in networks, a fraction of nodes are labeled. In social networks, labels may indicate interests, beliefs, or demographics. In language networks, a document may be labeled with topics or keywords, whereas the labels of entities in biology networks may be based on functionality. Due to various factors, labels may be unknown for large fractions of nodes. For example, in social networks many users do not provide their demographic information due to privacy concerns. Missing labels can be inferred using the labeled nodes and the links in the network. The task of predicting these missing labels is also known as node classification.

Bhagat et al. survey the methods used in the literature for this task. They classify the approaches into two categories: feature extraction based and random walk based. Feature-based models generate features for nodes based on their neighborhood and local network statistics and then apply a classifier, such as Logistic Regression or Naive Bayes, to predict the labels. Random walk based models propagate the labels with random walks.

Embeddings can be interpreted as automatically extracted node features based on network structure and thus fall into the first category. Recent work has evaluated the predictive power of embeddings on various information networks, including language, social, biology, and collaboration graphs, and shows that embeddings can predict missing labels with high precision.
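A sketch of this pipeline on hypothetical data: embeddings serve as the automatically extracted features, and a standard classifier such as Logistic Regression predicts the missing labels from the labeled fraction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: one embedding per node, labels known only for a
# fraction of nodes; the classifier infers the rest.
Y = np.random.rand(200, 32)
labels = np.random.randint(0, 4, 200)

X_tr, X_te, y_tr, y_te = train_test_split(Y, labels, test_size=0.5)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```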


PROTEIN-PROTEIN INTERACTIONS (PPI): a network of biological interactions between proteins in humans, with 3,890 nodes and 38,739 edges.

B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D.H. Lackner, J. Bähler, V. Wood, et al., "The BioGRID interaction database: 2008 update," Nucleic Acids Res., 36 (suppl 1) (2008), pp. D637-D640.


The authors also provide an open-source Python library named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which offers all the presented algorithms within a unified interface to foster and facilitate research on the topic.
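A usage sketch adapted from the GEM repository README; the class and argument names below follow that README and should be treated as assumptions to verify against the current repo:

```python
# Usage sketch adapted from the GEM README; keyword arguments are
# assumptions to check against the repository's current API.
import networkx as nx
from gem.embedding.hope import HOPE

G = nx.karate_club_graph().to_directed()
model = HOPE(d=4, beta=0.01)            # d: embedding dimension
Y, t = model.learn_embedding(graph=G, edge_f=None,
                             is_weighted=True, no_python=True)
print(Y.shape)                          # one d-dimensional vector per node
```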
