Graph embedding techniques, applications, and performance: A survey - Article Reading Notes

Graph embedding

An embedding therefore maps each node to a low-dimensional feature vector while trying to preserve the connection strengths between vertices. The underlying idea is to keep connected nodes close to each other in the vector space.

Challenges

1) Features to be preserved: a "good" vector representation of nodes should preserve the structure of the graph and the connections between individual nodes.
2) Scalability: the approach should handle large, real-world graphs.
3) Dimensionality: choosing the right embedding dimension is hard and often application-specific.

Edge weights s_ij are also called first-order proximities between nodes v_i and v_j, since they are the first and foremost measures of similarity between two nodes.

Let s_i = [s_i1, ..., s_in] denote the first-order proximity between v_i and the other nodes. Then, the second-order proximity between v_i and v_j is determined by the similarity of s_i and s_j: it compares the neighborhoods of the two nodes and treats them as similar if they have similar neighborhoods.
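To make the two notions concrete, here is a minimal Python sketch (numpy and networkx on a toy weighted graph); cosine similarity between the vectors s_i and s_j is used as one common choice for second-order proximity, which the survey leaves abstract:

import numpy as np
import networkx as nx

# Toy undirected weighted graph; the edge weights play the role of s_ij.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 0.5)])

# First-order proximity: the weighted adjacency matrix itself.
S = nx.to_numpy_array(G)

def second_order(S, i, j):
    # Cosine similarity between the proximity vectors s_i and s_j.
    si, sj = S[i], S[j]
    return si @ sj / (np.linalg.norm(si) * np.linalg.norm(sj))

print("first-order s_01 =", S[0, 1])
print("second-order proximity of (0, 1) =", second_order(S, 0, 1))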

Categories of embedding generation approaches

(1) Factorization based, (2) Random Walk based, and (3) Deep Learning based. 

1. Factorization based algorithms represent the connections between nodes in the form of a matrix and factorize this matrix to obtain the embedding.

Graph Factorization was the first method to obtain a graph embedding in O(|E|) time.

Several other methods, originally developed for dimensionality reduction of high-dimensional data, are also capable of performing graph embedding.
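A minimal numpy sketch of the factorization idea, in the spirit of Graph Factorization: learn an embedding matrix Y such that the inner product of y_i and y_j reconstructs the observed edge weights; each epoch touches every edge once, hence the O(|E|) cost (names and hyperparameters here are illustrative, not the paper's):

import numpy as np

def graph_factorization(edges, n, d=2, lr=0.01, reg=0.1, epochs=200, seed=0):
    # Learn Y (n x d) so that Y[i] @ Y[j] approximates the weight w of
    # each observed edge (i, j), with L2 regularization on the embeddings.
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(n, d))
    for _ in range(epochs):
        for i, j, w in edges:
            err = w - Y[i] @ Y[j]
            gi = -err * Y[j] + reg * Y[i]
            gj = -err * Y[i] + reg * Y[j]
            Y[i] -= lr * gi
            Y[j] -= lr * gj
    return Y

edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 0.5)]
Y = graph_factorization(edges, n=4)
print(Y[0] @ Y[1])  # should approach the edge weight 1.0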

2. Random walks have been used to approximate many properties in the graph including node centrality and similarity. They are especially useful when one can either only partially observe the graph, or the graph is too large to measure in its entirety. Embedding techniques using random walks on graphs to obtain node representations have been proposed: DeepWalk and node2vec are two examples.

DeepWalk preserves higher-order proximity between nodes by maximizing the probability of observing the last k nodes and the next k nodes in the random walk centered at v_i. Similar to DeepWalk, node2vec preserves higher-order proximity between nodes by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks. The crucial difference from DeepWalk is that node2vec employs biased random walks that provide a trade-off between breadth-first (BFS) and depth-first (DFS) graph searches, and hence produces higher-quality and more informative embeddings than DeepWalk.
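A hedged sketch of the DeepWalk pipeline in Python: generate truncated uniform random walks and feed them to a skip-gram model (gensim's Word2Vec, assuming the 4.x API); node2vec would change only how the next step of each walk is sampled:

import random
import networkx as nx
from gensim.models import Word2Vec  # pip install gensim

def random_walks(G, num_walks=10, walk_len=20, seed=0):
    # Uniform random walks starting from every node (DeepWalk-style).
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])  # Word2Vec expects string tokens
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Skip-gram over the walks; window=5 means k=5 context nodes on each side.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
emb0 = model.wv["0"]  # 64-dimensional embedding of node 0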

3. The growing research on deep learning has led to a deluge of deep neural network based methods applied to graphs. Deep autoencoders have been used for dimensionality reduction due to their ability to model the non-linear structure in data.
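As an illustration, a minimal PyTorch autoencoder that compresses each node's adjacency row through a low-dimensional bottleneck and minimizes the reconstruction error, loosely in the spirit of methods such as SDNE (the architecture and loss are deliberately simplified):

import torch
import torch.nn as nn
import networkx as nx

G = nx.karate_club_graph()
S = torch.tensor(nx.to_numpy_array(G), dtype=torch.float32)  # n x n adjacency
n, d = S.shape[0], 16

# Encoder compresses each adjacency row; decoder reconstructs it.
encoder = nn.Sequential(nn.Linear(n, 64), nn.ReLU(), nn.Linear(64, d))
decoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    Y = encoder(S)                         # n x d node embeddings
    loss = ((decoder(Y) - S) ** 2).mean()  # reconstruction error
    loss.backward()
    opt.step()

embeddings = encoder(S).detach()  # final low-dimensional representations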

Factorization-based methods are not capable of learning an arbitrary function, e.g., to explain network connectivity; thus, unless explicitly included in their objective function, they cannot learn structural equivalence. In random walk based methods, the mixture of equivalences can be controlled to a certain extent by varying the random walk parameters. Deep learning methods can model a wide range of functions, following the universal approximation theorem: given enough parameters, they can learn a mix of community and structural equivalence and embed the nodes so that the reconstruction error is minimized.

Graph simplification

Network compression (a.k.a. graph simplification): for a graph G, a compression G* is defined that has a smaller number of edges. The goal is to store the network more efficiently and run graph analysis algorithms faster.

Similar to these representations, graph embedding can also be interpreted as a summarization of the graph ... a low-dimensional representation for each node (on the order of hundreds of dimensions) suffices to reconstruct the graph with high precision.

As an embedding represents a graph in a vector space, dimensionality reduction techniques like Principal Component Analysis (PCA) can themselves be viewed as embedding methods when applied to the graph's adjacency or similarity matrix.
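For instance, a minimal scikit-learn sketch that treats each row of the adjacency matrix as a node's feature vector and projects it with PCA (an illustrative baseline, not a method the survey advocates):

import networkx as nx
from sklearn.decomposition import PCA

G = nx.karate_club_graph()
S = nx.to_numpy_array(G)                    # each row: a node's connection profile
emb = PCA(n_components=8).fit_transform(S)  # n x 8 node embeddings
print(emb.shape)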

Graph analytic tasks

(a) node classification: methods which use random walks to propagate the labels and methods which extract features from nodes and apply classifiers on them,
(b) link prediction: similarity based methods, maximum likelihood models, and probabilistic models,
(c) clustering: attribute based models and methods which directly maximize (resp., minimize) the inter-cluster (resp., intra-cluster) distances, and
(d) visualization.

Graph clustering (a.k.a. network partitioning) can be of two types: (a) structure based, and (b) attribute based clustering. The former can be further divided into two categories, namely community based and structurally equivalent clustering. Community-based methods aim to find dense subgraphs with a high number of intra-cluster edges and a low number of inter-cluster edges. Structural equivalence clustering, on the contrary, is designed to identify nodes with similar roles (like bridges and outliers). Attribute based methods utilize node labels, in addition to observed links, to cluster nodes.
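A hedged sketch of community-style clustering in embedding space: embed the nodes, then run k-means so that nodes with similar connection profiles land in the same cluster (PCA stands in for any embedding method here):

import networkx as nx
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

G = nx.karate_club_graph()
emb = PCA(n_components=8).fit_transform(nx.to_numpy_array(G))

# Nearby embedding vectors end up in the same cluster, which corresponds
# to community-based rather than structural-equivalence clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)  # cluster assignment per node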

Link prediction refers to the task of predicting either missing interactions or links that may appear in the future in an evolving network. Link prediction is pervasive in biological network analysis, where verifying the existence of links between nodes requires costly experimental tests. Limiting the experiments to links ordered by presence likelihood has been shown to be very cost effective.
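A hedged sketch of embedding-based link prediction: hide a few edges, embed the remaining graph, and rank candidate node pairs by the inner product of their embeddings (one common scoring choice among several; PCA again stands in for any embedding method):

import networkx as nx
from sklearn.decomposition import PCA

G = nx.karate_club_graph()
G.remove_edges_from([(0, 8), (2, 32)])  # pretend these edges are unobserved

emb = PCA(n_components=8).fit_transform(nx.to_numpy_array(G))

def score(u, v):
    # Likelihood proxy for an edge (u, v): inner product of embeddings.
    return emb[u] @ emb[v]

# Rank all absent pairs by score; the hidden true edges should rank high.
candidates = sorted(nx.non_edges(G), key=lambda e: -score(*e))
print(candidates[:5])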

Often in networks, a fraction of nodes are labeled. In social networks, labels may indicate interests, beliefs, or demographics. In language networks, a document may be labeled with topics or keywords, whereas the labels of entities in biology networks may be based on functionality. Due to various factors, labels may be unknown for large fractions of nodes. For example, in social networks many users do not provide their demographic information due to privacy concerns. Missing labels can be inferred using the labeled nodes and the links in the network. The task of predicting these missing labels is also known as node classification.

Bhagat et al. survey the methods used in the literature for this task. They classify the approaches into two categories, i.e., feature extraction based and random walk based. Feature-based models generate features for nodes based on their neighborhood and local network statistics and then apply a classifier like Logistic Regression or Naive Bayes to predict the labels. Random walk based models propagate the labels with random walks.

Embeddings can be interpreted as automatically extracted node features based on network structure and thus fall into the first category. Recent work has evaluated the predictive power of embeddings on various information networks including language, social, biology and collaboration graphs. They show that embeddings can predict missing labels with high precision.
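A hedged sketch of that pipeline on a toy graph: embed the nodes, then train a logistic regression on half of the labeled nodes and predict the "missing" labels of the rest (the karate-club faction attribute serves as the label; PCA stands in for any embedding method):

import networkx as nx
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

G = nx.karate_club_graph()
emb = PCA(n_components=8).fit_transform(nx.to_numpy_array(G))
y = [G.nodes[v]["club"] for v in G.nodes()]  # faction label per node

# Train on half the labels, infer the other half from the embeddings.
X_tr, X_te, y_tr, y_te = train_test_split(emb, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy on held-out nodes:", clf.score(X_te, y_te))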


PROTEIN-PROTEIN INTERACTIONS (PPI): This is a network of biological interactions between proteins in humans. This network has 3890 nodes and 38,739 edges.

B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D.H. Lackner, J. Bähler, V. Wood, et al., "The BioGRID interaction database: 2008 update," Nucleic Acids Res., 36 (suppl 1) (2008), pp. D637-D640.


The authors also released an open-source Python library named GEM (Graph Embedding Methods, available at https://github.com/palash1992/GEM), which provides all presented algorithms within a unified interface to foster and facilitate research on the topic.
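A usage sketch following the pattern shown in GEM's README (exact signatures may vary between versions, so treat this as illustrative rather than authoritative):

import networkx as nx
# pip install git+https://github.com/palash1992/GEM.git
from gem.embedding.gf import GraphFactorization

G = nx.karate_club_graph().to_directed()
model = GraphFactorization(d=2, max_iter=1000, eta=1e-4, regu=1.0)
Y, t = model.learn_embedding(graph=G, edge_f=None, is_weighted=False, no_python=True)
print(Y.shape)  # (number of nodes, 2)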
