Source: https://www.cs.mcgill.ca/~wlh/grl_book/files/GRL_Book-Chapter_2-Background.pdf
@article{hamilton2020grl,
  author    = {Hamilton, William L.},
  title     = {Graph Representation Learning},
  journal   = {Synthesis Lectures on Artificial Intelligence and Machine Learning},
  volume    = {14},
  number    = {3},
  pages     = {1--159},
  year      = {2020},
  publisher = {Morgan and Claypool}
}
What kinds of methods were used for machine learning on graphs prior to the advent of modern deep learning approaches?
Traditional approaches to classification using graph data follow the standard machine learning paradigm that was popular prior to the advent of deep learning. We begin by extracting some statistics or features - based on heuristic functions or domain knowledge - and then use these features as input to a standard machine learning classifier (e.g., logistic regression).
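As a concrete illustration, the following is a minimal sketch of that pipeline, assuming networkx and scikit-learn are available; the choice of Zachary's karate club as the dataset and of degree, eigenvector centrality, and clustering coefficient as features is purely illustrative and not prescribed by the book.

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy dataset: Zachary's karate club, using the club membership as node labels.
G = nx.karate_club_graph()
y = np.array([0 if G.nodes[v]["club"] == "Mr. Hi" else 1 for v in G.nodes])

# Hand-engineered node features: degree, eigenvector centrality, clustering coefficient.
deg = dict(G.degree())
eig = nx.eigenvector_centrality(G)
clu = nx.clustering(G)
X = np.array([[deg[v], eig[v], clu[v]] for v in G.nodes])

# Feed the extracted features into a standard classifier.
clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))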
Node-level statistics and features
Graph statistics, kernel methods, and their use for node and graph classification tasks.
Node degree. The most obvious and straightforward node feature to examine is degree. Node degree simply measures how many neighbors a node has, but this is not necessarily sufficient to measure the importance of a node in a graph. To obtain a more powerful measure of importance, we can consider various measures of what is known as node centrality, which can form useful features in a wide variety of node classification tasks.
One popular and important measure of centrality is the so-called eigenvector centrality. Whereas degree simply measures how many neighbors each node has, eigenvector centrality also takes into account how important a node’s neighbors are.
There are, of course, other measures of centrality that we could use to characterize the nodes in this graph ... betweenness centrality - which measures how often a node lies on the shortest path between two other nodes - as well as closeness centrality - which measures the average shortest path length between a node and all other nodes.
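These centrality measures are all straightforward to compute with standard graph libraries. The snippet below is a small sketch assuming networkx; the karate-club graph is again just a placeholder example.

import networkx as nx

G = nx.karate_club_graph()  # any undirected graph works here

degree      = dict(G.degree())              # raw neighbor counts
eigenvector = nx.eigenvector_centrality(G)  # importance weighted by neighbors' importance
betweenness = nx.betweenness_centrality(G)  # fraction of shortest paths that pass through a node
closeness   = nx.closeness_centrality(G)    # reciprocal of the average shortest-path distance

for v in list(G.nodes)[:5]:
    print(v, degree[v], round(eigenvector[v], 3),
          round(betweenness[v], 3), round(closeness[v], 3))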
This important structural distinction can be measured using variations of the clustering coefficient, which measures the proportion of closed triangles in a node’s local neighborhood.
The local clustering coefficient of a node u can be written as c_u = |{(v1, v2) ∈ E : v1, v2 ∈ N(u)}| / (d_u choose 2). The numerator in this equation counts the number of edges between neighbors of node u (where we use N(u) to denote the node neighborhood). The denominator counts how many pairs of nodes there are in u's neighborhood, i.e., (d_u choose 2) for a node of degree d_u. The clustering coefficient takes its name from the fact that it measures how tightly clustered a node's neighborhood is. A clustering coefficient of 1 would imply that all of u's neighbors are also neighbors of each other.
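As a sanity check on the definition, here is a small sketch that computes the coefficient directly from the formula above and compares it against networkx's built-in implementation; the helper name clustering_coefficient is just an illustrative choice.

import networkx as nx
from itertools import combinations

def clustering_coefficient(G, u):
    # Edges among N(u), divided by the number of pairs of nodes in N(u).
    neighbors = list(G.neighbors(u))
    d = len(neighbors)
    if d < 2:
        return 0.0
    closed = sum(1 for v, w in combinations(neighbors, 2) if G.has_edge(v, w))
    return closed / (d * (d - 1) / 2)

G = nx.karate_club_graph()
print(clustering_coefficient(G, 0), nx.clustering(G, 0))  # the two values should agree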
An interesting and important property of real-world networks throughout the social and biological sciences is that they tend to have far higher clustering coefficients than one would expect if edges were sampled randomly [Watts and Strogatz, 1998].
Graph-level features and graph kernels
These node- and graph-level statistics are useful for many classification tasks. However, they are limited in that they do not quantify the relationships between nodes. Approaches for measuring the overlap between node neighborhoods, discussed further below, form the basis of strong heuristics for relation prediction.
Graph kernel methods are approaches to designing features for graphs or implicit kernel functions that can be used in machine learning models. The focus here is on methods that extract explicit feature representations, rather than approaches that define implicit kernels (i.e., similarity measures) between graphs.
Bag of nodes: The simplest approach to defining a graph-level feature is to just aggregate node-level statistics. For example, one can compute histograms or other summary statistics based on the degrees, centralities, and clustering coefficients of the nodes in the graph. This aggregated information can then be used as a graph-level representation.
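For instance, a degree histogram already gives a simple bag-of-nodes representation. The sketch below assumes networkx and numpy; the bin count, degree range, and the two random-graph generators are arbitrary illustrative choices.

import networkx as nx
import numpy as np

def bag_of_nodes(G, bins=10, max_degree=20):
    # Graph-level feature vector: a normalized histogram of node degrees.
    # Other node statistics (centralities, clustering coefficients) can be added the same way.
    degrees = [d for _, d in G.degree()]
    hist, _ = np.histogram(degrees, bins=bins, range=(0, max_degree))
    return hist / hist.sum()

G1 = nx.erdos_renyi_graph(50, 0.1, seed=0)
G2 = nx.barabasi_albert_graph(50, 3, seed=0)
print(bag_of_nodes(G1))
print(bag_of_nodes(G2))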
One way to improve the basic bag of nodes approach is using a strategy of iterative neighborhood aggregation. The idea with these approaches is to extract node-level features that contain more information than just their local ego graph, and then to aggregate these richer features into a graph-level representation. Perhaps the most important and well-known of these strategies is the Weisfeiler-Lehman (WL) algorithm and kernel [Shervashidze et al., 2011, Weisfeiler and Lehman, 1968]. The basic idea behind the WL algorithm is the following:
First, we assign an initial label l^(0)(v) to each node. In most graphs, this label is simply the degree, i.e., l^(0)(v) = d(v).
Next, we iteratively assign a new label to each node by hashing the multi-set of the current labels within the node's neighborhood: l^(i)(v) = HASH({{ l^(i-1)(u) : u ∈ N(v) }}).
After running K iterations of re-labeling (i.e., Step 2), we now have a label l^(K)(v) for each node that summarizes the structure of its K-hop neighborhood. We can then compute histograms or other summary statistics over these labels as a feature representation for the graph. In other words, the
WL kernel is computed by measuring the difference between the resultant label sets for two graphs.
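The following is a minimal sketch of the WL relabeling procedure, assuming networkx; the use of Python's built-in hash on a sorted tuple of neighbor labels, and the comparison of two small example graphs, are simplifications for illustration rather than the exact kernel computation from the literature.

import networkx as nx
from collections import Counter

def wl_labels(G, K=3):
    # Step 1: initial label is the node degree, l^(0)(v) = d(v).
    labels = {v: d for v, d in G.degree()}
    # Step 2: K rounds of hashing the multiset of neighbor labels.
    for _ in range(K):
        labels = {
            v: hash((labels[v], tuple(sorted(labels[u] for u in G.neighbors(v)))))
            for v in G.nodes
        }
    return labels

def wl_histogram(G, K=3):
    # Summary statistic over the final labels, usable as a graph-level feature.
    return Counter(wl_labels(G, K).values())

G1 = nx.cycle_graph(6)
G2 = nx.path_graph(6)
print(wl_histogram(G1))
print(wl_histogram(G2))  # different label histograms distinguish the two graphs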
Another option is to simply count the occurrence of different small subgraph structures, usually called graphlets in this context. Formally, the graphlet kernel involves enumerating all possible graph structures of a particular size and counting how many times they occur in the full graph. ... An alternative to enumerating all possible graphlets is to use path-based methods. In these approaches, rather than enumerating graphlets, one simply examines the different kinds of paths that occur in the graph.
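As a toy illustration of the graphlet idea, the brute-force sketch below counts all induced subgraphs on three nodes by their number of edges; real graphlet kernels use larger graphlet sizes and far more efficient counting or sampling schemes.

import networkx as nx
from itertools import combinations

def size3_graphlet_counts(G):
    # Classify every induced 3-node subgraph by its edge count:
    # 0 = empty, 1 = single edge, 2 = path, 3 = triangle.
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for triple in combinations(G.nodes, 3):
        counts[G.subgraph(triple).number_of_edges()] += 1
    return counts

G = nx.karate_club_graph()
print(size3_graphlet_counts(G))  # brute force: fine for small graphs only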
Various statistical measures of neighborhood overlap between pairs of nodes quantify the extent to which a pair of nodes are related. For example, the simplest neighborhood overlap measure just counts the number of neighbors that two nodes share ... Local overlap measures are extremely effective heuristics for link prediction and often achieve competitive performance even compared to advanced deep learning approaches [Perozzi et al., 2014]. However, the local approaches are limited in that they only consider local node neighborhoods.
Given a neighborhood overlap statistic S[u, v], a common strategy is to assume that the likelihood of an edge (u, v) is simply proportional to S[u, v]. The Sorensen index, for example, normalizes the count of common neighbors by the sum of the node degrees, while the Jaccard overlap normalizes by the size of the union of the two neighborhoods. Normalization of some kind is usually very important; otherwise, the overlap measure would be highly biased towards predicting edges for nodes with large degrees.
The Resource Allocation (RA) index counts the inverse degrees of the common neighbors ... while the Adamic-Adar (AA) index performs a similar computation using the inverse logarithm of the degrees.
Both these measures give more weight to common neighbors that have low degree, with the intuition that a shared low-degree neighbor is more informative than a shared high-degree one.
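A small sketch of these local overlap statistics, assuming networkx; the helper local_overlap and the node pair used in the example are illustrative choices (networkx also ships its own jaccard_coefficient, resource_allocation_index, and adamic_adar_index helpers).

import math
import networkx as nx

def local_overlap(G, u, v):
    # Neighborhood-overlap statistics for a candidate edge (u, v).
    Nu, Nv = set(G.neighbors(u)), set(G.neighbors(v))
    common = Nu & Nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(Nu | Nv) if Nu | Nv else 0.0,
        # Resource Allocation: sum of inverse degrees of the common neighbors.
        "resource_allocation": sum(1 / G.degree(w) for w in common),
        # Adamic-Adar: same idea, with the inverse logarithm of the degrees.
        "adamic_adar": sum(1 / math.log(G.degree(w)) for w in common if G.degree(w) > 1),
    }

G = nx.karate_club_graph()
print(local_overlap(G, 0, 33))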
For example, two nodes could have no local overlap in their neighborhoods but still be members of the
same community in the graph. Global overlap statistics attempt to take such relationships into account.
The Katz index is the most basic global overlap statistic. To compute the Katz index we simply count the number of paths of all lengths between a pair of nodes ... β is a user-defined parameter controlling how much weight is given to short versus long paths. A small value of β would down-weight the
importance of long paths. One issue with the Katz index is that it is strongly biased by node degree.
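In matrix form, the Katz similarity can be written as S = sum_{i>=1} β^i A^i = (I - βA)^{-1} - I, which converges when β is smaller than the inverse of the largest eigenvalue of A. Below is a small sketch of this closed-form computation using numpy and networkx; the default choice of β (90% of the convergence limit) is arbitrary.

import numpy as np
import networkx as nx

def katz_index(G, beta=None):
    # S = sum_{i>=1} beta^i A^i = (I - beta*A)^{-1} - I,
    # valid when beta < 1 / lambda_max(A).
    A = nx.adjacency_matrix(G).toarray().astype(float)
    if beta is None:
        beta = 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # stay inside the convergence region
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

G = nx.karate_club_graph()
S = katz_index(G)
print(S[0, 33])  # Katz score between nodes 0 and 33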
Another set of global similarity measures considers random walks rather than exact counts of paths over the graph. For example, we can directly apply a variant of the famous PageRank approach [Page et al., 1999] - known as the Personalized PageRank algorithm - in which the random walk periodically teleports back to a fixed source node.
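A minimal power-iteration sketch of Personalized PageRank, assuming numpy and networkx; the restart probability (1 - alpha) with alpha = 0.85 and the iteration count are arbitrary choices, and networkx's own nx.pagerank(G, personalization=...) could be used instead.

import numpy as np
import networkx as nx

def personalized_pagerank(G, source, alpha=0.85, iters=100):
    # Random walk with restart: with probability alpha follow a random edge,
    # otherwise teleport back to the source node. The resulting stationary
    # distribution scores every node's similarity to the source.
    nodes = list(G.nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    A = nx.adjacency_matrix(G, nodelist=nodes).toarray().astype(float)
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions (assumes no isolated nodes)
    e = np.zeros(len(nodes))
    e[idx[source]] = 1.0
    q = e.copy()
    for _ in range(iters):
        q = alpha * (P.T @ q) + (1 - alpha) * e
    return dict(zip(nodes, q))

G = nx.karate_club_graph()
scores = personalized_pagerank(G, source=0)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])  # nodes most similar to node 0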
Graph Laplacians and Spectral Methods
Spectral clustering is one of the most well-studied algorithms for clustering or community detection on graphs, i.e., the problem of learning to cluster the nodes in a graph.
Adjacency matrices can represent graphs without any loss of information. However, there are alternative matrix representations of graphs that have useful mathematical properties. These matrix representations are called Laplacians and are formed by various transformations of the adjacency matrix. The most basic Laplacian matrix is the unnormalized Laplacian, defined as L = D - A, where A is the adjacency matrix and D is the degree matrix. ... there are also two popular normalized variants of the Laplacian. The symmetric normalized Laplacian is defined as L_sym = D^{-1/2} L D^{-1/2}, while the random walk Laplacian is defined as L_RW = D^{-1} L.
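A short sketch of these three Laplacian variants, assuming numpy and networkx and a graph with no isolated nodes (so that D is invertible); the karate-club graph is again just a placeholder.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.adjacency_matrix(G).toarray().astype(float)
D = np.diag(A.sum(axis=1))                      # degree matrix

L = D - A                                       # unnormalized Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt             # symmetric normalized Laplacian
L_rw = np.linalg.inv(D) @ L                     # random walk Laplacian

eigvals = np.linalg.eigvalsh(L)                 # eigenvalues of L in ascending order
print("smallest eigenvalues:", np.round(eigvals[:3], 4))
# The number of (near-)zero eigenvalues equals the number of connected components.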
Graph Cuts and Clustering
For a graph with multiple connected components, the eigenvectors corresponding to the 0 eigenvalue of the Laplacian can be used to assign nodes to clusters based on which connected component they belong to.
In order to motivate the Laplacian spectral clustering approach, we first must define what we mean by an optimal cluster. To do so, we define the notion of a cut on a graph: given a partition of the nodes into K non-overlapping sets, the cut value counts the edges whose endpoints lie in different sets. In other words, the cut is simply the count of how many edges cross the boundaries between the partitions of nodes. Now, one option to define an optimal clustering of the nodes into K clusters would be to select a partition that minimizes this cut value. There are efficient algorithms to solve this task, but a known problem with this approach is that it tends to simply make clusters that consist of a single node [Stoer and Wagner, 1997].
Thus, instead of simply minimizing the cut, we generally seek to minimize the cut while also enforcing that the partitions are all reasonably large. One popular way to do this is to minimize the RatioCut, which normalizes the cut value by the size of each cluster and thereby penalizes solutions with very small clusters. In summary, the second-smallest eigenvector of the Laplacian is a continuous approximation to the discrete vector that gives an optimal cluster assignment (with respect to the RatioCut).
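The sketch below turns this idea into a minimal two-way spectral clustering, splitting nodes by the sign of the eigenvector associated with the second-smallest eigenvalue of the unnormalized Laplacian (the Fiedler vector); a general K-way version would instead run k-means on the first K such eigenvectors. The thresholding at zero and the karate-club example are illustrative simplifications.

import numpy as np
import networkx as nx

def spectral_bipartition(G):
    # Split the nodes into two clusters using the sign of the Fiedler vector,
    # i.e., the eigenvector of the second-smallest Laplacian eigenvalue.
    nodes = list(G.nodes)
    A = nx.adjacency_matrix(G, nodelist=nodes).toarray().astype(float)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues returned in ascending order
    fiedler = eigvecs[:, 1]
    return {v: int(fiedler[i] > 0) for i, v in enumerate(nodes)}

G = nx.karate_club_graph()
print(spectral_bipartition(G))  # roughly recovers the two factions of the club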
The node- and graph-level statistics discussed above are limited by the fact that they require careful, hand-engineered statistics and measures. These hand-engineered features are inflexible - i.e., they cannot adapt through a learning process - and designing them can be a time-consuming and expensive process.