edX @ From Graph to Knowledge Graph – Algorithms and Applications / Graph Properties and Applications

Microsoft: DAT278x: From Graph to Knowledge Graph – Algorithms and Applications

Module 2: Graph Properties and Applications

Graph basics

Our everyday life has been strongly connected in the form of graphs or networks. It is therefore timely critical to study how we connect to each other and how digitalized human connections in social graph affect our humanity.

Graph history

Let's get back to history, the concept of graphs can be traced back to the the year 1736, the year in which, Euler introduced the problem "Seven Bridges of Konigsberg." The puzzle that Euler was interested in is then transformed as the following question, can we find a path transversing from node to node by following the green lines with the requirement that each line is passed through exactly once?

After the introduction of the world's first graph problem, let us look at when the term graph started. It was first documented by James Sylvester in 19th century. Formally a graph is denoted as a G where covers two sets of objects; one is the node set V and the other is edge set E. Each edge connect two nodes in the nodes set V.

The number of nodes in a graph G is then called the order of the graph.

The number of edge is called the size of the graph.

Formally, given a graph we can represent it as a two dimensional matrix A with each row and each column corresponding to a node. (Matriz de Adjacência)

Each element A{i,j} then represent linkage between node i and node j.

Graph Type

Formally, an undirected graph means that when edge eij belongs to the edge set E, then eji also belongs to the edge set. For a directed graph, when eij belongs to the edge set E, it is not necessary that eji also belongs to E.

Furthermore, when a graph has no self-loops at each node and also has no multiple edge between any two nodes, we call it a simple graph.

When a graph has multiple edges but has no self loops, we call it a multigraph.

Graph G is a K-partite graph if its node set can be partitioned into K groups starting at every edge in the Graph G has its two ending nodes in different groups. Notice that K is required to be larger than or equal to two.

Basic graph properties

Basically, the degree of node is defined as the number of its neighbors in the graph. Formally, we can also get each node's degree from graph adjacency matrix.The sum of each row or column is actually the degree of the corresponding node.

Out-degree of one node v_i represent the number of neighbors connected from v_i, and the in-degree of v_i is to counts the number of neighbors that point to v_i.

With the concept of node degree, let us look at the first theorem of graph theory. It is called the Handshaking Lemma, which was first introduced by Euler in the same year, 1736. Basically, the Handshaking Lemma says that the sum of all node's degree in a (simple?) graph is equal to two times number of edges.

From the Handshaking Lemma, we can also conclude that every graph has an even number of vertices with odd degree (um número par de vértices com grau ímpar).

Centrality

In graph, one of the important problem is to measure the importance of nodes. This measurement is usually called node centrality. Node degree centrality indicates that with more neighbors or connections, more important this node is. Therefore, we can rank all nodes based on their degree.

To address this issue (of node degree centrality), the clustering coefficient centrality is introduced. It codifies how one node's neighbors are connected with each other. Formally, it is defined as the ratio between two terms:

Different from node degree centrality, clustering coefficient look at the connectivity of one's neighbor.

The neighborhood connectivity centrality is formally defined as average degree of one's neighbors.

Eigenvector centralities

To formalize this repeating process (to calculate neighborhood connectivity centrality), the Eigenvector centrality is introduced next. First, let us use Xi to denote importance of a node Vi and let us iteratively achieve Xi by averaging over its neighbor's importance Xj, where it's neighbor's importance is also achieved in the same iterative way.

Matrix "A" is the graph adjacency maxtrix .

Based on the simple linear algebra, we know that X is one of Matrix A's eigenvector and lambda is its associated eigenvalue. By definition, the importance of each node is required to be non-negative, leaving lambda to be the largest eigenvalue of Matrix A.

The eigenvector centrality works perfect for undirected graphs. However, it does not work well for directed acyclic graphs (DAG).

One famous solution to this issue is the PageRank centrality. The first step of PageRank is very similar to the Eigenvector centrality, that is, to sum over one's neighbors importance. The difference lies in the treatment to which neighbor's importance by the number of out-links each neighbor has.

HITS centrality

Another famous node centrality for directed graph. It is called the HITS algorithm, short for hyperlink-induced topic search algorithm which was introduced by Jon Kleinberg.

The web directory page serve as the information hub. Therefore for each webpage it is associated with two important centrality scores. One is Hubness centrality. The other one is Authority centrality.

Therefore, the authority of a node v_i is defined as the sum over its in-link neighbors' Hubnuss scores.

Formally, its' hubness centrality is defined as the sum or as out-link-neighbors authority centrality.

It means that the two matrix A times A transform and A transform times A have the same eigenvalues. With the same largest eigenvalue lambda, the authority centrality is then eigenvector of matrix A times matrix A transform. And the hub centrality is the eigenvector of matrix A transform times matrix A.

Graph theory

Graphical realization problem: can all integer sequence form a graph?

A sequence of non negative integers, called the graph's degree sequence T, is graphical if a simple graph G can be constructed with its degree sequence as an integer list.

The theoretical problem we care about here is how can we tell whether a sequence is graphical or not.
Then the next question is if one sequence is indeed graphical how can we have its graphical realization?

Graph isomorphism problem:

Basically, this problem cares about the question of whether two graphs have the same structure.

Graph applications

Node label classification

We are given a graph with node set V and edge set E. Some nodes are associated with labels and for the remaining nodes we don't know their label information. So goal is to infer those unknown nodes' labels.

Usually this problem is formulized into a data mining task such as classification and regression. (Existem outras abordagens para tratar esse problema?)

When inferring discrete node labels such as gender prediction, it is a classification task.
When inferring continuous node labels such as users' age, it is a regression task.
Either way, as a data mining task, the common practice is to define, and then extract node features from the graph's structures. With the feature matrix X constructed, we can then learn a classification model, such as logistic regression, support vector machine and the random forest. If the node labels are continuous values, we can instead learn a regression model. With the trained models, we can then infer those unknown nodes' labels.

Node label classification in MAG

Shen, Ma, Wang. A Web-scale system for scientific knowledge exploration. In ACL 2018.

Podem ser definidas regras para inferir node labs como, por exemplo, SE X tem genero Fem e X é casado como Y ENTÃO Y tem gênero Masc

Community detection

The community detection is referred to as the node clustering task in the context of data mining or machine learning. Community detection is key to understand the structures and the organization of a graph. A community in graph is usually considered as one set of nodes that can be grouped together easily.

How do we detect communities? One of the common practice is to transform the community detection problem as a node clustering task in data mining. Basically we follow the same procedure in node label classification to extract features for each node. Then we can apply the clustering algorithm such as k-means to the feature matrix in that way we can class these nodes into several clusters.
Each cluster corresponding to one community.

The basic criteria is to minimize number of connections between two communities and at the same time maximize numbers of connections within each community.

Reduzir o número de arestas entre nós de grupos diferentes.

The spectral graph clustering techniques, which also serves as a theoretical understanding of community detection in graphs.

First, let us introduce graph laplacian matrix. It is achieved by two matrix. The first one is adjacency matrix A and the second one is the degree matrix D. The degree matrix is an n times n diagonal matrix, where n is the number of nodes in the graph. Each element on the diagonal line simply represents the degree of one node in the graph.

A matriz laplaciana é o resultado da matriz de graus (matriz diagonal) menos a matriz de adjacencias (matriz simétrica).

It is a symmetric matrix.
The sum of each row and each column is zero.
The graph laplacian matrix is positive semi definite and it's eigenvalues are non-negative and as a result it's eigenvectors are real and orthogonal.

Given the laplacian matrix of graph we can leverage it to partition the graph into two communities. We first need to find it's eigenvalues lambda and it's corresponding eigenvectors, x by eigenvalue decomposition. It turns out that in this specific eigenvector, some of it's entries are positive and some of them are negative. Now usually we can split nodes into two clusters by using zero as a splitting point.

Em problemas do mundo real pode ser necessário separar em mais de dois grupos então:

How do we define the number of communities in a graph?
How do we partition a graph into multiple communities?

É possível particionar em dois grupos e depois repetir recursivamente para cada grupo até não encontrar mais um particionamento possível

Link prediction problem

Recomendar itens em sistemas de recomendação, conecções em redes sociais

Given two nodes vi and vj that are not connected in the graph right now, we would like to infer whether a link will form between these two nodes. The most effective way to predict whether a link will form between two nodes is to measure the structural similarity between these two nodes.

Similaridade pode ser a medida do número de vizinhos em comum entre dois nós (ou a razão entre a interseção dos vizinhos e a união dos vizinhos)

Another proposed node similarity metric to penalize common neighbors with large degree centrality. For example, the Adamic Adar node similarity.

Ou seja, se os vizinhos em comum entre dois nós i e j tiverem muitos vizinhos (alto grau de centralidade) isso não implica que deve haver uma ligação entre i e j (os vizinhos em comum podem ser hubs)

Se os vizinhos em comum estiverem conectados, existe uma grande chance que deve haver uma ligação entre i e j

Heterogeneous link prediction problem

Microsoft Academic Graph: how to predict author collaboration links between different fields? For example, the circle nodes represents computer science authors, and the square nodes represent the physics authors. We know the collaboration that within each of these two fields. The goal is then to infer the interdisciplinary collaborations between computer science and physics authors.

Microsoft Academic Graph (MAG), where authors are connected to each other if they collaborate on a research publication, which naturally connected authors and publications as well. Each author is also affiliated with their institutions, linking the author nodes with affiliation nodes as well. Research papers are published with different topics at academic venues, such as journals and conference proceedings. Therefore, the paper nodes can also be connected with the journal nodes, conference nodes, and fields of study nodes. In author collaboration graph, the degree of author node represented the number of collaborators each author has. In paper citation graph, the in-degree of a paper node represents number of citations it collects. The out-degree of this paper node represents the number of reference it covers

Em álgebra linear, um escalar λ diz-se um autovalor de um operador linear A : V->V se existir um vetor x diferente de zero tal que Ax=λx. O vetor x é chamado autovetor. Os autovalores de uma dada matriz quadrada A de dimensão n x n são os n números que resumem as propriedades essenciais daquela matriz. O autovalor de A é um número λ tal que, se for subtraído de cada entrada na diagonal de A, converte A numa matriz singular (ou não-invertível). Subtrair um escalar λ de cada entrada na diagonal de A é o mesmo que subtrair λ vezes a matriz identidade I de A. Portanto, λ é um autovalor se, e somente se, a matriz (A - λI) for singular.

https://pt.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-example-solving-for-the-eigenvalues-of-a-2x2-matrix

https://pt.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-eigenvalues-of-a-3x3-matrix

https://pt.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-eigenvectors-and-eigenspaces-for-a-3x3-matrix

Pesquisa de Doutorado da Veronica

Pesquisar este blog

edX @ From Graph to Knowledge Graph – Algorithms and Applications / Graph Properties and Applications

Marcadores

Comentários

Postar um comentário

Postagens mais visitadas deste blog

Connected Papers: Uma abordagem alternativa para revisão da literatura

Knowledge Graph Embedding with Triple Context - Leitura de Abstract

Exploratory Search: From Finding to Understanding - Leitura de Artigo