
RDF2Vec: RDF Graph Embeddings for Data Mining - Article Reading

Ristoski, P., Paulheim, H.: RDF2Vec: RDF Graph Embeddings for Data Mining. In: Proceedings of the 15th International Semantic Web Conference, ISWC'16 (2016) 498–514

Abstract: Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. 

In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. 

Our evaluation shows that such vector representations outperform existing techniques for the propositionalization of RDF graphs on a variety of different predictive machine learning tasks, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.

Presentation video -> http://videolectures.net/iswc2016_ristoski_rdf_graph/

KDD, LOD (input), RDF, Ontologies, SPARQL Queries

Requirements: preserve the graph structure; unsupervised; dataset- and task-independent; low-dimensional representation (compared to the size of the graph)

Adaptation of word2vec (a neural language model); graph walks (graph path patterns, not subgraph patterns) and graph kernels; an n-dimensional vector for each entity and relation; semantically similar entities appear close to each other in the vector space

Graph Walks: paths of depth d from each node are the input

Graph Kernels: isomorphism test; generates path sequences with random walks

ML tasks for evaluation: classification and regression

They used DBpedia and Wikidata for evaluation

It could also be used for alignment between DBpedia and Wikidata, as well as for linking text and semi-structured data to knowledge bases

Example from the site http://rdf2vec.org/

 

Generated random walks that resemble text sentences:

Hamburg -> country -> Germany            -> leader     -> Angela_Merkel
Germany -> leader  -> Angela_Merkel      -> birthPlace -> Hamburg
Hamburg -> leader  -> Peter_Tschentscher -> residence  -> Hamburg

For those random walks, we consider each element (i.e., an entity or a predicate) as a word when running word2vec. As a result, we obtain vectors for all entities (and all predicates) in the graph. The resulting vectors have similar properties as word2vec embeddings. In particular, similar entities are closer in the vector space than dissimilar ones, which makes those representations ideal for learning patterns about those entities. 
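As a minimal sketch of this step, assuming the gensim library (parameters here are toy settings, not the configurations from the paper):

```python
from gensim.models import Word2Vec

# The three walks above, tokenized: each entity or predicate is one "word".
walks = [
    ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"],
    ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"],
    ["Hamburg", "leader", "Peter_Tschentscher", "residence", "Hamburg"],
]

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
model = Word2Vec(sentences=walks, vector_size=100, window=5, sg=1, min_count=1)

print(model.wv["Angela_Merkel"][:5])     # latent vector of an entity
print(model.wv.most_similar("Hamburg"))  # nearest entities in the vector space
```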

The authors compared different knowledge graph embedding methods on their suitability for separating classes in a knowledge graph. They showed that RDF2vec outperforms other embedding methods like TransE, TransH, TransD, ComplEx, and DistMult, in particular on smaller classes. On the task of entity classification, RDF2vec shows results which are competitive with more recent graph convolutional neural networks.

1 Introduction
Most data mining algorithms work with a propositional feature vector representation
of the data, i.e., each instance is represented as a vector of features
f1, f2, . . . , fn, where the features are either binary (i.e., fi ∈ {true, false}),
numerical (i.e., fi ∈ R), or nominal (i.e., fi ∈ S, where S is a finite set of symbols).
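For instance, a city entity might be propositionalized as the following (purely hypothetical) feature vector:

```python
# One instance as a fixed-length vector of features f1, f2, f3:
instance = {
    "has_airport": True,       # binary:    f1 ∈ {true, false}
    "population":  1841000.0,  # numerical: f2 ∈ R
    "country":     "Germany",  # nominal:   f3 ∈ S, a finite set of country names
}
```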
In this work, we adapt language modeling approaches for latent representation
of entities in RDF graphs. To do so, we first convert the graph into a
set of sequences of entities using two different approaches, i.e., graph walks and
Weisfeiler-Lehman Subtree RDF graph kernels. In the second step, we use those
sequences to train a neural language model, which estimates the likelihood of
a sequence of entities appearing in a graph. Once the training is finished, each
entity in the graph is represented as a vector of latent numerical features.
The generation of the entities’ vectors is task and dataset independent, i.e.,
once the vectors are generated, they can be used for any given task and any
arbitrary algorithm, e.g., SVM, Naive Bayes, Random Forests, Neural Networks,
KNN, etc. Also, since all entities are represented in a low dimensional feature
space, building machine learning models becomes more efficient.
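A hedged sketch of this reuse, assuming entity vectors have already been trained (e.g., the gensim `model` from the snippet above) and that class labels exist for a few entities; the labels here are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Pre-trained entity vectors become ordinary feature vectors; any
# classifier (SVM, Naive Bayes, Random Forest, KNN, ...) can consume them.
entities = ["Hamburg", "Germany", "Angela_Merkel"]
labels   = ["City", "Country", "Person"]

X = np.stack([model.wv[e] for e in entities])
clf = SVC().fit(X, labels)

print(clf.predict(model.wv["Peter_Tschentscher"].reshape(1, -1)))
```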
2 Related Work
Various approaches for generating data mining features from Linked Open Data have been proposed. Many of those approaches are supervised,
i.e., they let the user formulate SPARQL queries, and fully automatic feature
generation is not possible. 
A similar problem is handled by kernel functions, which compute the distance
between two data instances by counting common substructures in the
graphs of the instances, i.e., walks, paths and trees.
In the past, many graph
kernels have been proposed that are tailored towards specific applications [7], or
towards specific semantic representations [5]. Only a few approaches are general
enough to be applied on any given RDF data, regardless of the data mining task.
Our work is closely related to the approaches DeepWalk [22] and Deep Graph
Kernels
[35]. DeepWalk uses language modeling approaches to learn social representations
of vertices of graphs by modeling short random-walks on large social
graphs, like BlogCatalog, Flickr, and YouTube. The Deep Graph Kernel approach
extends the DeepWalk approach, by modeling graph substructures, like
graphlets, instead of random walks. The approach we propose in this paper differs
from these two approaches in several aspects. First, we adapt the language
modeling approaches to directed labeled RDF graphs, as opposed to the undirected
graphs used in those approaches. Second, we show that task-independent
entity vectors can be generated on large-scale knowledge graphs, which can later
be reused on a variety of machine learning tasks on different datasets.
3 Approach
In the case of RDF graphs, we consider entities and relations between entities instead of word sequences. Thus, in order to apply
such approaches on RDF graph data, we first have to transform the graph data
into sequences of entities, which can be considered as sentences. Using those sentences,
we can train the same neural language models to represent each entity
in the RDF graph as a vector of numerical values in a latent feature space.
3.1 RDF Graph Sub-structures Extraction
We propose two general approaches for converting graphs into a set of sequences
of entities, i.e., graph walks and Weisfeiler-Lehman Subtree RDF Graph Kernels.
To generate the walks, we use the breadth-first search (BFS) algorithm.
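A minimal sketch of how such breadth-first walk extraction could look, over a toy adjacency structure built from (subject, predicate, object) triples (not the authors' implementation):

```python
from collections import defaultdict

# Toy triples; a real run would load an RDF graph (e.g., via rdflib).
triples = [
    ("Hamburg", "country", "Germany"),
    ("Germany", "leader", "Angela_Merkel"),
    ("Angela_Merkel", "birthPlace", "Hamburg"),
]

adj = defaultdict(list)            # outgoing edges: s -> [(p, o), ...]
for s, p, o in triples:
    adj[s].append((p, o))

def bfs_walks(start, depth):
    """All walks of up to `depth` hops from `start`, expanded
    breadth-first; each hop appends one predicate and one object."""
    walks, frontier = [], [[start]]
    for _ in range(depth):
        frontier = [walk + [p, o] for walk in frontier for p, o in adj[walk[-1]]]
        walks.extend(frontier)
    return walks

print(bfs_walks("Hamburg", 2))
```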
Weisfeiler-Lehman Subtree RDF Graph Kernels. In this approach, we use the subtree RDF adaptation of the Weisfeiler-Lehman algorithm presented
in [32,34]. The Weisfeiler-Lehman Subtree graph kernel is a state-of-the-art,
efficient kernel for graph comparison [30]. The kernel computes the number of
sub-trees shared between two (or more) graphs by using the Weisfeiler-Lehman
test of graph isomorphism. This algorithm creates labels representing subtrees
in h iterations.
There are two main modifications of the original Weisfeiler-Lehman graph
kernel algorithm to make it applicable to RDF graphs [34].
The procedure of converting the RDF graph to a set of sequences of tokens
goes as follows: (i) for a given graph G = (V,E), we define the Weisfeiler-Lehman
algorithm parameters, i.e., the number of iterations h and the vertex subgraph
depth d, which defines the subgraph in which the subtrees will be counted for the
given vertex; (ii) after each iteration, for each vertex v ∈ V of the original graph
G, we extract all the paths of depth d within the subgraph of the vertex v on the
relabeled graph.
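A compressed sketch of the core Weisfeiler-Lehman relabeling loop on a plain labeled graph; the RDF adaptation in [32,34] differs in details (edge labels, label tracking across iterations), so this only conveys the idea:

```python
def wl_relabel(labels, neighbors, h):
    """labels: {node: label}; neighbors: {node: [adjacent nodes]}.
    In each of the h iterations, a node's new label compresses its own
    label together with the sorted labels of its neighbors, so equal
    labels after iteration i indicate isomorphic subtrees of depth i."""
    compress = {}
    for _ in range(h):
        new_labels = {}
        for v, lab in labels.items():
            signature = (lab, tuple(sorted(labels[u] for u in neighbors[v])))
            new_labels[v] = compress.setdefault(signature, len(compress))
        labels = new_labels
    return labels

# Toy usage: two nodes labeled by type, connected to each other.
print(wl_relabel({"a": "City", "b": "Country"}, {"a": ["b"], "b": ["a"]}, h=2))
```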
3.2 Neural Language Models – Word2vec
CBOW and Skip-gram model
Once the training is finished, all words (or, in our case, entities {both nodes and edges}) are projected
into a lower-dimensional feature space, and semantically similar words
(or entities) are positioned close to each other.
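"Positioned close to each other" is typically measured with cosine similarity; a small sketch (the commented entity names are placeholders for a trained model):

```python
import numpy as np

def cosine(u, v):
    # 1.0 for vectors pointing the same way, near 0 for unrelated ones
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g., with the gensim model above, one would expect
# cosine(model.wv["Germany"], model.wv["Hamburg"]) to exceed
# cosine(model.wv["Germany"], model.wv["residence"])
```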
4 Evaluation
3 Small domain-specific RDF Datasets ... the value of a certain property is used as a classification target
2 Large RDF Datasets. As large cross-domain datasets we use DBpedia and Wikidata. In our evaluation we
only consider object properties.
We use the entity embeddings on five different datasets from different domains,
for the tasks of classification and regression.
We compare our approach to several baselines. For generating the data mining
features, we use three strategies that take into account the direct relations
to other resources in the graph [20], and two strategies for features derived from
graph sub-structures
Classification: From the results we can observe that the K2V approach outperforms
all the other approaches. More precisely, using the skip-gram feature vectors
of size 500 in an SVM model provides the best results on all three datasets.
The W2V approach on all three datasets performs closely to the standard graph
substructure feature generation strategies, but it does not outperform them. K2V
outperforms W2V because it is able to capture more complex substructures in
the graph, like sub-trees, while W2V focuses only on graph paths.
We can observe that the latent vectors extracted from DBpedia and Wikidata outperform all of the standard
feature generation approaches. In general, the DBpedia vectors work better than
the Wikidata vectors, where the skip-gram vectors with size 200 or 500 built on
graph walks of depth 8 on most of the datasets lead to the best performances.
On both tasks, we can observe that the skip-gram vectors perform better
than the CBOW vectors. Also, the vectors with higher dimensionality and paths
with bigger depth on most of the datasets lead to a better representation of the
entities and better performances. However, for the variety of tasks at hand, there
is no universal approach, i.e., embedding model and a machine learning method,
that consistently outperforms the others.
To analyze the semantics of the vector representations, we employ Principal
Component Analysis (PCA) to project the entities’ feature vectors into a two-dimensional
feature space. We selected seven countries and their capital cities,
and visualized their vectors. ... ability of the model to automatically organize entities
of different types, and preserve the relationship between different entities.
For example, we can see that there is a clear separation between the countries
and the cities, and the relation “capital” between each pair of country and the corresponding capital city is preserved. Furthermore, we can observe that more
similar entities are positioned closer to each other, e.g., we can see that the
countries that are part of the EU are closer to each other, and the same applies
for the Asian countries.
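A hedged sketch of that PCA projection with scikit-learn and matplotlib, assuming a trained `model` and hand-picked entities (the names here are illustrative, not the paper's exact selection):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

entities = ["Germany", "Berlin", "France", "Paris", "Japan", "Tokyo"]
X = [model.wv[e] for e in entities]

# Project the latent vectors onto their first two principal components
coords = PCA(n_components=2).fit_transform(X)

for (x, y), name in zip(coords, entities):
    plt.scatter(x, y)
    plt.annotate(name, (x, y))
plt.show()
```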
We can observe that the number of generated features sharply
increases when adding more samples in the datasets, especially for the strategies
based on graph substructures. However, the number of features remains the same
when using the RDF2Vec approach, independently of the number of samples in
the data. Thus, by design, it scales to larger datasets without increasing the
dimensionality of the dataset.
5 Conclusion 
So far we have considered only simple machine learning tasks, i.e., classification
and regression, but in the future work we would extend the number of
applications. For example, the latent representation of the entities could be used
for building content-based recommender systems [4]. The approach could also be
used for link predictions, type prediction, graph completion and error detection
in knowledge graphs [19], as shown in [15,17]. Furthermore, we could use this
approach for the task of measuring semantic relatedness between two entities,
which is the basis for numerous tasks in information retrieval,
natural language
processing, and Web-based knowledge extractions [6]. 

More on Weisfeiler-Lehman Neural Machine for Link Prediction:

https://youtu.be/dRC4T2gABS8
