Query Embedding on Hyper-relational Knowledge Graphs

Query Embedding on Hyper-relational Knowledge Graphs - Leitura de Artigo

Alivanistos, Dimitrios, Max Berrendorf, Michael Cochez and Mikhail Galkin. “Query Embedding on Hyper-relational Knowledge Graphs.” ArXiv abs/2106.08166 (2021): n. pag.

GitHub -> https://github.com/DimitrisAlivas/StarQE

Abstract

Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries.

[Multi-hop é query usando embeddings e não Link Prediction]

In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs.

[Abordagens existentes de embeddings com triplas não usam qualificadores ou tratam as triplas da reificação da mesma forma que outras triplas. Mas na abordagem que estamos propondo a inclusão do contexto não necessariamente irá reduzir o conjunto resposta.]

In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs allowing to tackle this new type of complex queries. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns

[Consultas CONJUNTIVAS respondidas com embeddings - sub-grafos conexos]

1 Introduction

Query embedding (QE) on knowledge graphs (KGs) aims to answer logical queries using neural reasoners instead of traditional databases and query language.

[User embeddings ao invés de GQL]

On the other hand, QE bypasses the need for a database or query engine and performs reasoning directly in a latent space by computing a similarity score between the query representation and entity representations.

[Mas poderia considerar vectors databases ... Esse caso também resolve com o KGTK]

A query representation is obtained by processing its equivalent logical formula where joins become intersections (∧), and variables are existentially quantified (∃). A flurry of recent QE approaches [ 13, 20 , 19 , 17, 2] expand the range of supported logical operators and graph patterns.

In contrast, an increasing amount of publicly available [29 , 23 ] and industrial KGs adopt a hyper-relational modeling paradigm where typed edges may have additional attributes, in the form of key-value pairs, known as qualifiers. Several standardization efforts embody this paradigm, i.e., RDF* [ 14 ] and Labeled Property Graphs (LPG), with their query languages SPARQL* and GQL, respectively. Such hyper-relational queries involving qualifiers are instances of higher-order logical queries, and so far, there has been no attempt to bring neural reasoners to this domain.

[Motivação para trabalhar com o modelo hyper-relational]

Second, we build upon recent advancements in Graph Neural Networks (GNNs) and propose a method to answer conjunctive hyper-relational queries in a latent
space.

[Contribuição em Query Embeddings]

QUERY2BOX [20], proposed to represent queries as hyper-rectangles instead of points in a latent space. Geometrical operations on those rectangles allowed to answer queries with disjunction (∨) re-written in the Disjunctive Normal Form (DNF).

[Disjunção]

Still, all of the described approaches are limited to triple-based KGs while we extend the QE problem to the domain of hyper-relational KGs. As our approach is also based on query graphs, we adopt and further study some of MPQE observations as to query diameter and generalization.

[MPQE -representar as queries como DAGs]

3 Hyper-relational Knowledge Graphs and Queries

Definition 3.1 (Hyper-relational Knowledge Graph). Given a finite set of entities E, and a finite set of relations R, let Q = 2(R×E). Then, we define a hyper-relational knowledge graph as G = (E, R, S), where S ⊂ (E × R × E × Q) is a set of (qualified) statements.

[Definem Statements como parte do hyper-relational]

Hyper-relational KGs extend traditional KGs by enabling to qualify triples.
For instance, consider the statement (AlbertEinstein, educated_at, ETHZurich, {(degree, BSc)}). Here, the qualifier pair (degree, BSc) gives additional context on the base triple (AlbertEinstein, educated_at, ETHZurich)

Note that this statement can equivalently be written in first order logic (FOL). Specifically, we can write it as a statement with a parameterized function symbol educated_at{(degree:BSc)}(AlbertEinstein, ETHZurich). Then, we note that Q is a finite set, meaning that also the number of different parameterizations for the function symbol is finite, which means that this becomes a first order logic statement. In this formalism, the KG is the conjunction of all FOL statements.

[FOL: (AlbertEinstein, educated_at, ETHZurich, {(degree, BSc)}) = educated_at{(degree:BSc)}(AlbertEinstein, ETHZurich)]

In this work, we restrict the qualifiers to respect monotonicity in the KG in the following sense:
(h, r, t, qp) ∈ G ∧ qp′ ⊆ qp =⇒ (h, r, t, qp′) ∈ G
This implies that if a qualified statement in the KG has a set of qualifier pairs, then the KG also contains the qualified statement with any subset thereof. As a result, some types of qualifying nformation cannot be used, e.g., a qualifier indicating that a relation does not hold (i.e., negation). This breaks monotonicity since the existence of such a statement would further imply the existence of that statement without the qualifier, leading to contradiction.

[Não consideram a negação do statement. Em KG hiper relacional deve ser possível representar inclusive statements contraditórios. Na abordagem que estamos propondo pode ser possível criar uma interpretação para as contradições]

Definition 3.2 (Hyper-relational Query). Let V be a set of variable symbols, and TAR ∈ V a special variable denoting the target of the query. Let E+ = E ] V. Then, any subset Q of (E+ × R × E+ × Q) is a valid query if its induced graph 1) is a directed acyclic graph, 2) has a topological ordering in which all entities (in this context referred to as anchors) occur before all variables, and 3) TAR must be last in the topological orderings.

[Entender melhor essa definição]

The answers AG (Q) to the query Q are the entities e ∈ E that can be assigned to TAR, for which there exist a variable assignment for all other variables occurring in the query graph, such that the instantiated query graph is a subgraph of the complete graph G. From our definition of KG, queries, and the monotonicity requirement, we can derive that adding more qualifiers to the statements of a query can only reduce the set of possible answers to the query.

Problem Definition. Given the incomplete KG G (part of the not available complete KG ˆG) and a query Q. Rank all entities in G such that answers to the query, if it were asked in the context of the complete KG ˆG, are at the top of the ranking.

[Considera que o KG não é completo]

Since the given KG is not complete, we cannot solve this problem directly as a graph matching problem as usually done in databases. Instead, we compute a latent representation of the query, such that it is close to the embeddings of entities which are the correct answers to the query.

[Usar embeddings para lidar com a incompletudo do grafo ... respostas aproximadas da melhor respostas possível poderia usar embeddings também]

4 Model Description

Our model is not trained directly on the KG, but rather with a set of queries sampled from it

To score answer entity candidates, we use the similarity of the query representation and the entity representation, i.e., score(Q, e) = sim(xQ, E∗[e]), for a simple vector similarity function, such as the dot product, or cosine similarity.

[Similaridade do Cosseno entre os vetores q e h]

5 Experiments

In this section, we empirically evaluate the performance of query answering over hyper-relational KGs. We design experiments to tackle the following research questions: RQ1) Does query answering performance benefit from qualifiers? RQ2) What are the generalization capabilities of hyper-relational query answering? RQ3) Does query answering performance depend on the physical representation of a hyper-relational KG, i.e., reification?

[Experimentos visam performance, no sentido de eficácia?]

5.1 Dataset

new hyper-relational QE dataset based on WD50K comprised of Wikidata statements, with varying numbers of qualifiers.

WD50K QE. We introduce hyper-relational variants of 7 query patterns commonly used in the related work [13, 20, 2, 24] where each edge can have [0, n] qualifier pairs. The patterns contain projection queries (designated with -p), intersection queries (designated with -i), and their combinations. Note that the simplest 1p pattern corresponds to a well-studied link prediction task. Qualifiers enable more flexibility in query pattern construction, i.e., we further modify formulas by conditioning the existence of qualifiers and their amount over a particular edge.

[Abordagem de geração das consultas]

we want hyper-relational QE models to be robust to the underlying graph topology. For this reason, we ship dataset queries in both formats, i.e., hyper-relational RDF* and reified with triples, and study the performance difference in a dedicated experiment.

[Comparação entre hyper relacional e reificação]

5.2 Evaluation Protocol

In all experiments, we evaluate the model in the rank-based evaluation setting. Each query is encoded into a query embedding vector xq ∈ R. We compute a similarity score sim(xq, E[e]) between the query embedding and the entity representation for each entity e ∈ E. The rank r ∈ N+ of an entity is its position in the list of entities sorted decreasingly by score.

[Resposta ordenada, a abordagem de melhor resposta não faz ranking]

5.3 Hyper-relational Query Answering

As we are the first to introduce the problem of hyper-relational query answering, there is no established baseline available at of the time of writing. Hence, we compare our method to several strong baselines.

Discussion.

The results shown in Table 1 indicate that STARQE, in general, is able to tackle hyper-relational queries of varying complexity. That is, the performance on complex intersection and projection queries is often higher than that of a simple link prediction (hr-1p). Particularly, queries with intersections (−i) demonstrate outstanding accuracy.

[A abordagem não é a melhor sempre, vários cenários de BGP testados]

We observe a comparative performance running the Reification baseline. It suggests that our QE framework is robust to the underlying graph topology retaining the same logical interpretation of a complex query. We believe it is a promising sign of enabling hyper-relational query answering on a wide range of physical graph implementations.

[Usou a reificação padrão RDFS mas existe várias opções]

5.4 Generalization

Following the related work, we experiment with how well our approach can generalize to complex query patterns if trained on simple ones. Note that below we understand all query patterns as hyper-relational, i.e., having the hr- prefix.

6 Limitations & Future Work

Secondly, our work assumes monotonicity of the qualifiers. Working without this assumption is an interesting future direction.

[Aceitar negação e afirmações contraditórias]

Thirdly, there are several other logical operators we could include. Not only negation (which could be included as a qualifier), but also disjunctions, cardinality constraints, etc.

[Operadores da query]

Another interesting next step is to allow for variables in more positions of the query. In this work, we only allow variables in the head and tail positions. Nonetheless, one can also formulate more general graph queries with variables in the qualifier value, and even in the relation position.

[Somente node1 e node2 (source e target), não tratou predicados (relações e propriedades), literais (valores de propriedades) nem qualificadores e seus valores (node3 ou literal)]

One more aspect to be looked at are the many operators and possibilities which query languages like SPARQL have. For example, queries including paths, aggregations, sub-queries, filters on literals, etc.

[Também não estou incluindo operadores, somente vou trabalhar com BGP]

Further research work is also needed in the domain of explainability. The system does currently produce an answer without providing an explanation. One potential solution is to not only look for the target of the query, but also provide answers for the intermediate variables.

[A resposta na nossa abordagem são statements contextualizados derivados da query, é auto explicativo. Mas se incluir embeddings como explicar?]

A final interesting research direction is the use of approximate query answering for the creation of query plans. If an set of likely correct answers can be determined fast, this information could be used to choose a more optimal plan for exact query execution.

[Como derivar uma GQL generalizada de um conjunto de respostas em sub-grafos]

7 Conclusion

In this work, we have studied and addressed the extension of the multi-hop logical reasoning problem to hyper-relational KGs. We touched upon the theoretical considerations of having qualifiers in the context of query answering and discussed the effects it can have on query answering such as cardinality of the answer set. We also proposed the first hyper-relational QE model, STARQE, based on a Graph Neural Network encoder to work in this new setup.

[Query embeddings para Hyper relational]

Further, we introduced a new dataset, WD50K QE, with hyper-relational variants of 7 well studied query patterns and analysed the performance of our models on each of them individually and on average. Our results suggest that qualifiers indeed help to obtain more accurate answers compared to triple-only graphs.

[Dataset com qualificadores]

A further issue is the assumption of this and related techniques that the data used for training only contains true facts. If this data is (intentionally) biased or erroneous to start with, further biased or wrong information will be derived, which could lead to harmful conclusions and decisions. Besides, even if the input data to the system is correct, the system could due to its imperfections still derive incorrect answers to queries, which might, again lead to wrong conclusions. Hence, users of this technology must be made aware that answers to their queries are always approximate, and this technology ought not to be used in a process where correctness is critical.

[Afirmações em KG podem ser verdadeiras ou falsas, depende de quem vai consumir, para o que vai ser usado tanto quanto de quem as criou]

Appendix

A Types of Graphs

Hyper-relational Graphs KGs used in our work are a subset of hyper-relational graphs. In general, hyper-relational graphs can also contain literal values, like integers and string values as qualifier values. Labeled property graphs are another example of a subset of hyper-relation graphs, but they typically only allow literals as qualifier values.

[LPG é hiper relacional somente para arestas em nós e só permite literais nos qualificadores]

The Wikidata model [9] and RDF* [ 14 ] allows for both literals and entities as values. Many implementations of RDF* do make the monotonicity assumption, which we also made in the paper. Some will also merge statements in case they have the same main triple. The resulting statement will have the same main triple, but as qualifier information, the union of the sets of qualifier pairs of the original statements.

[RDF-STAR não é multigrafo]

Hypergraphs are another type of graphs, initially proposed by Berge [3], where hyperedges link one or more vertices. These hyperedges group together sets of nodes in a relation, but lose information of the particular roles that entities play in this relation. Besides, each different composition of elements in the relation introduces a new hyperedge type, causing a combinatorial explosion of these types.

[perda semantica do hiper grafo]

Reified Graphs and Hyper-relational have equivalent expressive power compared to labelled directed graphs. The reason is that we can define a bijection between these graphs. However, depending on the use case, the different graph types have benefits. When we convert a hyper-relational graph to an RDF graph, we end up with what is called the reified graph. There are multiple approaches on how to perform this conversion.

[mesmo sendo possível converter, as queries ficam mais complexas - verbosidade]

B The WD50K-QE Dataset

To sample our queries, our data are hosted on a graph database with support for hyper-relational data. For our use-case, we use anzograph

[AnzoGraph tem suporte a RDF-Star]

WD50K is a dataset created to counter the flaws of other hyper-relational datasets. It is publicly available by the authors, in CSV format. In order to utilise it for our work, we started by converting the dataset from CSV to RDF*.

[GitHub -> https://github.com/migalkin/StarE/tree/master/data/clean]

D Size of the Answer Set of a Query with more Qualifiers

When we have a query with or without qualifiers, and we add more qualifiers, the number of answers to this query can only become smaller. Intuitively, this happens because the query becomes more specific.

[Se o qualificador for usado para filtrar sim. Mas na melhor resposta se não encontrar o qualificador são geradas respostas aproximadas]

Comentários

Veronica dos Santos14 de julho de 2022 às 08:52
Na versão camera ready do "The Tenth International Conference on Learning Representations
ICLR 2022" foi criado um baseline chamado "Oracle" para comparação do desempenho QA da proposta STARQE:

Oracle. If we take away the qualifier information, we could compare with several QE approaches (e.g. GQE, MPQE, Query2box, BetaE, EmQL, and CQD). The Oracle setup, is not just an upper bound to these, but to all possible, non-hyper-relational QE models. It simulates the best possible QE model that has perfect link prediction and ranking capabilities, but without the ability to use qualifier information. Using this setting we investigate the impact of qualifier information on our queries.

Essa versão é diferente do baseline Triple-Only:

Triple-Only For the first experiment, we remove all qualifiers from the hyper-relational statements in the query graph leaving the base triples (h; r; t). Thus, we isolate the effect of qualifiers in answering the queries correctly. Note that in effect, this is similar to the MPQE approach, but with a better GNN and aggregation. To do this, we have to retain the same queries as in other experiments, but remove the qualifiers upon loading.
ResponderExcluir
Respostas
Veronica dos Santos14 de julho de 2022 às 09:06
Para o ICLR 2022 o paper pasosu por uma revisão aberta que está registrada aqui -> https://openreview.net/forum?id=4rLw09TgRw9
ResponderExcluir
Respostas
Veronica dos Santos14 de julho de 2022 às 09:10
Ainda preciso entender essa definição de Query hiper relacional para Question & Answering (Q&A) mas já é diferente da Query hiper relacional para Busca Exploratória pq o objetivo não é recuperar as entidades alvo e sim um conjunto de afirmações contextualizadas. Mas ambas são especificadas como BGP (basic graph pattern)

Definition 3.2 (Hyper-relational Query). Let V be a set of variable symbols, and TAR ∈ V a special variable denoting the target of the query. Let E+ = E ]V. Then, any subset Q of (E+ ×R×E+ ×Q) is a valid query if its induced graph 1) is a directed acyclic graph, 2) has a topological ordering in which all entities (in this context referred to as anchors) occur before all variables, and 3) TAR must be last in the topological orderings

The answers AG (Q) to the query Q are the entities e ∈ E that can be assigned to TAR, for which there exist a variable assignment for all other variables occurring in the query graph, such that the instantiated query graph is a subgraph of the complete graph G
ResponderExcluir
Respostas
Veronica dos Santos14 de julho de 2022 às 09:32
A definição de problema também é diferente pq o STARQE é para Q&A e quer recuperar entidades, aproximadas e ordenadas por essa proximidade, para responder a pergunta. Já na Busca Exploratória do KG a melhor resposta possível pode variar de uma resposta exata (resolvida com GQL), resposta possível (resolvida com GQL flexibilizada no contexto), resposta aproximada (resolvida com GQL sem o contexto da query ou com contexto implícito ... que poderia incluir uma tentativa de embeddings com um limite de corte ou se for possível EXPLICAR) ou até não resposta (se nenhuma afirmação corresponder a consulta)

Problem Definition. Given the incomplete KG G (part of the not observable complete KG ˆG) and a query Q. Rank all entities in G such that answers to the query, if it were asked in the context of the complete KG ˆG, are at the top of the ranking. Since the given KG is not complete, we cannot solve this problem directly as a graph matching problem as usually done in databases. Instead, we compute a latent representation of the query, such that it is close to the embeddings of entities which are the correct answers to the query.
ResponderExcluir
Respostas
Veronica dos Santos14 de julho de 2022 às 09:36
Dataset WD50K é do StarE

@inproceedings{StarE,
title={Message Passing for Hyper-Relational Knowledge Graphs},
author={Galkin, Mikhail and Trivedi, Priyansh and Maheshwari, Gaurav and Usbeck, Ricardo and Lehmann, Jens},
booktitle={EMNLP},
year={2020}
}
ResponderExcluir
Respostas

Adicionar comentário

Postar um comentário

Sinta-se a vontade para comentar. Críticas construtivas são sempre bem vindas.

Pesquisa de Doutorado da Veronica

Pesquisar este blog