
Weakly-supervised Contextualization of Knowledge Graph Facts - 2018 - SIGIR

Nikos Voskarides, Edgar Meij, Ridho Reinanda, Abhinav Khaitan, Miles Osborne, Giorgio Stefanoni, Prabhanjan Kambadur, and Maarten de Rijke. 2018. Weakly-supervised Contextualization of Knowledge Graph Facts. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). Association for Computing Machinery, New York, NY, USA, 765–774. https://doi.org/10.1145/3209978.3210031

ABSTRACT
...
When presenting a KG fact to the user, providing other facts that are pertinent to that main fact can enrich the user experience and support exploratory information needs. KG fact contextualization is the task of augmenting a given KG fact with additional and useful KG facts. The task is challenging because of the large size of KGs; discovering other relevant facts even in a small neighborhood of the given fact results in an enormous amount of candidates.

[Context here means neighboring facts that are relevant to the "central" fact]

We introduce a neural fact contextualization method (NFCM) to address the KG fact contextualization task. NFCM first generates a set of candidate facts in the neighborhood of a given fact and then ranks the candidate facts using a supervised learning to rank model.

[Rank the neighboring facts to identify the most relevant ones. Fact contextualization is framed as a task]

Evaluation using human assessors shows that it significantly outperforms several competitive baselines

[Evaluation with human assessors]

1 INTRODUCTION

Knowledge graphs (KGs) have become essential for applications such as search, query understanding, recommendation and question answering because they provide a unified view of real-world entities and the facts (i.e., relationships) that hold between them [6 , 7, 22, 34 ].

[Facts are just the relationships between entities, or between entities and concepts]

Previous work has focused on augmenting entity cards with facts that are centered around, i.e., one-hop away from, the main entity of the query [17].

[Query star-join]

..., we can exploit the richness of the KG by providing query-specific additional facts that increase the user’s understanding of the fact as a whole, and that are not necessarily centered around only one of the entities.

[Context beyond the qualifiers]

Query-specific relevant facts can also be used in other applications to enrich the user experience.

[In exploratory search]

In this paper, we address the task of KG fact contextualization, that is, given a KG fact that consists of two entities and a relation that connects them, retrieve additional facts from the KG that are relevant to that fact.

[Couldn't additional facts cause information overload?]

We propose a neural fact contextualization method (NFCM), a method that first generates a set of candidate facts that are part of {1,2}-hop paths from the entities of the main fact. NFCM then ranks the candidate facts by how relevant they are for contextualizing the main fact.

[The ranking function is learned with a supervised method and then applied to the triples resulting from star-joins over the two entities involved. The star-joins follow the patterns V->?u->?w or ?w->?u->V]

We estimate our learning to rank model using supervised data. The ranking model combines (i) features we automatically learn from data and (ii) those that represent the query-candidate facts with a set of hand-crafted features we devised or adjusted for this task.

[Features are both learned from data and manually crafted/adjusted]


2 PROBLEM STATEMENT


2.1 Preliminaries
 
Let E = En ∪ Ec be a set of entities, where En and Ec are disjoint sets of non-CVT and CVT entities, respectively.
 
[Compound Value Type (CVT) entities are special entities frequently used in KGs such as Freebase and Wikidata to model fact attributes. In other words, CVTs are the blank nodes of reification.]
 
Furthermore, let P be a set of predicates. A knowledge graph K is a set of triples ⟨s, p, o⟩, where s, o ∈ E and p ∈ P. By viewing each triple in K as a labelled directed edge, we can interpret K as a labelled directed graph. We use Freebase as our knowledge graph [8, 24].
 
[It does not distinguish property from relation, but expansion only uses the vertices, not the literals]
 
We define a fact as a path in K that either: (i) consists of 1 triple ⟨s0, p0, t0⟩, with s0 ∈ E and t0 ∈ En (i.e., s0 may be a CVT entity), or (ii) consists of 2 triples ⟨s0, p0, t0⟩ and ⟨s1, p1, t1⟩, with s0, t1 ∈ En and t0 = s1 ∈ Ec (i.e., t0 = s1 must be a CVT entity). A fact of type (i) can be an attribute of a fact of type (ii), iff they have a common CVT entity (see Figure 2 for an example).
 
[A fact is treated as more than one triple when there is reification]
 
Let R be a set of relationships, where a relationship r ∈ R is a label for a set of facts that share the same predicates but differ in at least one entity. For example, spouseOf is the label of the fact depicted in the top part of Figure 2 and consists of two triples. Our definition of a relationship corresponds to direct relationships between entities, i.e., one-hop paths or two-hop paths through a CVT entity. For the remainder of this paper, we refer to a specific fact f as r(s, t), where r ∈ R and s, t ∈ E.
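To make the fact definition concrete, here is a minimal Python sketch of how type (i) and type (ii) facts could be represented as triples. The entity and predicate names (e.g., "m.marriage01", "spouses") are simplified placeholders, not actual Freebase identifiers.

```python
# Sketch only: facts as lists of (subject, predicate, object) triples.

# Type (i) fact: a single triple whose object is a non-CVT entity.
birthplace_fact = [("BarackObama", "place_of_birth", "Honolulu")]

# Type (ii) fact (e.g., spouseOf): two triples joined through a CVT node.
spouse_fact = [
    ("BarackObama", "spouses", "m.marriage01"),    # non-CVT -> CVT
    ("m.marriage01", "spouse", "MichelleObama"),   # CVT -> non-CVT
]

# A type (i) fact is an attribute of the type (ii) fact above
# because it shares the same CVT node.
ceremony_location = [("m.marriage01", "location_of_ceremony", "Chicago")]
```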
 
2.2 Task definition
 
Given a query fact fq and a KG K, we aim to find a set of other, relevant facts from K. Specifically, we want to enumerate and rank a set of candidate facts F = { fc : fc ∈ K, fc ≠ fq } based on their relevance to fq.
 
[Not every triple connected to one of the entities will be considered relevant]
 
3 METHOD
 
In this section we describe our proposed neural fact contextualization method (NFCM), which works in two steps. First, given a query fact fq, we enumerate a set of candidate facts F = { fc : fc ∈ K }. Second, we rank the facts in F by relevance to fq to obtain a final ranked list using a supervised learning to rank model.
 
3.1 Enumerating KG facts
 
.. we limit F to the set of facts that are in the broader neighborhood of the two entities s and t. Intuitively, facts that are further away from the two entities of the query fact are less likely to be relevant.
.... [exclusions:] (i) CVT entities are not counted as hops, (ii) we do not include fq in F as it is trivial, and (iii) to reduce the search space, we do not expand intermediate neighbors that represent an entity class or a type (e.g., "actor"), as these can have millions of neighbors.
 
[Indeed, type/class nodes can generate a huge number of triples]
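A rough sketch of the enumeration step, assuming an in-memory list of (s, p, o) tuples and a known set of class/type entities. The CVT-specific rule (CVT nodes do not count as hops) is omitted for brevity, and all names are illustrative rather than the paper's actual implementation.

```python
from collections import defaultdict

def enumerate_candidates(triples, query_entities, query_triples, class_entities=frozenset()):
    """Collect candidate triples from the {1,2}-hop neighborhood of the query fact's entities."""
    by_entity = defaultdict(list)
    for t in triples:
        s, _, o = t
        by_entity[s].append(t)
        by_entity[o].append(t)

    candidates, frontier = set(), set(query_entities)
    for _ in range(2):                              # expand up to 2 hops
        next_frontier = set()
        for e in frontier:
            for t in by_entity[e]:
                if t in query_triples:              # exclusion (ii): skip the query fact itself
                    continue
                candidates.add(t)
                s, _, o = t
                for n in (s, o):
                    if n not in class_entities:     # exclusion (iii): do not expand class/type nodes
                        next_frontier.add(n)
        frontier = next_frontier
    return candidates
```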

3.2 Fact ranking
 
For each candidate fact fc ∈ F, we create a pair (fq, fc) ... and score it using a function u : (fq, fc) → [0, 1] ⊂ R (higher values indicate higher relevance). We then obtain a ranked list of facts by sorting the facts in F based on their score.

[A score between 0 and 1 used for ranking]

Learning procedure. We train a network that learns the scoring function u(fq, fc) end-to-end in mini-batches using stochastic gradient descent... We optimize the model parameters using Adam [19]. During training we minimize a pairwise loss to learn the function u, while during inference we use the learned function u to score a query-candidate fact pair (fq, fc). ....
 
[Machine learning details]
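The excerpt does not specify which pairwise loss is used; a common choice consistent with the description (and minimizable with Adam) is a logistic pairwise loss over pairs where one candidate is more relevant than the other. A minimal PyTorch-style sketch, purely illustrative:

```python
import torch

def pairwise_logistic_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """Pushes u(fq, fc+) above u(fq, fc-) for candidate pairs with different relevance labels.
    Both arguments are outputs of the scoring network u, shape (batch,)."""
    return -torch.log(torch.sigmoid(score_pos - score_neg)).mean()

# At inference time the learned u scores each (fq, fc) pair and the candidates
# are simply sorted by score to produce the ranked list.
```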
 
Network architecture. Figure 4 shows the network architecture we designed for learning the scoring function u(fq, fc). We encode the query fact fq in a vector vq using an RNN. ...

Finally, MLP-o([vq , va , x]) is a multi-layer perceptron with α hidden layers of dimension β and one output layer that outputs u(fq , fc ). We use a ReLU activation function in the hidden layers and a sigmoid activation function in the output layer. We vary the number of layers to capture non-linear interactions between the features in vq , va , and x.
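A minimal sketch of the output MLP described above (α ReLU hidden layers of dimension β, sigmoid output), assuming vq, va and the hand-crafted feature vector x are already computed; the default layer sizes are placeholders, not the paper's tuned values.

```python
import torch
from torch import nn

class MLPOut(nn.Module):
    def __init__(self, input_dim: int, alpha: int = 2, beta: int = 256):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(alpha):                       # alpha hidden layers of dimension beta
            layers += [nn.Linear(dim, beta), nn.ReLU()]
            dim = beta
        layers += [nn.Linear(dim, 1), nn.Sigmoid()]  # u(fq, fc) in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, v_q, v_a, x):
        # concatenate query encoding, candidate encoding, and hand-crafted features
        return self.net(torch.cat([v_q, v_a, x], dim=-1)).squeeze(-1)
```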
 
Examples of the hand-crafted features in x, computed from KG statistics:

PredFreq(p) = |TriplesPred(p)| / NumTriples
EntFreq(e) = |TriplesEnt(e)| / NumTriples
PFout(p, e) = |TriplesSubj(e) ∩ TriplesPred(p)| / |TriplesSubj(e)|
PFin(p, e) = |TriplesObj(e) ∩ TriplesPred(p)| / |TriplesObj(e)|
EntTypeSim(e1, e2) = JaccardSim(Types(e1), Types(e2))
 
[Statistics used as features to calibrate the ranking function]
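These statistics are straightforward to compute from the triple set. A small sketch, assuming triples is a list of (s, p, o) tuples and Types(e) is available as a dictionary; this only illustrates the formulas above, not the authors' code.

```python
from collections import defaultdict

def build_indexes(triples):
    by_pred, by_subj, by_obj = defaultdict(set), defaultdict(set), defaultdict(set)
    for i, (s, p, o) in enumerate(triples):
        by_pred[p].add(i); by_subj[s].add(i); by_obj[o].add(i)
    return by_pred, by_subj, by_obj, len(triples)

def graph_features(p, e, by_pred, by_subj, by_obj, n):
    pred_freq = len(by_pred[p]) / n                                                 # PredFreq(p)
    ent_freq  = len(by_subj[e] | by_obj[e]) / n                                     # EntFreq(e)
    pf_out = len(by_subj[e] & by_pred[p]) / len(by_subj[e]) if by_subj[e] else 0.0  # PFout(p, e)
    pf_in  = len(by_obj[e]  & by_pred[p]) / len(by_obj[e])  if by_obj[e]  else 0.0  # PFin(p, e)
    return pred_freq, ent_freq, pf_out, pf_in

def ent_type_sim(e1, e2, types):
    a, b = set(types.get(e1, ())), set(types.get(e2, ()))                           # EntTypeSim(e1, e2)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```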
 
4 EXPERIMENTAL SETUP
 
In this section we describe the setup of our experiments that aim to answer the following research questions:
 
[The questions are about the experiments, not just the research problem]
 
4.2 Dataset
 
Our dataset consists of query facts, candidate facts, and a relevance label for each query-candidate fact pair.
 
[The query fact, all neighboring facts up to distance 2, and a relevance label on a 3-level scale]
 
4.3 Gathering noisy relevance labels
 
Gathering relevance labels for our task is challenging due to the size and heterogeneous nature of KGs, i.e., having a large number of facts and relationship types. Therefore, we turn to distant supervision [23] to gather relevance labels at scale. We choose to get a supervision signal from Wikipedia for the following reasons: (i) it has a high overlap of entities with the KG we use, and (ii) facts that are in KGs are usually expressed in Wikipedia articles alongside other, related facts. 
 
[Part of the ground truth was extracted from Wikipedia]
 
4.4 Manually curated evaluation dataset
 
In order to evaluate the performance of NFCM on the KG fact contextualization task, we perform crowdsourcing to collect a human-curated evaluation dataset.
 
We use the CrowdFlower platform, and ask the annotators to judge a candidate fact w.r.t. its relevance to a query fact. We provide the annotators with the following scenario
 
We ask the annotators to assess the relevance of a candidate fact in a 3-graded scale:
very relevant: I would include the candidate fact in the description of the query fact; the candidate fact provides additional context to the query fact.
somewhat relevant: I would include the candidate fact in the description of the query fact, but only if there is space.
irrelevant: I would not include the candidate fact in the description of the query fact
 
Each query-candidate fact pair is annotated by three annotators. We use majority voting to obtain the gold labels, breaking ties arbitrarily. The annotators get a payment of 0.03 dollars per query-candidate fact pair.
 
[Human annotators via crowdsourcing]
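A minimal sketch of the label aggregation described above (majority vote over the three graded judgments, ties broken arbitrarily):

```python
import random
from collections import Counter

def gold_label(judgments):
    """judgments: e.g. ["very relevant", "somewhat relevant", "somewhat relevant"]."""
    counts = Counter(judgments)
    best = max(counts.values())
    return random.choice([label for label, c in counts.items() if c == best])
```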
 
4.5 Heuristic baselines
 
To the best of our knowledge, there is no previously published method that addresses the task introduced in this paper. Therefore, we devise a set of intuitive baselines that are used to showcase
that our task is not trivial. 
 
[The authors created the baselines themselves using heuristics. Interesting, but potentially biased ....]
 
4.6 Implementation details
 
The models described in Section 3.2 are implemented in TensorFlow v.1.4.1 [1]. Table 5 lists the hyperparameters of NFCM. We tune the variable hyperparameters of this table on the validation set and optimize for NDCG@5.
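For reference, NDCG@5 (the metric optimized during hyperparameter tuning) can be computed as below; using the plain graded label as the gain is an assumption, since the excerpt does not state which gain variant the authors use.

```python
import math

def dcg_at_k(gains, k=5):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=5):
    """ranked_gains: graded relevance labels (e.g., 0, 1, 2) in the order produced by the ranker."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```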
 
5 RESULTS AND DISCUSSION
 
In our first experiment, we compare NFCM to a set of heuristic baselines we derived to answer RQ1. ... We conclude that the task we define in this paper is not trivial to solve and simple heuristic functions are not sufficient.
 
In our second experiment we compare NFCM with distant supervision and aim to answer RQ2. ... conclude that learning ranking functions (and in particular NFCM) based on the signal gathered from distant supervision is beneficial for this task. 
 
[And likewise for the other two research questions ...]

6 RELATED WORK

[The task of retrieving facts for contextualization is claimed to be novel and is interesting for exploratory search. And it is based only on the entities of the triple.]

The specific task we introduce in this paper has not been addressed before, but there is related work in three main areas: entity relationship explanation, distant supervision, and fact ranking.

6.1 Relationship Explanation

Explanations for relationships between pairs of entities can be provided in two ways: structurally, i.e., by providing paths or sub-graphs in a KG containing the entities, or textually, by ranking or generating text snippets that explain the connection.

[Paths would be the most suitable option for exploratory search over the graph]

6.3 Fact Ranking

In fact ranking, the goal is to rank a set of attributes with respect to an entity. Hasibi et al. [17] consider fact ranking as a component for entity summarization for entity cards. They approach fact ranking as a learning to rank problem. They learn a ranking model based on importance, relevance, and other features relating a query and the facts.

Graph matching involves matching two graphs and discovering the patterns of relationships between them to infer their similarity [11]. Although our task can be considered as comparing a small query subgraph (i.e., query triples) and a knowledge graph, the goal is different from graph matching, which mainly concerns aligning two graphs rather than enhancing one query graph.

[The approach expands the result with more triples, but selects the triples based on the computed relevance]

Our work differs from the work discussed above in the following major ways. First, we enrich a query fact between two entities by providing relevant additional facts in the context of the query fact, taking into account both the entities and the relation of the query fact. Second, we rank whole facts from the KG instead of just entities.

@inproceedings{10.1145/3209978.3210031,
author = {Voskarides, Nikos and Meij, Edgar and Reinanda, Ridho and Khaitan, Abhinav and Osborne, Miles and Stefanoni, Giorgio and Kambadur, Prabhanjan and de Rijke, Maarten},
title = {Weakly-Supervised Contextualization of Knowledge Graph Facts},
year = {2018},
isbn = {9781450356572},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3209978.3210031},
doi = {10.1145/3209978.3210031},
booktitle = {The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
pages = {765–774},
numpages = {10},
keywords = {distant supervision, fact contextualization, knowledge graphs},
location = {Ann Arbor, MI, USA},
series = {SIGIR '18}
}
 
 

 
