Yueji Yang, Yuchen Li, Panagiotis Karras, and Anthony K. H. Tung. 2021. Context-aware Outstanding Fact Mining from Knowledge Graphs. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD '21). Association for Computing Machinery, New York, NY, USA, 2006–2016. https://doi.org/10.1145/3447548.3467272
ABSTRACT
An Outstanding Fact (OF) is an attribute that makes a target entity stand out from its peers.
[Diferenciar uma entidade das demais no sentido de identificar atributos e relações que são pouco comuns, que não correspondem ao padrão]
However, existing approaches to OF mining: (i) disregard the context in which the target entity appears, hence may report facts irrelevant to that context; and (ii) require relational data, which are often unavailable or incomplete in many application domains.
[Contexto é útil para esse tipo de tarefa mas o que é esse contexto?]
In this paper, we introduce the novel problem of mining Context-aware Outstanding Facts (COFs) for a target entity under a given context specified by a context entity.
Context-awareness is ensured by relying on the relevant relationships with the context entity to derive peer entities for COF extraction. Consequently, FMiner can effectively navigate the search to obtain context-aware OFs by incorporating a context entity.
We conduct extensive experiments, including a user study, to validate the efficiency and the effectiveness of FMiner.
[Será que os experimentos e o dataset podem ser reusados?]
INTRODUCTION
An OF is a true statement to the effect that an attribute of a target entity stands out in comparison to a group of peer entities. For instance, the news media often uses OFs as news lead to draft news stories, or to attract readers: “Kamala Harris is the first woman to serve as vice president” . This statement reveals an OF about the gender (attribute) of Harris (target entity) over all people (peer entities) who have served as United States vice president.
[Aquilo que diferencia das demais entidades semelhantes]
For example, given the news of Kamala Harris becoming vice president, the previous mentioned OF serves as a good fit. Nevertheless, existing approaches may generate an OF: “Kamala Harris is one of the only 0.15% Indian Americans educated at Howard University”. This would be an interesting OF in a context focusing on Harris’s education. However, in a political context involving Harris and her predecessors in office, such an education-related OF is less relevant, and may even cause disorientation.
[A relevância não é só o diferencial mas também deve ser algo dentro de um domínio de interesse]
The specification of a context entity allows users to guide the mining process, so as to derive the desired COFs.
[Seria a entidade que dá o contexto algo relacionado com o domínio?]
Following up on the previous example, Harris is the target entity and her predecessor in office, Mike Pence, the context entity.
[Indiretamente é do domínio mas o critério é uma entidade semelhante considerando a relação em comum. Como identificar qual é a entidade contexto?]
We only consider path patterns, rather than general graph patterns, because path patterns translate easily and intuitively to natural language claims.
[Ainda usa a estrutura do grafo mas a seleção da entidade de contexto semelhante fornece semântica a abordagem]
• We formalize the problem of mining Context-aware Outstanding Facts (COFs) from an open KG. To the best of our knowledge, this is the first work to study the mining of context-aware OFs from KGs, a task with important applications in Computational Journalism.
• We propose FMiner, a COF mining framework that discovers OFs by considering the path patterns between a target and a context entity, employing a novel relevance model to identify relevant patterns and a search algorithm with novel optimizations to mine OFs in real-time.
• In extensive experiments, including a user study, on queries from real-world news content, we demonstrate that FMiner discovers attention-seizing COFs from a large open KG with over a half billion edges in less than a second in most cases.
PROBLEM DEFINITION
Preliminaries
Given a target entity 𝑡 and a context entity 𝑐, we assume their corresponding nodes in the KG are known; such knowledge can be gained using the query service of the KG.
Definition 1 (Path Pattern). A path pattern refers to an ordered sequence of node variables and edge types. Let 𝑃 ( ̃𝑣0, 𝐸0, ̃𝑣1, . . . , ̃𝑣ℓ−1, 𝐸ℓ−1, ̃𝑣ℓ ), or 𝑃 ( ̃𝑣0, ̃𝑣ℓ ) for short, to denote an ℓ-hop path pattern. 𝐸𝑖 represents the type of the 𝑖𝑡ℎ edge where 𝑖 ∈ {0, . . . , ℓ − 1}.
Definition 2 (Matching Instance). A matching instance to a path pattern 𝑃 ( ̃𝑣0, ̃𝑣ℓ ) is a path instance in G(V, E, M), referred to as 𝑝 (𝑣0, 𝑒0, 𝑣1, . . . , 𝑣ℓ−1, 𝑒ℓ−1, 𝑣ℓ ) or 𝑝 (𝑣0, 𝑣ℓ ), such that the following conditions are satisfied simultaneously (i.e. 𝑝 is isomorphic to 𝑃).
(1) ∀𝑖 ∈ {0, . . . , ℓ }, M (𝑣𝑖 ) = ̃𝑣𝑖 , where ̃𝑣𝑖 is either the instance node itself 𝑣𝑖 or its type 𝑉𝑖 ;
(2) ∀𝑖 ∈ {0, . . . , ℓ − 1}, M (𝑒𝑖 ) = 𝐸𝑖 .
We use 𝑝 ⊲ 𝑃 to denote that 𝑝 is a matching instance of 𝑃.
Given a pair of target and context nodes ⟨𝑡, 𝑐⟩, we first identify a path instance 𝑝 (𝑡, 𝑐) connecting 𝑡 and 𝑐 in the KG. Then, we derive a context-anchored pattern 𝑃 which has 𝑝 (𝑡, 𝑐) as a matching instance and ends at the context node 𝑐 at the same time (Definition 3).
Definition 3 (Context-anchored Pattern). Given the target 𝑡 and the context 𝑐, a path pattern 𝑃 ( ̃𝑣0, ̃𝑣ℓ ) is called context-anchoredpattern, if and only if ̃𝑣ℓ = 𝑐 and ∃𝑝 (𝑡, 𝑐) ⊲ 𝑃 ( ̃𝑣0, ̃𝑣ℓ ). In particular, we denote a context-anchored pattern as 𝑃 ( ̃𝑣0, ̃𝑣ℓ = 𝑐) or simply 𝑃 ( ̃𝑣0, 𝑐).
Definition 4 (Context-aware Peer Entity). Given a pair 𝑡 and 𝑐, and a context-anchored pattern 𝑃 ( ̃𝑣0, ̃𝑣ℓ = 𝑐), we define 𝑣0 as a context-aware peer entity of 𝑡 if 𝑝 (𝑣0, 𝑐) ⊲ 𝑃 ( ̃𝑣0, ̃𝑣ℓ = 𝑐).
Pattern Relevance Model
Definition 5. (Node Proximity Measure). A node proximity measure 𝑆 (𝑣, 𝑢) returns a score that reflects the proximity of 𝑢 to 𝑣 in G(V, E), normalized so that Í𝑢 ∈V 𝑆 (𝑣, 𝑢) = 1.
[Remover casos irrelevantes recuperados. Pode ser usada outra regra no framework]
[Outras definições da abordagem acabam sendo mais sintáticas]
The COF Mining Problem
Definition 11 (Top-(k,l) COF Problem). Given the target and context ⟨𝑡, 𝑐⟩, and G(V, E), we find the top-𝑙 COFs with the highest strikingness scores from the top-𝑘 relevant context-anchored patterns.
In the first step, in order to identify context-anchored patterns, we need to first conduct path enumeration to find connecting paths between the target and the context nodes. Then, for each connecting path, we need to enumerate all possible patterns, and for each enumerated pattern, we further find all its matching instances in the graph for calculating the relevance score. Every procedure involved here is of exponential complexity. Even worse, the scale of today’s open KGs renders the problem more difficult. To mitigate the efficiency issue, we propose optimizations to quickly prune unpromising patterns and avoid any redundant computation to speed up the search.
ALGORITHMS AND OPTIMIZATIONS
[Não vale a pena detalhar pq não é comparável com a minha abordagem]
Discussion. There are many keyword search approaches [ 1 , 14 , 16 ] that involve graph traversal and path enumeration processes. However, our scenario differs in two aspects, rendering their technique not applicable. First, the keyword search approaches aim at finding a concise subtree, with one path from each keyword cluster (i.e., all nodes containing one keyword) to the root. Therefore, their techniques and optimizations mainly focus on quickly discovering the optimal answer subtree. To speed up the process, they build indexes [14 ], such as keyword-node lists and node-keyword map, to maintain shortest paths between keywords and nodes. Whereas, in our case, there are only two nodes under consideration. Those indexes do not bring any benefit since we need not consider the combinations of paths from different keyword clusters. Second, keyword search approaches only favor the shortest path between any two nodes [ 14 ]. Nonetheless, we need to evaluate the patterns derived from all connecting paths for top-𝑘 relevant pattern search.
[Comparação com a tarefa de keyword search em grafos em relação a otimização que a indexação traz]
EXPERIMENTS
Knowledge Graph and Query Pairs. We use Wikidata KG [5] with 39.9M nodes and 592.8M edges. We randomly collect 200 query pairs of co-occurring target and context entities from a public news corpus.
https://cs.nyu.edu/~kcho/DMQA/
Effectiveness Study
In this section, we conduct a user study to validate two aspects of FMiner: (i) the resulting COFs are relevant regarding the context entity; (ii) the resulting COFs are interesting facts regarding the target entity.
In the user study, we collect 10 pairs of entities that are well-known and relevant to each other. For each entity pair, we fix one entity as the target, the other as the context.
Then, two questions are asked: (1) Which fact is more relevant regarding the [context entity]? (2) Which fact sounds more interesting or striking regarding the [target entity]? For each of the two questions, we ask participants to select one out of three choices: Fact 1, Fact 2 and Neither. We convert OFs into natural languages for ease of understanding of the participants. We have collected the results from 20 students who are unaware of our research.
[Converter em linguagem natural]
Related Works
Entity Relationship Discovery on Graphs. One popular method of relationship discovery is keyword search on graphs, to mention a few [1, 14, 16, 17, 33, 34]. These methods return concise subgraphs that connect multiple input keywords. However, these approaches do not consider any patterns which are required for generating OF. In addition, [ 32 ] proposes to find tree patterns and output table answers. Furthermore, REX [ 7] proposes to find path patterns that connect two entities. It focuses on combining the path patterns to form a graph pattern that explains the relationship between the entity pair. Neither of the two works [7 , 32] consider context relevance ranking, peer entity finding nor OF extraction.
Fonte -> https://youtu.be/0OCq31jQ9E4 A WordNet do Brasil -> http://www.nilc.icmc.usp.br/wordnetbr/ NLTK synsets = dada uma palavra acha todos os significados, pode informar a língua e a classe gramatical da palavra (substantivo, verbo, advérbio) from nltk.corpus import wordnet as wn wordnet.synset(xxxxxx).definition() = descrição do significado É possível extrair hipernimia, hiponimia, antonimos e os lemas (diferentes palavras/expressões com o mesmo significado) formando uma REDE LEXICAL. Com isso é possível calcular a distância entre 2 synset dentro do grafo. Veja trecho de código abaixo: texto = 'útil' print('NOUN:', wordnet.synsets(texto, lang='por', pos=wordnet.NOUN)) texto = 'útil' print('ADJ:', wordnet.synsets(texto, lang='por', pos=wordnet.ADJ)) print(wordnet.synset('handy.s.01').definition()) texto = 'computador' for synset in wn.synsets(texto, lang='por', pos=wn.NOUN): print('DEF:',s
Comentários
Postar um comentário
Sinta-se a vontade para comentar. Críticas construtivas são sempre bem vindas.