
Fast and Exact Rule Mining with AMIE 3 - Paper Reading - Year 2020

Abstract. ... Due to the exponential search space, rule mining approaches still have difficulties to scale to today’s large KBs. In this paper, we present AMIE 3, a system that employs a number of sophisticated pruning strategies and optimizations. ... able to compute the exact confidence and support of each rule. Our experiments on DBpedia, YAGO, and Wikidata show that AMIE 3 beats the state of the art by a factor of more than 15 in terms of runtime.

[The problem here is the performance of rule mining, given the huge search space in "large" KGs. Pruning and optimization strategies that still allow computing the exact confidence and support of each generated rule]

[AMIE's effectiveness would be described in the other papers; here the focus is efficiency]

1 Introduction

Rule mining is the task of automatically finding logical rules in a given KB....Such rules usually come with confidence scores that express to what degree a rule holds. The rules can serve several purposes: First, they serve to complete the KB... Second, they can serve to debug the KB. Finally, rules are useful in downstream applications such as fact prediction [5,12,14,18], data and ontology alignment [7,10], fact checking [2], and error detection [1].

[Completing the KG by inferring new assertions, a task also performed by ML-based Link Prediction. Debugging the KG in the sense of finding potential errors.]

[Alternatives to the performance problem on "large" KGs would be to mine only a sample, or to generate just the rules needed to produce predictions rather than to cover negations]

2 Related Work

First Generation Rule Mining. Inductive Logic Programming (ILP) is the task of learning rules from positive and negative examples ... Hence, they are generally unsuitable for today’s KBs for two reasons: (i) they were not designed to scale to millions of facts, and (ii) they do not account for the Open World Assumption (OWA) made by current KBs.

[They do not apply to KGs: volume and the OWA]

Second Generation Rule Mining. AMIE (and its successor AMIE+) ... The Ontological Pathfinding method (OP) [3,4] resorts to a highly concurrent architecture based on Spark ...

[They have efficiency problems, mainly in computing the support and confidence of the rules]

partial completeness assumption (PCA) ... RudiK [15] is a recent rule mining method that applies the PCA to generate explicit counter-examples that are semantically related. ... Thus, differently from exhaustive rule mining approaches [3,8,9,11], Rudik aims to find rules that make good predictions, not all rules above a given confidence threshold. This non-exhaustivity endows RudiK with comparable performance to AMIE+ and OP.

[It does not generate all possible rules; it uses the PCA to produce counter-examples. It does not explain what it does to identify the errors inside the KB, or whether this affects the support and confidence of the rules]

3 Preliminaries

We model a knowledge base (KB) K as a set of assertions r(s, o), also called facts, with a subject s, a relation r and an object o.

[Binary relations; qualifiers and the multi-layer model are not considered. There are already ML approaches for Link Prediction that take qualifiers into account]

Atoms and Rules. An atom is an expression of the form r(X,Y), where r is a relation and X, Y are either constants or variables. From now on, we denote variables by lowercase letters, whereas constants (entities) are always capitalized. An atom is instantiated if at least one of its arguments is a constant, as in livesIn(x, Berlin). If both arguments are constants, the atom is grounded and it is tantamount to a fact.

[Instantiated atoms have the subject or the object as a constant; they are facts if both are constants]

We define the operator var(A) so that it returns the set of variables of an atom A.

[Set of variables of an atom A]

A (conjunctive) query is a conjunction of atoms: B1 ∧ ... ∧ Bn.

A substitution σ is a partial mapping from variables to constants. Substitutions can be straightforwardly extended to atoms and conjunctions. A result of a query B1 ∧ ... ∧ Bn on a KB K is a substitution σ that (i) maps all variables and (ii) that entails σ(Bi) ∈ K ∀i ∈ {1, ..., n}.

[When the query is executed and facts are returned, all variables become constants]
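[To make these definitions concrete, here is a minimal sketch (not AMIE's implementation; all names are illustrative) of a KB as a set of triples and a naive evaluator that enumerates the substitutions answering a conjunctive query:]

```python
# Minimal sketch: a KB as a set of (relation, subject, object) triples and a
# backtracking evaluator for conjunctive queries. Illustrative only.
KB = {
    ("livesIn", "Alice", "Berlin"),
    ("livesIn", "Bob", "Paris"),
    ("wasBornIn", "Alice", "Berlin"),
}

def is_var(term):
    return term[0].islower()              # convention: lowercase = variable

def match(atom, fact, subst):
    """Try to extend the substitution `subst` so that subst(atom) == fact."""
    (r, s, o), (fr, fs, fo) = atom, fact
    if r != fr:
        return None
    new = dict(subst)
    for term, value in ((s, fs), (o, fo)):
        if is_var(term):
            if new.get(term, value) != value:
                return None
            new[term] = value
        elif term != value:
            return None
    return new

def answers(query, subst=None):
    """Yield every substitution that maps all variables and sends each atom into KB."""
    subst = subst or {}
    if not query:
        yield subst
        return
    first, rest = query[0], query[1:]
    for fact in KB:
        extended = match(first, fact, subst)
        if extended is not None:
            yield from answers(rest, extended)

# Query: livesIn(x, y) AND wasBornIn(x, y)
print(list(answers([("livesIn", "x", "y"), ("wasBornIn", "x", "y")])))
# -> [{'x': 'Alice', 'y': 'Berlin'}]
```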

A (Horn) rule is a formula of the form B ⇒ H, where the B is a query of body atoms B1, ...,Bn, and H is the head atom.

[The query of body atoms is the body B, and the conclusion of the rule is in the head atom H]

[Rules and Atoms]

Predictions. Given a rule R = B1 ∧ ... ∧ Bn ⇒ H and a substitution σ, we call σ(R) an instantiation of R. If σ(Bi) ∈ K ∀i ∈ {1, ..., n}, we call σ(H) a prediction of R from K, and we write K∧R |= σ(H). If σ(H) ∈ K, we call σ(H) a true prediction.

[A true prediction is one whose atom is already found in the KG]
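[Building on the sketch above: a prediction is the head atom under a substitution that satisfies the body, and it is a true prediction if it is already in the KB. Again only an illustrative sketch:]

```python
# Sketch, reusing KB and answers() from the previous block.
def predictions(body, head):
    """Yield (predicted_fact, is_true) for every substitution satisfying the body."""
    r, s, o = head
    for sigma in answers(body):
        fact = (r, sigma.get(s, s), sigma.get(o, o))    # apply sigma to the head atom
        yield fact, fact in KB                          # true prediction if already in K

# Rule: wasBornIn(x, y) => livesIn(x, y)
for fact, is_true in predictions([("wasBornIn", "x", "y")], ("livesIn", "x", "y")):
    print(fact, "(true prediction)" if is_true else "(prediction only)")
# ('livesIn', 'Alice', 'Berlin') (true prediction)
```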

A false prediction of a rule is a prediction of a counter-example of the rule. There are different approaches to define these counter-examples: Under the Closed World Assumption (CWA), any assertion that is not in the KB is considered a counter-example.

However, KBs are usually incomplete, and thus the CWA penalizes rules that predict new facts. Under the Open World Assumption (OWA), facts that are not in the KB are not necessarily wrong, and hence there are no counter-examples. This entails that a rule mining algorithm will report arbitrary rules as long as these rules make enough true predictions (such as “All people play the violin”). Therefore, AMIE [8] has proposed the Partial Completeness Assumption (PCA): If we have r(s,o) in the KB K, and if fun(r) ≥ fun(r⁻), then we assume that all r(s, o’) ∉ K do not hold in the real world. If fun(r) < fun(r⁻), then the PCA says that all r(s’, o) ∉ K do not hold in the real world. These assertions can thus serve as counter-examples. There are a number of other approaches to generate counter-examples in the literature [18].

[PCA to deal with the OWA and generate counter-examples]
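[A hedged sketch of how PCA counter-examples could be enumerated for a rule B ⇒ r(x, y), for the case fun(r) ≥ fun(r⁻): a prediction r(s, o) that is not in the KB counts as a counter-example whenever the KB already knows some r(s, o'). This follows the definition above, not AMIE's actual code:]

```python
# Sketch (case fun(r) >= fun(r^-)): PCA counter-examples of a rule body => r(x, y).
# Reuses KB and predictions() from the blocks above.
def pca_counter_examples(body, head):
    r, _, _ = head
    counter = []
    for fact, is_true in predictions(body, head):
        _, s, o = fact
        known_objects = {fo for (fr, fs, fo) in KB if fr == r and fs == s}
        # PCA: the KB already knows some object for (r, s), but not this o.
        if not is_true and known_objects:
            counter.append(fact)
    return counter
```

[With the toy KB above, the only prediction of wasBornIn(x, y) ⇒ livesIn(x, y) is already in K, so no counter-example is produced; adding wasBornIn(Bob, Lyon) would turn livesIn(Bob, Lyon) into a PCA counter-example, since the KB already states livesIn(Bob, Paris).]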

Support and Confidence. The support of a rule R in a KB K is the number of true predictions p (of the form r(X, Y )) that the rule makes in the KB ... The confidence of a rule R in a KB K is the proportion of true predictions out of the true predictions and false predictions...  If the counter-examples are chosen by the PCA, we refer to the confidence as the PCA confidence and denote it by pca-conf (analogously for the CWA).

[Definition of Support and Confidence for the rules]
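[For reference, the standard AMIE-style formulas for these metrics, reconstructed from the textual definitions above (my reading of the AMIE papers; the minHC threshold in Section 4.1 refers to the head coverage hc):]

```latex
% AMIE-style metrics for a rule B => r(x,y), reconstructed from the textual
% definitions above; z_1,...,z_m are the body variables other than x and y,
% and the PCA denominator is shown for the case fun(r) >= fun(r^-).
\mathrm{supp}(\mathbf{B} \Rightarrow r(x,y)) = \#(x,y) :\ \exists z_1,\dots,z_m :\ \mathbf{B} \wedge r(x,y)

\mathrm{hc}(\mathbf{B} \Rightarrow r(x,y)) =
  \frac{\mathrm{supp}(\mathbf{B} \Rightarrow r(x,y))}{\#(x',y') :\ r(x',y') \in \mathcal{K}}

\mathrm{conf}_{\mathrm{cwa}}(\mathbf{B} \Rightarrow r(x,y)) =
  \frac{\mathrm{supp}(\mathbf{B} \Rightarrow r(x,y))}{\#(x,y) :\ \exists z_1,\dots,z_m :\ \mathbf{B}}

\mathrm{conf}_{\mathrm{pca}}(\mathbf{B} \Rightarrow r(x,y)) =
  \frac{\mathrm{supp}(\mathbf{B} \Rightarrow r(x,y))}{\#(x,y) :\ \exists z_1,\dots,z_m, y' :\ \mathbf{B} \wedge r(x,y')}
```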

In general, the support of the rule quantifies its relevance, whereas the confidence quantifies its accuracy. Rule mining is the task of finding all rules in a KB that fulfill certain confidence and support thresholds. It is a relaxation of inductive logic programming (ILP), in the sense that it finds also rules that predict some limited number of counter-examples (see [18] for a discussion).

4 AMIE 3

4.1 The AMIE Approach

The AMIE algorithm [8,9] is a method to mine closed Horn rules on large KBs. AMIE (Algorithm 1) takes as input a knowledge base K, and thresholds l for the maximal number of atoms per rule, minHC for the minimum head coverage, and minC for the minimum PCA confidence. AMIE uses a classical breadth-first search.

[The original algorithm does not scale because it is a BFS]
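[A drastically simplified toy version of the breadth-first loop described above, to illustrate the search-and-prune structure only: here rule bodies are just conjunctions of atoms over the two head variables (x, y), whereas AMIE's real refinement operators also add dangling, instantiated and closing atoms with fresh variables. It reuses the sketches above; none of this is AMIE's actual code:]

```python
from collections import deque

# Toy breadth-first rule mining over the sketch KB: seed one rule per head
# relation, output closed rules above the confidence threshold, and refine
# rules shorter than l, pruning refinements below the head-coverage threshold.
def relations():
    return sorted({r for (r, _, _) in KB})

def support(body, head):
    return sum(1 for _, is_true in predictions(body, head) if is_true)

def head_coverage(body, head):
    head_size = sum(1 for (r, _, _) in KB if r == head[0])
    return support(body, head) / head_size if head_size else 0.0

def pca_confidence(body, head):
    s = support(body, head)
    denom = s + len(pca_counter_examples(body, head))
    return s / denom if denom else 0.0

def mine(l=3, min_hc=0.01, min_c=0.1):
    output = []
    queue = deque(((), (h, "x", "y")) for h in relations())    # BFS seeds: empty bodies
    while queue:
        body, head = queue.popleft()
        if body and pca_confidence(body, head) >= min_c:
            output.append((body, head))
        if len(body) + 1 < l:                                  # refine rules shorter than l
            for r in relations():
                if r == head[0]:
                    continue                                   # skip the trivial rule r => r
                refined = body + ((r, "x", "y"),)
                if head_coverage(refined, head) >= min_hc:     # prune by head coverage
                    queue.append((refined, head))
    return output

print(mine(l=2))
# -> [((('wasBornIn', 'x', 'y'),), ('livesIn', 'x', 'y')),
#     ((('livesIn', 'x', 'y'),), ('wasBornIn', 'x', 'y'))]
```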

4.2 AMIE 3

[Optimization of the algorithm]

[Optimization of the storage / memory]

Integer-Based In-Memory Database. AMIE uses an in-memory database to store the entire KB. Each fact is indexed by subject, by object, by relation, and by pairs of relation/subject and relation/object. In order to be able to load also large KBs into memory, AMIE compresses strings into custom-made ByteStrings, where each character takes only 8 bits.

[The optimization is not only in the logic but also in the data structures being manipulated]
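[A sketch of the kind of multi-indexed, integer-encoded fact store described above. AMIE itself is a Java implementation using custom ByteStrings; the structure below is only illustrative:]

```python
from collections import defaultdict

# Illustrative sketch of an integer-encoded fact store indexed by subject,
# object, relation, and the relation/subject and relation/object pairs.
class FactStore:
    def __init__(self):
        self.ids = {}                                  # string -> integer id
        self.by_subject = defaultdict(set)             # s -> {(r, o)}
        self.by_object = defaultdict(set)              # o -> {(r, s)}
        self.by_relation = defaultdict(set)            # r -> {(s, o)}
        self.by_relation_subject = defaultdict(set)    # (r, s) -> {o}
        self.by_relation_object = defaultdict(set)     # (r, o) -> {s}

    def encode(self, term):
        # map each string to a small integer the first time it is seen
        return self.ids.setdefault(term, len(self.ids))

    def add(self, relation, subject, obj):
        r, s, o = self.encode(relation), self.encode(subject), self.encode(obj)
        self.by_subject[s].add((r, o))
        self.by_object[o].add((r, s))
        self.by_relation[r].add((s, o))
        self.by_relation_subject[(r, s)].add(o)
        self.by_relation_object[(r, o)].add(s)

    def size(self, relation):
        return len(self.by_relation[self.ids[relation]])
```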

4.3 Quality Metrics

AMIE is a generic exhaustive rule miner, and thus its output consists of rules. These rules can serve as input to other applications, for example, to approaches that predict facts [5,15]. Such downstream applications may require different quality metrics.

[It generates ALL the rules that pass the cutoff parameters]

5 Experiments

5.1 Experimental Setup

[Wikidata dump from July 2019; it must be the truthy dump, since qualifiers are not used. Could deprecated statements be used as counter-examples?]

5.2 Effect of Our Optimizations

Laziness. As explained in Sect. 4.2, AMIE can invest a lot of time in calculating the PCA confidence of low-confident rules. The lazy evaluation targets exactly this problem. ... We observe that the parallel calculation of the overlap tables reduces drastically the contribution of this phase to the total runtime when compared to AMIE+—

[Computed lazily (a posteriori) and in parallel]
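[A sketch of the lazy-evaluation idea as I read it: stop enumerating the PCA-confidence denominator as soon as the rule can no longer reach the minimum confidence threshold. The function below is an illustration of that bound, not AMIE's API:]

```python
def passes_confidence_lazily(support, body_examples, min_c):
    """Return True iff support / #body_examples >= min_c, but stop enumerating
    body_examples as soon as the answer is certainly False."""
    bound = support / min_c              # largest denominator that can still pass
    denominator = 0
    for _ in body_examples:              # e.g. a generator over the PCA denominator pairs
        denominator += 1
        if denominator > bound:
            return False                 # confidence is certainly below min_c; stop early
    return True

# Toy usage: support 5, threshold 0.1 -> gives up after 51 denominator examples
print(passes_confidence_lazily(5, iter(range(10_000)), 0.1))   # False
print(passes_confidence_lazily(5, iter(range(20)), 0.1))       # True (5/20 = 0.25)
```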

5.3 Comparative Experiments

Results
It is not easy to compare the performance of OP, AMIE 3, and Rudik, because the systems serve different purposes, have different prerequisites, and mine different rules. Therefore, we ran all systems in their default configurations, and discuss the results (Table 6) qualitatively in detail.

The large search space is even more critical for OP on DBpedia 3.8 and Wikidata 2019, as the number of candidate rules grows cubically with the number of relations. We generated around 9 million candidate rules for DBpedia and around 114 million candidates for Wikidata. In both cases, OP mined all rules of size 2 in 1h 20min (≈21k candidates) and 14 h (≈100k candidates) respectively. However, it failed to mine any rule of size 3 in the remaining time. If we set the minimal support again to 1 and the CWA confidence threshold to 10⁻⁵, AMIE can mine twice as many rules as OP on DBpedia 3.8 in 33 min.

[Parameter tuning]

In our experiment, RuDiK mined rules in Wikidata in 23 h. However, RuDiK was not able to mine rules for 22 of the relations as Virtuoso was not able to compute any of the positive or the negative examples RuDiK requires to operate. This is because RuDiK times out any SPARQL query after 20 s of execution. Virtuoso failed to compute the examples during this time frame on the 22 relations, which are the largest ones in our Wikidata dataset: They cover 84% of the facts. Interestingly, RuDiK did also not find rules that contain these relations in the body (except one, which covered 0.5% of the KB).

[Timeout problems in Virtuoso]

AMIE 3 outperformed both OP and RuDiK in terms of runtime and the number of rules. Moreover, it has the advantage of being exact and complete.
