Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases - Article Reading Notes (Weikum)
Gerhard Weikum, Xin Luna Dong, Simon Razniewski, Fabian M. Suchanek: Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases. Found. Trends Databases 10(2-4): 108-490 (2021)
This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics.
Knowledge harvesting methods have enabled the automatic construction of knowledge bases (KB): collections of machine-readable facts about the real world.
[Real-world objects, but also concepts and abstractions. "Facts" that in reality may be claims, true or false]
1.3 Application Use Cases
Knowledge bases enable or enhance a wide variety of applications.
Semantic Search and Question Answering:
All major search engines have some form of KB as a background asset. Whenever a user’s information need centers around an entity or a specific type of entities, ...., the KB can return a precise and concise list of entities rather than merely giving “ten blue links” to web pages. The earlier example of asking for “dylan protest songs” is typical for this line of semantic search. Even when the query is too complex or the KB is not complete enough to enable entity answers, the KB information can help to improve the ranking of web-page results by considering the types and other properties of entities.
[KB used to help rank IR results]
An additional step towards user-friendly interfaces is question answering (QA) where the user poses a full-fledged question in natural language and the system aims to return crisp entity-style answers from the KB or from a text corpus or a combination of both. An example for KB-based QA is “Which songs written by Bob Dylan received Grammys?”; answers include All Along the Watchtower, performed by Jimi Hendrix, which received a Hall of Fame Grammy Award. An ambitious example that probably requires tapping into both KB and text would be “Who filled in for Bob Dylan at the Nobel Prize ceremony in Stockholm?”; the answer is Patti Smith.
[KB for more complex Q&A]
Overviews on semantic search and question answering with KBs include [36, 128, 345, 526, 641].
*** READ *** [36] H. Bast, B. Buchhold, and E. Haussmann. “Semantic Search on Text and Knowledge Bases”. Foundations and Trends in Information Retrieval. 10(2-3): 119–271. 2016.
[128] D. Diefenbach, V. López, K. D. Singh, and P. Maret. “Core techniques of question answering systems over knowledge bases: a survey”. Knowledge and Information Systems (KAIS). 55(3): 529–569. 2018.
[345] C. Lei, F. Özcan, A. Quamar, A. R. Mittal, J. Sen, D. Saha, and K. Sankaranarayanan. “Ontology-Based Natural Language Query Interfaces for Data Exploration”. IEEE Data Engineering Bulletin. 41(3): 52–63. 2018.
*** TO READ *** [526] R. Reinanda, E. Meij, and M. de Rijke. “Knowledge Graphs: An Information Retrieval Perspective”. Foundations and Trends in Information Retrieval. 2020.
[641] C. Unger, A. Freitas, and P. Cimiano. “An Introduction to Question Answering over Linked Data”. In: Reasoning Web Summer School. Springer, 2014.
Language Understanding and Text Analytics:
Both written and spoken language are full of ambiguities. Knowledge is the key to mapping surface phrases to their proper meanings, so that machines interpret language as fluently as humans. AI-style use cases include machine translation, and conversational assistants like chatbots. Prominent examples include Amazon’s Alexa, Apple’s Siri, Google’s Assistant and new chatbot initiatives [2], and Microsoft’s Cortana.
Understanding entities (and their attributes and associated relations) in text is also key to large-scale analytics over news articles, scientific publications, review forums, or social media discussions
[Other applications involving NLP]
Data Cleaning:
Coping with incomplete and erroneous records in large heterogeneous data is a classical topic in database research (see, e.g., [511]). The problem has become more timely and pressing than ever. Data scientists and business analysts want to rapidly tap into diverse datasets, for comparison, aggregation and joint analysis. So different kinds of data need to be combined and fused, more or less on the fly and thus largely depending on automated tools. This trend amplifies the crucial role of identifying and repairing missing and incorrect values.
[Data integration with the KB as a reference base]
1.4.1 What is a Knowledge Base
A knowledge base (KB) is a collection of structured data about entities and relations with the following characteristics:
• Content: The data contains entities and their semantic types for a given domain of interest. Additionally, attributes of entities (including numeric and string literals) and relationships between entities are captured. The domain of interest can be of broad encyclopedic nature ... or can have specific themes such as indie music or medical and nutritional health products, with a long tail of relevant entities.
[A single domain in depth, or many domains with shallow coverage]
• Quality: We expect the KB content to be of near-human quality, with the rate of invalid statements below the error rate that a collection with expert-level curation would achieve. The KB content should be continuously updated for freshness, and maintained in a consistent way (e.g., no contradictory statements).
[There can be contradictory facts when there are multiple sources and multiple perspectives]
• Schema and Scale: Unlike a conventional database, there is often no pre-determined relational schema where all knowledge has to fit into a static set of relations. ... Therefore, KBs adopt the dataspace “pay-as-you-go” principle [221]: the content is augmented and refined by adding new types, attributes and relations, as the KB grows.
[Flexible schema]
• Open Coverage: An ideal KB would contain all entities and their properties that are of interest for the domain or enterprise. ... Therefore, we have to view KB construction and maintenance as a “never-ending” task, following an open world assumption and acknowledging the high pace of real-world changes.
[A KB is inherently incomplete ... OWA]
Information Retrieval (IR): With search engines being the premier use case of knowledge bases, the IR community has shown great interest in KB methodology, especially knowledge extraction from text which is a major focus of this article. For querying, answer ranking and related IR topics, we refer the reader to the survey by Reinanda et al. [526] with focus on the IR dimension.
*** TO READ *** [526] R. Reinanda, E. Meij, and M. de Rijke. “Knowledge Graphs: An Information Retrieval Perspective”. Foundations and Trends in Information Retrieval. 2020.
Instance-of vs. Subclass-of:
Some KBs do not make a clear distinction between classes and instances, and they collapse the instance-of and subclass-of relations into a single is-a hierarchy. Instead of stating that Bob Dylan is an instance of the class singers and that singers are a subclass of musicians, they would view all three as general entities and connect them in a generalization graph.
[This problem occurs in Wikidata]
As most KBs are of encyclopedic nature, the instances of a relation are often referred to as facts. We do not want to exclude knowledge that is not fact-centric (e.g., commonsense knowledge with a socio-cultural dimension); so we call relational instances more generally statements. The literature also speaks of facts, and sometimes uses the terminology assertion as well. For this article, the three terms statement, fact and assertion are more or less interchangeable.
[Knowledge is not only facts (here "fact" means absolute truth); there is also commonsense knowledge. The statement is the KB's unit of knowledge]
In logical terms, statements are grounded expressions of first-order predicate logic (where “grounded” means that the expression has no variables). In the KB literature, the term “relation” is sometimes used to denote both the relation identifier R and an instance 〈x1, ..., xn〉. We avoid this ambiguity, and more precisely speak of the relation and its (relational) tuples.
[Mesmo "desconforto" do Sérgio em usar relação e relacionamento de modo intercambiável, o Weikum é de BD]
In addition to their attributes, entities are characterized by their relationships with other entities, for example, the birthplaces of people, prizes won, songs written or performed, and so on. Mathematical relations over classes, as introduced above, are the proper formalism for representing this kind of knowledge. The frequent case of binary relations captures the relationship between exactly two entities. ... Some KBs emphasize binary relations only, leading to the notion of knowledge graphs (KGs). However, ternary and higher-arity relations can play a big role, and these cannot be directly captured by a graph.
[KBs can go beyond labeled, directed graph representations]
At first glance, it may seem that we can always decompose a higher-arity relation into multiple binary relations. ... However, not every higher-arity relation is decomposable without losing information.
Consider a quaternary relation won: person × award × year × field capturing who won which prize in which year for which scientific field. Instances would include
〈 Marie Curie, Nobel Prize, 1903, physics 〉
and
〈 Marie Curie, Nobel Prize, 1911, chemistry 〉.
If we simply split these 4-tuples into a set of binary-relation tuples (i.e., SPO triples), we would end up with:
〈MarieCurie, NobelPrize〉, 〈MarieCurie, 1903〉, 〈MarieCurie, Physics〉,
〈MarieCurie, NobelPrize〉, 〈MarieCurie, 1911〉, 〈MarieCurie, Chemistry〉
Leaving the technicality of two identical tuples aside, the crux here is that we can no longer reconstruct in which year Marie Curie won which of the two prizes. Joining the binary tuples using database operations would produce spurious tuples, namely, all four combinations of 1903 and 1911 with physics and chemistry.
[Interesting example to demonstrate that triples are insufficient]
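To make the information loss concrete, here is a minimal Python sketch (relation names are hypothetical) that decomposes the two 4-tuples into three binary relations and then rejoins them, producing exactly the spurious combinations described above:

# Quaternary relation: who won which prize in which year for which field.
won = {
    ("Marie Curie", "Nobel Prize", 1903, "physics"),
    ("Marie Curie", "Nobel Prize", 1911, "chemistry"),
}
# Lossy decomposition into binary relations (duplicates collapse in sets).
won_prize = {(p, a) for (p, a, y, f) in won}
won_year = {(p, y) for (p, a, y, f) in won}
won_field = {(p, f) for (p, a, y, f) in won}
# Rejoining on the person yields all four year/field combinations.
rejoined = {
    (p, a, y, f)
    for (p, a) in won_prize
    for (q, y) in won_year if q == p
    for (r, f) in won_field if r == p
}
assert len(rejoined) == 4  # the original relation had only 2 tuples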
The Semantic Web data model RDF and its associated W3C standards (including the SPARQL query language) support only binary relations. They therefore exploit clever ways of encoding higher-arity relations into a binary representation, based on techniques related to reification [656]. Essentially, each instance of the higher-arity relation is given an identifier of type statement and that identifier is combined with the original relation’s arguments into a set of binary tuples.
The downside of reification and related techniques for casting n-ary relations into RDF is that they make querying more difficult, if not tedious. It requires more joins, and considering paths rather than just single edges when dealing with compound nodes in the graph model. For this reason, some KBs have also pursued hybrid representations where, for each higher-arity relation, the most salient pair of arguments is represented as a standard binary relation and reification is used only for the other arguments.
[Problems with reification]
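A minimal sketch of this encoding with the rdflib library (namespace and property names are made up for illustration): each 'won' instance gets a statement identifier, and reconstructing the year and field of Marie Curie's prizes already requires joining three triples, which is the overhead discussed above.

from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# One identifier per instance of the quaternary relation, combined
# with the original arguments into binary triples.
for stmt, year, field in [(EX.win1903, 1903, EX.Physics),
                          (EX.win1911, 1911, EX.Chemistry)]:
    g.add((stmt, EX.winner, EX.MarieCurie))
    g.add((stmt, EX.award, EX.NobelPrize))
    g.add((stmt, EX.year, Literal(year)))
    g.add((stmt, EX.field, field))

# Reconstructing the 4-tuples now needs a multi-join SPARQL query.
q = """PREFIX ex: <http://example.org/>
SELECT ?year ?field WHERE {
  ?s ex:winner ex:MarieCurie ; ex:year ?year ; ex:field ?field .
}"""
for row in g.query(q):
    print(row.year, row.field)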
2.1.5 Logical Invariants
In addition to the grounded statements about entities, classes and relational properties, KBs can also contain intensional knowledge in the form of logical constraints and rules.
This theme dates back to the pioneering work on hand-crafted KBs like Cyc (Guha and Lenat [208]) and SUMO (Niles and Pease [462]), and is of great importance for automated knowledge harvesting as well. The purpose of constraints is to enforce the consistency of the KB: grounded statements that violate a constraint cannot be entered. For example, we do not allow a second birthdate for a person, as the birthdate property is a function, and we require creators of songs to be musicians (including composers and bands). The former is an example of a functional dependency, and the latter is an example of a type constraint.
[Rules for data consistency and inference]
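A minimal sketch of checking the two constraint kinds named above over plain (subject, predicate, object) statements; the relation and class names are hypothetical:

# Functional dependency: 'birthdate' admits at most one object per subject.
def violates_functional(statements, predicate):
    seen = {}
    for s, p, o in statements:
        if p == predicate:
            if s in seen and seen[s] != o:
                return True  # a second, different value
            seen[s] = o
    return False

# Type constraint: creators of songs must be musicians.
def violates_type(statements, types, predicate, required_class):
    return any(p == predicate and required_class not in types.get(s, set())
               for s, p, o in statements)

stmts = [("BobDylan", "birthdate", "1941-05-24"),
         ("BobDylan", "created", "Hurricane")]
types = {"BobDylan": {"singer", "musician"}}
assert not violates_functional(stmts, "birthdate")
assert not violates_type(stmts, types, "created", "musician")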
Automation vs. Human in the Loop: It is desirable to automate the process of KB creation and curation as much as possible. Ideally, no human effort would be required at all. However, this is unrealistic if we also strive for the other requirements in this list of desiderata. Human inputs may range from manually specifying the schema for types and properties of interest, along with consistency constraints, to judiciously selecting input sources, all the way to providing the system with labeled training samples for supervised knowledge acquisition. We can obtain this suite of inputs from highly qualified knowledge engineers and domain experts, or from crowdsourcing workers at larger scale. All this is about the trade-off between achieving high correctness, coverage and expressiveness, and limiting or minimizing the monetary cost of building and maintaining the KB.
[Trade-off in using HITL: use it only where automation is not possible]
Coping with Trade-offs:
Often, the prioritization of requirements and corresponding design choices depend on the intended downstream applications.
For example, when the use case is a search engine for the Internet or broad usage in an enterprise intranet, the most important point is to identify entities and classes (or categories and labeled lists) in the users’ keywords, telegraphic phrases or full questions, matching them against entities in documents and the KB (see, e.g., [368, 80]). Statements on relationships between entities would be needed only in complex queries which are infrequent on the Internet. Hence, KBs for search engines may well restrict their properties to basic ones of wider interest, like spouses and awards (relevant for search about entertainment celebrities), or songs and albums for musicians (but not composer, instrumentation, lyrics etc.)
[Web searches are less complex: factual, about real-world entities]
Answering Internet queries may even tolerate some errors in the entity identification and in the KB itself, as long as there is a good match between user input and search-engine results. For example, users searching for “Oscar winners” and users looking for “Academy Award winners” should ideally be provided with the same answers, if the KB can tell that “Oscar” and “Academy Award” denote the same entity. But if the KB lacks this knowledge and assumes two different entities, all users would still be satisfied with plenty of good results regardless of their query formulations, because of the topic’s large coverage on the Internet.
[Why would users be satisfied?]
Wikidata circumvents the restrictions of triple-based knowledge representation by reification (e.g., Macron’s inauguration as French president is itself an entity) and by qualifiers for refining SPO triples (cf. Section 2.1.3). Qualifiers are predicates that enrich triples with context, about sources, dates, reasons, etc. For example, the spouse property comes with qualifiers for wedding date and divorce date. Awards, such as Bob Dylan winning the Nobel Prize, have qualifiers for date, location, field, prize money, laudation speaker, etc.
[Qualifiers on Wikidata edges; hyper-relational]
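The effect of qualifiers can be seen on the public Wikidata SPARQL endpoint, where each statement is itself a node to which qualifier values attach. A sketch with the SPARQLWrapper library (Q392 = Bob Dylan, P166 = award received, P585 = point in time are real Wikidata identifiers, but worth double-checking):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?awardLabel ?when WHERE {
  wd:Q392 p:P166 ?stmt .             # statement node (the 'edge')
  ?stmt ps:P166 ?award .             # main value: the award itself
  OPTIONAL { ?stmt pq:P585 ?when . } # qualifier: point in time
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["awardLabel"]["value"], row.get("when", {}).get("value"))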
An important principle of the collaborative community is to allow different perspectives.
Therefore, Wikidata refers to statements as claims. These do not necessarily capture a single view of the world, with universally agreed-upon facts. Rather it is possible (and often happens) that alternative standpoints are inserted as separate statements for the same entity and property. For example, it is accepted that Jesus has more than one birth date, and the same holds for the death date of the mountaineer George Mallory (who died somewhere on Mount Everest with unknown precise date). In principle, Wikidata would even tolerate entering alternative death places for Elvis Presley (including perhaps Mare Elysium on Mars), but the community cross-checks edits and may intervene in such a case.
[Wikidata allows different world views. There is no Absolute Truth.]
Wolfram Alpha is a computational knowledge engine [266, 676] that was developed in the late 2000s, to support fact-centric question answering (QA). The Wolfram company is best known for its computer-algebra system Mathematica. The KB enhances computations with Mathematica by encyclopedic facts such as location coordinates and distances. Conversely, the QA interface, accessible at http://wolframalpha.com, makes use of computational inference by functions of the Mathematica library. The KB that underlies Wolfram Alpha is built by importing and integrating facts from a variety of structured and well curated sources, such as CIA World Factbook, US Geological Survey, Dow Jones, feeds on weather data, official statistics, and more.
[Curation of data sources for KB construction]
The specific strength of Wolfram Alpha is the ability to compute derived knowledge. For example, one can ask about the age of Bob Dylan: the returned answer, as of January 13, 2021, is “79 years, 7 months and 19 days”. Obviously, this is not stored in the KB, but derived by a functional program from the KB statement about Dylan’s birthdate. More advanced cases of this kind of computational knowledge are statistical aggregations, comparisons, rankings, trends, etc., mostly regarding geography, climate, economy and finance. The QA service and other APIs of Wolfram Alpha have been licensed to other companies, including usage for Apple Siri and Amazon Alexa.
[Inference through functions; would it be possible to infer via queries?]
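A toy illustration of such derived knowledge: the age answer is computed on demand from the stored birthdate rather than being stored itself (the KB layout here is made up):

from datetime import date

kb = {("BobDylan", "birthdate"): date(1941, 5, 24)}

def age_in_years(entity, today):
    born = kb[(entity, "birthdate")]
    # Subtract one if this year's birthday has not happened yet.
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

print(age_in_years("BobDylan", date(2021, 1, 13)))  # -> 79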
Human in the Loop:
Throughout this article, we have emphasized the goal of automating KB construction and curation as much as possible. However, human inputs do play a role and their extent can vary widely. The most obvious step is manually checking the validity of KB statements for quality assurance in industrial-strength KBs, which can adopt a variety of strategies (see, e.g., [626]).
Going beyond this stage, seed-based learning benefits from hand-crafted seeds and from feedback on learned patterns and rules (see, e.g., Section 4.3 and Section 9.3). This human intervention serves to flag incorrect or misleading signals for knowledge extraction as early as possible, and helps steer the KB construction process towards high quality.
[HITL to improve KB quality]
It is also conceivable to rely entirely on humans for KB construction. This can either be via human experts (as in the early days of Cyc [347]) or crowdsourcing workers (as for large parts of ConceptNet [597] and ImageNet [124]). Wikidata builds on community inputs [653], where many contributors have programming skills and provide bulk imports from high-quality data sources.
Yet another variant of the human-in-the-loop theme is games with a purpose (such as ESP [6] or Verbosity [652]).
[Gamification]
The natural conclusion is to carefully combine human input with large-scale automation. How to do this in the best way, at low cost and with high quality, is another research challenge for next-generation knowledge bases (see, e.g., [135, 134]).
8.1.1 Quality Metrics
The degrees of correctness and coverage of a KB are captured by the metrics of precision and recall, analogously to the evaluation of classifiers. Precision and recall matter on several levels: for a set of entities that populate a given type, for the set of types known for a given entity, for the set of property statements about a given entity, or for all entities of a given type. For all these cases, assume that the KB contains a set of statements S whose quality is to be evaluated, and a ground-truth set GT for the respective scope of interest.
[Under the OWA this GT is hypothetical]
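In formulas, with S the set of KB statements under evaluation and GT the ground truth for the same scope:

precision(S) = |S ∩ GT| / |S|   (the fraction of KB statements that are correct)
recall(S) = |S ∩ GT| / |GT|     (the fraction of the ground truth that the KB covers)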
No knowledge base can ever be fully complete. This suggests that, unlike databases, which are traditionally interpreted under a Closed-World Assumption (CWA), we should treat KBs as following an Open-World Assumption (OWA):
The Open-World Assumption (OWA) postulates that if a statement is not in the KB, it may nevertheless be true in the real world. Its truth value is unknown.
In other words, absence of evidence is not evidence of absence.
Whenever we probe a statement and the KB does not have that statement (and neither a perfectly contradictory statement), we assume that this statement may be true or may be false – we just do not know its validity
Notwithstanding the general OWA, we are often able to observe and state that a KB is locally complete.
[How to map for which predicates the KB is complete? Could it be a rule? A query?]
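A minimal sketch of the resulting three-valued semantics: under the OWA a lookup returns true, false, or unknown, instead of the CWA's plain true/false (the statements are hypothetical):

kb_positive = {("BobDylan", "won", "NobelPrizeInLiterature")}
kb_negative = set()  # explicitly asserted negative statements, see below

def holds(statement):
    if statement in kb_positive:
        return True
    if statement in kb_negative:
        return False
    return None  # unknown: absence of evidence is not evidence of absence

print(holds(("BobDylan", "won", "TuringAward")))  # -> None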
The design philosophy of KBs is to store positive statements only: facts that do hold. However, it is sometimes also of interest to make negative statements explicit: statements that do not hold, despite common belief or when they are otherwise noteworthy.
The OWL 2 standard [654] provides syntax to capture such formulas. However, specific KBs like Wikidata have only limited ways of expressing negative statements in a principled and comprehensive manner. Wikidata, for example, has statements of the form 〈 Angela Merkel, child, no value〉 where no value has the semantics that no object exists (in the real world) for this SP pair. Essentially, this asserts the LCA for this local context, confirming that the absence of objects is indeed the truth (as opposed to merely not knowing any objects). In addition, statements about counts such as 〈 Angela Merkel, number of children, 0〉 can capture empty sets as well.
The negation of fully grounded statements is not expressible in RDF, whereas it is straightforward in the OWL 2 language [654]. Obviously, it does not make sense to add all (or too many) negative statements even if they are valid.
[Would the hyper-relational model have a standard way to express negation?]
Constraints define invariants that the KB must satisfy in order to be logically consistent, a necessary condition for being correct.
Rules define a calculus to deduce additional statements, to complete the KB and make it logically consistent.
[Deductive rules to generate knowledge]
Soft rules express plausibility restrictions, holding for most cases but tolerating exceptions.
[Rule mining: the rule's % coverage over the instances]
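A sketch of how rule mining systems (in the style of AMIE) quantify a soft rule by support and confidence; the toy rule "spouses live in the same place" and its data are made up:

# Rule: marriedTo(x, y) ∧ livesIn(x, z) => livesIn(y, z)
married = {("Ann", "Bob"), ("Dan", "Eve")}
lives_in = {("Ann", "Paris"), ("Bob", "Paris"), ("Dan", "Oslo"), ("Eve", "Bergen")}

# Body instantiations, and how many of them also satisfy the head.
body = {(x, y, z) for (x, y) in married for (x2, z) in lives_in if x2 == x}
support = sum(1 for (x, y, z) in body if (y, z) in lives_in)
confidence = support / len(body)  # the rule holds for Ann/Bob but not Dan/Eve
print(support, confidence)  # -> 1 0.5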
The vector representation has several use cases. It is a good way of quantifying similarities between entities, and thus enables statistical methods like clustering. Also, deep learning requires real-valued vectors as input. The most prominent application is to predict missing statements for the KB. The literature has referred to this as knowledge graph completion.
[Link prediction to complete the graph]
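A sketch of link prediction with TransE-style embeddings, one common model for this task: a triple (h, r, t) is scored as plausible when h + r ≈ t in the vector space (the toy vectors are made up):

import numpy as np

emb = {
    "BobDylan": np.array([0.1, 0.9]),
    "wrote": np.array([0.5, -0.2]),
    "Hurricane": np.array([0.6, 0.7]),
    "Paris": np.array([-0.8, 0.3]),
}

def score(h, r, t):
    return -np.linalg.norm(emb[h] + emb[r] - emb[t])  # higher = more plausible

# Rank candidate objects for the incomplete triple (BobDylan, wrote, ?).
candidates = ["Hurricane", "Paris"]
print(max(candidates, key=lambda t: score("BobDylan", "wrote", t)))  # Hurricane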
In a KB, each statement should be annotated with provenance metadata about:
• the source (e.g., web page) from where the statement was obtained, or sources when multiple inputs are combined,
• the timestamp(s) of when the statement was acquired, and
• the extraction method(s) by which it was acquired, for example, the rule(s), pattern(s) or classifier(s) used.
This is the minimum information that a high-quality KB should capture for manageability.
[Provenance in KBs]
For query processing, provenance information can be propagated through query operators. This allows tracing each query result back to the involved sources [66], an important element of explaining answers to users in a human-comprehensible way. Provenance information can be added as additional arguments to relational statements, in the style of this example:
〈Bob Dylan, won, Nobel Prize in Literature,
source: www.nobelprize.org/prizes/literature/2016/dylan/facts,
extractor: lstm123〉,
or by means of reification and composite objects in RDF format (see Section 2.1.3). Another option is to group an entire set of statements into a Named Graph.
[How to represent it? In the hyper-relational model it is native; in RDF it requires reification]
The most common solution, however, is what is known as Quads: triples that have an additional component that serves as an identifier of the triple. Then, provenance statements can be attached to the identifier. Conceptually, this technique corresponds to a named graph of exactly one statement; it is used, for example, in YAGO 2 [258] and in Wikidata [653]. Various triple stores support quads.
[A quad adds an ID to the edge. It can be used to represent any context, not only provenance]
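A sketch of the quad idea with rdflib's Dataset: the named graph acts as the statement identifier, and provenance triples attach to it (the namespace is hypothetical; the source URL and extractor name come from the example above):

from rdflib import Dataset, Literal, Namespace

EX = Namespace("http://example.org/")
ds = Dataset()

stmt_id = EX.stmt1  # the fourth component of the quad
ds.graph(stmt_id).add((EX.BobDylan, EX.won, EX.NobelPrizeInLiterature))

# Provenance statements attached to the statement identifier.
ds.add((stmt_id, EX.source,
        Literal("www.nobelprize.org/prizes/literature/2016/dylan/facts")))
ds.add((stmt_id, EX.extractor, Literal("lstm123")))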
Temporal Scopes:
We aim for temporally scoped statements, by annotating SPO triples with their validity times, which can be timepoints or time intervals. This can be expressed either via higher-arity relations, such as
wonPrize (EnnioMorricone, Grammy, 11-February-2007)
wonPrize (EnnioMorricone, Grammy, 8-February-2009)
capital (Germany, Bonn, [1949-1990])
capital (Germany, Berlin, [1991-now])
or by means of reification and composite objects
[Using the hyper-relational model with edge IDs for temporal context]
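A small sketch of querying such temporally scoped statements, stored here as 5-tuples with a validity interval (the exact interval boundaries are assumed for illustration; None marks "now"):

from datetime import date

capital = [
    ("Germany", "capital", "Bonn", date(1949, 1, 1), date(1990, 12, 31)),
    ("Germany", "capital", "Berlin", date(1991, 1, 1), None),
]

def capital_at(when):
    for s, p, o, start, end in capital:
        if start <= when and (end is None or when <= end):
            return o

print(capital_at(date(1975, 6, 1)))  # -> Bonn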
Quality measures for correctness and coverage are computed by sampling statements with human judgements, by crowdsourcing or, if necessary, experts.
9.4 Knowledge Sharing Community: Wikidata
For this survey article, three aspects of Wikidata are especially relevant: i) how knowledge in Wikidata is organized, ii) how the schema evolves in this collaborative setting, and iii) the role that Wikidata plays as a hub for entity identification and interlinkage between datasets.
[The WD model with claims and statements, collaborative, with mappings to external IDs]
Evolving Scope and Focus:
The scope and focus of Wikidata repeatedly come into discussion, especially when new data imports are proposed. In 2019, for instance, debates revolved around the import of scholarly data (mostly identifiers for authors and publications), which currently makes up 40% of Wikidata’s entities. Such imbalances could bias search and ranking functionality, and put strain on Wikidata’s infrastructure, potentially at the cost of other stakeholders. Therefore, decisions on in-scope and out-of-scope topics are a recurring concern.
[It is general-purpose, but the scholarly data domain dominates the content]
9.5.5 Search and Question Answering with Machine Knowledge
Industrial knowledge bases at Internet companies have been applied in four major categories: search, recommendation, information display, and question answering / conversations. Search and recommendation are the major ways for users to discover new information, including not only web pages, news, images and videos, but also specific entities such as products, businesses, music, movies, house-for-rent, and so on ... Exploiting knowledge for recommendation and conversation are important issues, too, but still open research topics.
According to the market research company SparkToro [179], as of June 2019, half of Google searches resulted in “zero clicks”: the users were directly satisfied with seeing a knowledge panel, carousel or other KB excerpt, as opposed to the traditional result page of ten blue links with preview snippets.
Taking knowledge panels as an example, there are several decisions to make by the search engine:
• D1: Shall the query trigger the display of a knowledge panel? For example, a knowledge panel is appropriate for the query “Taj Mahal”, but not for queries in the style of “democrats vs. republicans” or “who will win the upcoming election?”.
[More complex questions]
• D2: If so, which entity is the query asking about? For the “Taj Mahal” query, shall we return the famous Indian palace or the American musician with the stage name “Taj Mahal”? Or should we show both and let the user choose?
• D3: Which attributes and relations shall be shown for this entity, and how should they be ordered?
In addition to the case of zero-click search results, the KB also helps to improve standard search results in several ways (standard search is nowadays often referred to as “organic search”, as opposed to paid/sponsored search results). First, with the entities and properties in the KB, both query understanding and document interpretation are improved, feeding new features to the search engine to leverage the query-and-click log to train better ranking models (see, e.g., [129, 526] and references there). The techniques are further improved using contextual language models such as BERT [127]. Second, KBs help in various search-assistance tasks, including query auto-completion, spelling correction, query reformulation and query expansion (see, e.g., [526]).
[Using the KB for ranking]
While personal knowledge is still factual, referring only to one individual, there are also use cases for capturing subjective knowledge, regarding beliefs and argumentations. For example, it could be of interest to have statements of the form:
Fabian believes id1: (Elvis livesOn Mars)
Gerhard believes id2: (Dylan deserves PeaceNobelPrize)
Luna supports Gerhard on id2
Simon opposes Gerhard on id2
These kinds of statements, with second-order predicates about attribution and stance, are essential to capture people’s positions and potential biases in discussions and controversies.
[Representing beliefs and oppositions]
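These second-order statements can again be represented with statement identifiers, as with the quads above; a minimal sketch mirroring the example (all names are taken from the text):

# Claims are ordinary statements with identifiers; beliefs and stances
# refer to a claim's ID (or to another person's attitude) rather than
# to the claim directly.
claims = {
    "id1": ("Elvis", "livesOn", "Mars"),
    "id2": ("Dylan", "deserves", "PeaceNobelPrize"),
}
attitudes = [
    ("Fabian", "believes", "id1"),
    ("Gerhard", "believes", "id2"),
    ("Luna", "supports", ("Gerhard", "id2")),  # stance toward a belief
    ("Simon", "opposes", ("Gerhard", "id2")),
]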