Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge

Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge - Rereading Chapters 3 and 4

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann (2021) Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge, No. 22, 1–237, DOI: 10.2200/S01125ED1V01Y202109DSK022, Morgan & Claypool

Chapter 3 - Schema, Identity, Context

https://www.emse.fr/~zimmermann/KGBook/Multifile/schema-identity-context/

We refer to a knowledge graph as a data graph potentially enhanced with representations of schema, identity, context, ontologies and/or rules. These additional representations may be embedded in the data graph, or layered above.

[Additional elements, beyond the vertices and edges of a data graph, that together form a KG]

[Contextual dimensions can be identified in the KG Engineering phase, after the summarisation and profiling (reverse engineering) steps, and then added to the KG; alternatively, the KG is built with them from the start, using the specific ontologies]

Schema

We discuss three types of graph schemata: semantic, validating, and emergent.

A semantic schema allows for defining the meaning of high-level terms (aka vocabulary or terminology) used in the graph, which facilitates reasoning over graphs using those terms.

Aside from classes, we may also wish to define the semantics of edge labels, aka properties.

[Classes and properties/relations are defined at the schema level. In ontologies these would be the data properties and object properties]

Semantic schemata are typically defined for incomplete graph data, where the absence of an edge between two nodes ... does not mean that the relation does not hold in the real world. ... In contrast, if the Closed World Assumption (CWA) were adopted – as is the case in many classical database systems – it would be assumed that the data graph is a complete description of the world, thus allowing to assert with certainty that no flight exists between the two cities. Systems that do not adopt the CWA are said to adopt the Open World Assumption (OWA).

[OWA vs. CWA]

A compromise between OWA and CWA is the Local Closed World Assumption (LCWA), where portions of the data graph are assumed to be complete.
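
[A minimal Python sketch of how the three assumptions answer the same question about an edge that is absent from the graph. The triples, the `holds` function, and the `COMPLETE_FOR` declaration are hypothetical names invented for this note; they do not come from the book.]

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Toy data graph; the edges are illustrative only.
GRAPH: Set[Triple] = {
    ("Santiago", "flight", "Arica"),
    ("Arica", "flight", "Santiago"),
}

# Hypothetical LCWA declaration: (subject, predicate) pairs whose outgoing
# edges are assumed to be completely listed in GRAPH.
COMPLETE_FOR = {("Arica", "flight")}


def holds(triple: Triple, assumption: str) -> Optional[bool]:
    """Return True, False, or None (unknown) for a triple under an assumption."""
    if triple in GRAPH:
        return True
    if assumption == "CWA":
        return False  # absent edges are taken to be false
    if assumption == "OWA":
        return None  # absent edges are simply unknown
    if assumption == "LCWA":
        subject, predicate, _ = triple
        # False only inside the portion of the graph declared complete.
        return False if (subject, predicate) in COMPLETE_FOR else None
    raise ValueError(f"unknown assumption: {assumption}")


missing = ("Arica", "flight", "Iquique")
print(holds(missing, "CWA"), holds(missing, "OWA"), holds(missing, "LCWA"))
```

[Under the LCWA, the same absent edge is false for Arica, whose flights were declared complete, but stays unknown for any other origin.]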

Validating schema

When graphs are used to represent diverse, incomplete data at large scale, the OWA is the most appropriate choice for a default semantics. But in some scenarios, we may wish to guarantee that our data graph – or specific parts thereof – are in some sense “complete”.

[In my example, this would be the rule that assumes a statement is current if it has no end date]

Thus while semantic schemata allow for inferring new graph data, validating schemata allow for validating a given data graph with respect to some constraints.

A shapes graph is formed from a set of interrelated shapes. Shapes graphs can be depicted as UML-like class diagrams...

When declaring shapes, the data modeller may not know in advance the entire set of properties that some nodes can have (now or in the future). An open shape allows the node to have additional properties not specified by the shape, while a closed shape does not.
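
[A small Python sketch of the difference between open and closed shapes, assuming a shape is reduced to a set of required and optional property names. The `Shape` class and `validate` function are hypothetical; real shape languages such as SHACL or ShEx are far richer.]

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Shape:
    required: FrozenSet[str]
    optional: FrozenSet[str] = frozenset()
    closed: bool = False  # open by default: extra properties are tolerated


def validate(node: Dict[str, str], shape: Shape) -> List[str]:
    """Return a list of violations; an empty list means the node conforms."""
    present = set(node)
    errors = [f"missing: {p}" for p in sorted(shape.required - present)]
    if shape.closed:
        allowed = shape.required | shape.optional
        errors += [f"not allowed: {p}" for p in sorted(present - allowed)]
    return errors


# Illustrative node with one property the shape does not mention.
event = {
    "name": "Ñam",
    "start": "2020-03-22T12:00:00",
    "venue": "Santa Lucía",
    "hashtag": "#nam",
}
open_shape = Shape(required=frozenset({"name", "start", "venue"}))
closed_shape = Shape(required=frozenset({"name", "start", "venue"}), closed=True)

print(validate(event, open_shape))
print(validate(event, closed_shape))
```

[The same node conforms to the open shape and violates the closed one only because of the unmentioned property.]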

Emergent schema

Both semantic and validating schemata require a domain expert to explicitly specify definitions and constraints. However, a data graph will often exhibit latent structures that can be automatically extracted as an emergent schema (aka graph summary).

[Reverse engineering of the KG]

In order to describe the structure of the graph, we could consider six partitions of nodes: event, name, venue, class, date-time, city. In practice, these partitions may be computed based on the class or shape of the node.
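
[A rough Python sketch of a quotient graph: every node is replaced by its partition and duplicate edges collapse. The node names are loosely modelled on the book's events example and may not match Figure 2.1 exactly; `quotient_graph` and `PARTITION_OF` are hypothetical names.]

```python
def quotient_graph(edges, partition_of):
    """Map each edge (s, p, o) onto the partitions of its end nodes."""
    return {(partition_of[s], p, partition_of[o]) for s, p, o in edges}


# Illustrative data graph.
EDGES = {
    ("EID15", "name", "Ñam"),
    ("EID15", "type", "Food Festival"),
    ("EID15", "start", "2018-03-22T12:00:00"),
    ("EID15", "venue", "Santa Lucía"),
    ("Santa Lucía", "city", "Santiago"),
    ("EID16", "name", "Food Truck"),
    ("EID16", "type", "Food Festival"),
    ("EID16", "venue", "Sotomayor"),
    ("Sotomayor", "city", "Viña del Mar"),
}

# Assumed assignment of nodes to the six partitions named above.
PARTITION_OF = {
    "EID15": "event", "EID16": "event",
    "Ñam": "name", "Food Truck": "name",
    "Food Festival": "class",
    "2018-03-22T12:00:00": "date-time",
    "Santa Lucía": "venue", "Sotomayor": "venue",
    "Santiago": "city", "Viña del Mar": "city",
}

summary = quotient_graph(EDGES, PARTITION_OF)
for edge in sorted(summary):
    print(edge)
```

[Nine data edges collapse into five summary edges, which is the kind of overview an emergent schema is meant to give.]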

Various other forms of emergent schema not directly based on a quotient graph framework have also been proposed; examples include emergent schemata based on relational tables [Pham et al., 2015], and based on formal concept analysis [González and Hogan, 2018]. Emergent schemata may be used to provide a human-understandable overview of the data graph, to aid with the definition of a semantic or validating schema, to optimise the indexing and querying of the graph, to guide the integration of data graphs, and so forth. We refer to the survey by Čebirić et al. [2019] dedicated to the topic for further details.

[Survey read. The approach of generating the schema by reverse engineering is considered for existing KGs in the KG Engineering phase]

Identity

Without further details, however, disambiguating nodes of this form may rely on heuristics prone to error in more difficult cases. To help avoid such ambiguity, first we may use globally-unique identifiers to avoid naming clashes when the knowledge graph is extended with external data, and second we may add external identity links to disambiguate a node with respect to an external source.

[Identity context of the entities involved in the statements]

Datatypes

... specific dates and times in March 2020. This syntactic form is further recognisable by machine, meaning that with appropriate software, we could order such values in ascending or descending order, extract the year, etc.
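
[A short Python sketch of what "recognisable by machine" buys us, assuming the literals follow the ISO 8601 form used by xsd:dateTime without a time zone. The two values are made up for illustration.]

```python
from datetime import datetime

# Illustrative date-time literals in March 2020.
starts = ["2020-03-29T20:00:00", "2020-03-22T18:00:00"]

# Parsing turns opaque strings into values that can be ordered and inspected.
parsed = sorted(datetime.fromisoformat(s) for s in starts)
years = [d.year for d in parsed]

print(parsed[0].isoformat(), years)
```

[The same parsing step is a natural place for a basic heuristic that flags a literal as a candidate temporal dimension.]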

[Literals can be strings, dates, integers, etc. Datatypes such as dates and GPS coordinates can represent temporal and spatial context, respectively; they would be the basic heuristics for detecting the dimensions]

Lexicalisation

Since identifiers can be arbitrary, it is common to add edges that provide a human-interpretable label for nodes ... indicating how people may refer to the subject node linguistically. Linguistic information of this form plays an important role in grounding knowledge such that users can more clearly identify which real-world entity a particular node in a knowledge graph actually references ...

Knowledge graphs with human-interpretable labels, aliases, comments, etc., (in various languages) are sometimes called (multilingual) lexicalised knowledge graphs.

[Other properties that make up the Identity context of the Entity in an exploratory search]

[Model labels, identifiers, and other properties / relations that make up the Identity context, but in a way that fits the entity type]

Existential nodes

[Blank nodes, reification, ordered lists, a syntactic device]

Context

Many (arguably all) facts presented in the data graph of Figure 2.1 can be considered true with respect to a certain context.

[This is why we use statements rather than facts]

By context we herein refer to the scope of truth, i.e., the context in which some data are held to be true [McCarthy, 1993, Guha et al., 2004].

[The definition we are adopting for the OWA, where the presence of a statement in a KG does not guarantee that it is true; this is the Veracity aspect of the Big Data V's]

However, making context explicit can allow for interpreting the data from different perspectives, such as to understand what held true in 2016, what holds true excluding webpages later found to have spurious data, etc. As seen previously, context for graph data may be considered at different levels: on individual nodes, individual edges, or sets of edges (sub-graphs).

[Motivation for working with context. Context can belong to an entity, a statement, or a set of statements]

Direct representation

The first way to represent context is to consider it as data no different from other data.

While in these examples context is represented in an ad hoc manner, a number of specifications have been proposed to represent context as data in a more standard way. One example is the Time Ontology [Cox et al., 2017], which specifies how temporal entities, intervals, time instants, etc. – and relations between them such as before, overlaps, etc. – can be described in RDF graphs in an interoperable manner. Another example is the PROV Data Model [Gil et al., 2013], which specifies how provenance can be described in RDF graphs, where entities (e.g., graphs, nodes, physical document) are derived from other entities, are generated and/or used by activities (e.g., extraction, authorship), and are attributed to agents (e.g., people, software, organisations).

[We are already considering that ontologies can be used for contextual dimensions]

[In a hyper-relational KG, the edge meta-information that represents context is part of the graph and is associated with entities that are also part of the graph]

Reification

In general, a reified edge does not assert the edge it reifies; for example, we may reify an edge to state that it is no longer valid.
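
[A minimal Python sketch of reification, assuming edges are plain triples. The `reify` helper, the `subject`/`predicate`/`object` labels, and the `valid until` value are hypothetical choices for this note.]

```python
def reify(edge, edge_id):
    """Describe an edge through a fresh node, without asserting the edge itself."""
    subject, predicate, obj = edge
    return {
        (edge_id, "subject", subject),
        (edge_id, "predicate", predicate),
        (edge_id, "object", obj),
    }


edge = ("Santiago", "flight", "Arica")

# The graph talks ABOUT the edge (here: it stopped being valid) via node e1.
graph = reify(edge, "e1") | {("e1", "valid until", "2019")}  # made-up value

is_asserted = edge in graph
print(is_asserted)
```

[The original triple is not a member of the graph, so anything that needs it as a statement has to add it explicitly.]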

[It ceases to be an assertion]

Higher-arity representation

First, we can use a named graph to contain the edge and then define the temporal context on the graph name. Second, we can use a property graph where the temporal context is defined as a property on the edge. Third, we can use RDF* [Hartig, 2017]: an extension of RDF that allows edges to be defined as nodes. Amongst these options, the most flexible is the named graph representation, where we can assign context to multiple edges at once by placing them in one named graph;
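
[A Python sketch of the named graph option, assuming quads of the form (subject, predicate, object, graph name) and a temporal context attached to each graph name. The flights, graph names, and year ranges are invented for illustration.]

```python
# Each edge is placed in a named graph (fourth element of the quad).
QUADS = {
    ("Santiago", "flight", "Arica", "G1"),
    ("Arica", "flight", "Santiago", "G1"),
    ("Santiago", "flight", "Iquique", "G2"),
}

# Context is stated once per graph name: an inclusive (from, until) year range.
GRAPH_CONTEXT = {"G1": (2016, 2018), "G2": (2019, 2020)}


def edges_valid_in(year):
    """Return the edges whose named graph is valid in the given year."""
    return {
        (s, p, o)
        for s, p, o, g in QUADS
        if GRAPH_CONTEXT[g][0] <= year <= GRAPH_CONTEXT[g][1]
    }


print(sorted(edges_valid_in(2017)))
print(sorted(edges_valid_in(2019)))
```

[One context entry qualifies both edges in G1 at once, which is the flexibility the passage attributes to named graphs.]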

[Named graphs would be the most reasonable option because they support qualifiers both for individual edges and for sets of edges.]

Annotations

Thus far, we have discussed representing context in a graph, but we have not spoken about automated mechanisms for reasoning about context; .... writing a query to manually intersect the corresponding temporal contexts will be difficult. An alternative is to consider annotations that provide mathematical definitions of a contextual domain and key operations over that domain that can be applied automatically.

[How to perform inference while taking context into account]

Some annotations model a particular contextual domain; for example, Temporal RDF [Gutiérrez et al., 2007] allows for annotating edges with time intervals, such as Chile–president [2006,2010]➛M. Bachelet, while Fuzzy RDF [Straccia, 2009] allows for annotating edges with a degree of truth such as Santiago–climate 0.8➛Semi-Arid, indicating that it is more-or-less true – with a degree of 0.8 – that Santiago has a semi-arid climate.

Other forms of annotation are domain-independent; for example, Annotated RDF [Dividino et al., 2009, Udrea et al., 2010, Zimmermann et al., 2012] allows for representing context modelled as semi-rings: algebraic structures consisting of domain values (e.g., temporal intervals, fuzzy values, etc.) and two operators to combine domain values: meet and join.
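
[A Python sketch of meet and join for two annotation domains. It assumes temporal annotations are sets of years (so meet is intersection and join is union) and that fuzzy values combine with min and max, which is one common choice rather than the only one. All function names are hypothetical.]

```python
def years(start, end):
    """Temporal annotation as the set of years in the inclusive range."""
    return frozenset(range(start, end + 1))


def temporal_meet(a, b):
    return a & b  # times at which BOTH annotated edges hold


def temporal_join(a, b):
    return a | b  # times at which AT LEAST ONE derivation holds


def fuzzy_meet(a, b):
    return min(a, b)  # a conclusion is no truer than its weakest premise


def fuzzy_join(a, b):
    return max(a, b)  # keep the strongest of several derivations


# A conclusion that needs two edges, annotated [2006,2010] and [2008,2012].
both = temporal_meet(years(2006, 2010), years(2008, 2012))
either = temporal_join(years(2006, 2010), years(2008, 2012))
print(sorted(both), sorted(either), fuzzy_meet(0.8, 0.5))
```

[With the operators fixed per domain, the same reasoning procedure can combine annotations automatically instead of relying on a hand-written query to intersect intervals.]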

[Might there be other examples of context here?]

Other contextual frameworks

Other frameworks have been proposed for modelling and reasoning about context in graphs. A notable example is that of contextual knowledge repositories [Serafini and Homola, 2012], which allow for assigning individual (sub-)graphs to their own context. Unlike in the case of named graphs, context is explicitly modelled along one or more dimensions, where each (sub-)graph takes a value for each dimension. Each dimension is associated with a partial order over its values – e.g., 2020-03-22 $\preceq$ 2020-03 $\preceq$ 2020 – enabling the selection and combination of sub-graphs that are valid within contexts at different granularities. Schuetz et al. [2021] similarly propose a form of contextual OnLine Analytic Processing (OLAP), based on a data cube formed by dimensions where each cell contains a knowledge graph. Operations such as “slice-and-dice” (selecting knowledge according to given dimensions), as well as “roll-up” (aggregating knowledge at a higher level) are supported.

[Both works already read]
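
[A Python sketch of the partial order over a time dimension, assuming values are written as ISO-style prefixes (day, month, year) so that a finer value precedes a coarser one when the coarser is its prefix. Selecting every sub-graph at or below the requested value is one possible reading of "valid within a context"; `precedes`, `select`, and the sub-graphs are hypothetical.]

```python
def precedes(a, b):
    """True if time value a is at or below b, e.g. 2020-03-22 below 2020-03."""
    return a == b or a.startswith(b + "-")


# Illustrative sub-graphs, each tagged with one value of the time dimension.
SUBGRAPHS = {
    "2020-03-22": {("EID15", "venue", "Santa Lucía")},
    "2020-03": {("Santiago", "flight", "Arica")},
    "2019": {("Santiago", "flight", "Iquique")},
}


def select(context):
    """Combine all sub-graphs whose time value falls within the context."""
    result = set()
    for value, triples in SUBGRAPHS.items():
        if precedes(value, context):
            result |= triples
    return result


print(sorted(select("2020")))
```

[Rolling up from a day to a month or a year then amounts to calling `select` with a coarser value.]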

Chapter 4 - Deductive Knowledge

https://www.emse.fr/~zimmermann/KGBook/Multifile/deductive-knowledge/

As humans, we can deduce more from the data graph of Figure 2.1 than what the edges explicitly indicate.

[But it is important to make this explicit, because we may deduce something based on common sense that does not apply in all cases]

In these cases, given the data as premises, and some general rules about the world that we may know a priori, we can use a deductive process to derive new data, allowing us to know more than what is explicitly given by the data. These types of general premises and rules, when shared by many people, form part of “commonsense knowledge” [McCarthy, 1990]; conversely, when rather shared by a few experts in an area, they form part of “domain knowledge”, ...

[Both cases need to be made explicit, because the rules may not be a consensus within a domain, and the user exploring the KG may not be an expert but rather someone learning about the domain. Nothing is obvious!]

In this way, we will be making more of the meaning (i.e., semantics) of the graph explicit in a machine-readable format. These entailment regimes formalise the conclusions that logically follow as a consequence of a given set of premises. Once instructed in this manner, machines can (often) apply deductions with a precision, efficiency, and scale beyond human performance. These deductions may serve a range of applications, such as improving query answering, (deductive) classification, finding inconsistencies, etc.

[Rules can support inference for both machines and people]
[Inferences would be a way to expand / complete the answers]

Ontologies

To enable entailment, we must be precise about what the terms we use mean.

[Semantics of the terms: relations, properties, and classes]

Like all conventions, the usefulness of an ontology depends on the level of agreement on what that ontology defines, how detailed it is, and how broadly and consistently it is adopted. Adoption of an ontology by the parties involved in one knowledge graph may lead to a consistent use of terms and consistent modelling in that knowledge graph. Agreement over multiple knowledge graphs will, in turn, enhance the interoperability of those knowledge graphs.

[Select ontologies that can represent the entities involved in the contextual dimensions. In degenerate KGs, perhaps only the datatypes are used]

Interpretations and models

The distinction between nodes/edges and entities/relations becomes important when we define the meaning of ontology features and entailment. ... These assumptions (or lack thereof) define which interpretations are valid, and which interpretations satisfy which data graphs. We call an interpretation that satisfies a data graph a model of that data graph.

[CWA vs. OWA vs. LCWA, NUNA vs. UNA]

Ontology features

Ontology features for individuals: ASSERTION, NEGATION, SAME AS, DIFFERENT FROM

Ontology features for property axioms: SUB-PROPERTY, DOMAIN, RANGE, EQUIVALENCE, INVERSE, DISJOINT, TRANSITIVE, SYMMETRIC, ASYMMETRIC, REFLEXIVE, IRREFLEXIVE, FUNCTIONAL, INV. FUNCTIONAL, KEY, CHAIN

Ontology features for class axioms and definitions: SUB-CLASS, EQUIVALENCE, DISJOINT, COMPLEMENT, UNION, INTERSECTION, ENUMERATION, SOME VALUES, ALL VALUES, HAS VALUE, HAS SELF, CARDINALITY, QUALIFIED CARDINALITY

Models under semantic conditions

Entailment

We say that one graph entails another if and only if any model of the former graph is also a model of the latter graph. Intuitively this means that the latter graph says nothing new over the former graph and thus holds as a logical consequence of the former graph.

[One is a consequence of the other]

If–then vs. if-and-only-if semantics

Under if–then semantics – if Axiom matches the data graph then Condition holds in domain graph – the graphs do not entail each other: though both graphs give rise to the same condition, this condition is not translated back into the axioms that describe it. ... Conversely, under if-and-only-if semantics – Axiom matches data graph if-and-only-if Condition holds in domain graph – the graphs entail each other: both graphs give rise to the same condition, which is translated back into all possible axioms that describe it. Hence if-and-only-if semantics allows for entailing more axioms in the ontology language than if–then semantics.

[Could this make a difference to the rules for interpreting the context of KGs?]

Reasoning

Rules

A straightforward way to provide automated access to the knowledge that can be deduced through (ontological or other forms of) entailments is through inference rules (or simply rules) encoding if–then-style consequences. A rule is composed of a body (if) and a head (then). Both the body and head are given as graph patterns. A rule indicates that if we can replace the variables of the body with terms from the data graph and form a sub-graph of a given data graph, then using the same replacement of variables in the head will yield a valid entailment. The head must typically use a subset of the variables appearing in the body to ensure that the conclusion leaves no variables unreplaced. Rules of this form correspond to (positive) Datalog [Ceri et al., 1989] in Databases, Horn clauses [Lloyd, 1984] in Logic Programming, etc.

[Same format as the rule for the relative context of the KG]

Rules can be leveraged for reasoning in a number of ways. Materialisation refers to the idea of applying rules recursively to a graph, adding the conclusions generated back to the graph until a fixpoint is reached and nothing more can be added.
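
[A compact Python sketch of materialisation, assuming rules are (body, head) pairs of triple patterns and that variables are strings starting with `?`. The two rules (a sub-property rule and a transitivity rule for `connectsTo`) and the second flight are illustrative assumptions; `match`, `solutions`, and `materialise` are hypothetical names, and the matching is naive rather than efficient.]

```python
def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None  # constant does not match
    return binding


def solutions(body, graph):
    """All variable bindings under which every body pattern is in the graph."""
    bindings = [{}]
    for pattern in body:
        extended = []
        for binding in bindings:
            for triple in graph:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings


def materialise(graph, rules):
    """Apply the rules repeatedly until no new edge is produced (fixpoint)."""
    graph = set(graph)
    while True:
        new = set()
        for body, head in rules:
            for binding in solutions(body, graph):
                for pattern in head:
                    new.add(tuple(binding.get(t, t) for t in pattern))
        new -= graph
        if not new:
            return graph
        graph |= new


RULES = [
    ([("?x", "flight", "?y")], [("?x", "connectsTo", "?y")]),
    ([("?x", "connectsTo", "?y"), ("?y", "connectsTo", "?z")],
     [("?x", "connectsTo", "?z")]),
]
DATA = {("Santiago", "flight", "Arica"), ("Arica", "flight", "Iquique")}

closed = materialise(DATA, RULES)
print(("Santiago", "connectsTo", "Iquique") in closed)
```

[The result could be stored ahead of time, so that queries run against the materialised graph without any rule evaluation at query time.]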

[The rules could be computed only at query time, or they could be pre-computed and stored for optimisation]

The materialised graph can then be treated as any other graph. ... Another strategy is to use rules for query rewriting, which given a query, will automatically extend the query in order to find solutions entailed by a set of rules;
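
[A Python sketch of query rewriting for the simplest case, assuming the only rules are sub-property axioms and the query is a single pattern (?x, predicate, ?y). The `bus` property, the `capitalOf` edge, and the function names are hypothetical.]

```python
def rewrite(pattern, sub_property_of):
    """Expand one pattern into a union over all (transitive) sub-properties."""
    subject, predicate, obj = pattern
    predicates, frontier = {predicate}, [predicate]
    while frontier:
        current = frontier.pop()
        for sub, sup in sub_property_of:
            if sup == current and sub not in predicates:
                predicates.add(sub)
                frontier.append(sub)
    return [(subject, p, obj) for p in sorted(predicates)]


def answers(patterns, graph):
    """Evaluate a union of patterns whose subject and object are variables."""
    wanted = {p for _, p, _ in patterns}
    return {(s, o) for s, p, o in graph if p in wanted}


SUB_PROPERTY_OF = {("flight", "connectsTo"), ("bus", "connectsTo")}
DATA = {
    ("Santiago", "flight", "Arica"),
    ("Arica", "bus", "Iquique"),
    ("Santiago", "capitalOf", "Chile"),
}

query = rewrite(("?x", "connectsTo", "?y"), SUB_PROPERTY_OF)
print(query)
print(sorted(answers(query, DATA)))
```

[The data graph is left untouched; only the query grows, which trades storage for extra work at query time.]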

[We will not rewrite the query, but we can relax it using disjunction (SOME VALUES), or use the default context and answer with that context]

Various languages allow for expressing rules over graphs – independently or alongside of an ontology language – including: Notation3 (N3) [Berners-Lee and Connolly, 2011], Rule Interchange Format (RIF) [Kifer and Boley, 2013], Semantic Web Rule Language (SWRL) [Horrocks et al., 2004], and SPARQL Inferencing Notation (SPIN) [Knublauch et al., 2011], amongst others.

[Could any of these languages be useful for the rules?]

Languages such as SPIN represent rules as graphs, allowing the rules of a knowledge graph to be embedded in the data graph. Taking advantage of this fact, we can then consider a form of graph entailment $G_1 \cup \gamma(\mathcal{R}) \models_\Phi G_2$, where by $\gamma(\mathcal{R})$ we denote the graph representation of rules $\mathcal{R}$.

[The KG and the rules would no longer be separate]

Description Logics

Description Logics (DLs) were initially introduced as a way to formalise the meaning of frames and semantic networks. Since semantic networks are an early version of knowledge graphs, and DLs have heavily influenced the Web Ontology Language, DLs thus hold an important place in the logical formalisation of knowledge graphs.

[Is it worth investing more in researching DLs?]

DLs are based on three types of elements: individuals, such as Santiago; classes (aka concepts) such as City; and properties (aka roles) such as flight.

DLs then allow for making claims, known as axioms, about these elements. Assertional axioms can be either unary class relations on individuals, such as City(Santiago), or binary property relations on individuals, such as flight(Santiago,Arica). Such axioms form the Assertional Box (A-Box).

DLs further introduce logical symbols to allow for defining class axioms (forming the Terminology Box, or T-Box for short), and property axioms (forming the Role Box, R-Box); for example, the class axiom City $\sqsubseteq$ Place states that the former class is a sub-class of the latter one, while the property axiom flight $\sqsubseteq$ connectsTo states that the former property is a sub-property of the latter one.

DLs may then introduce a rich set of logical symbols, not only for defining class and property axioms, but also defining new classes based on existing terms; as an example of the latter, we can define a class $\exists$nearby.Airport as the class of individuals that have some airport nearby. Noting that the symbol $\top$ is used in DLs to denote the class of all individuals, we can then add a class axiom $\exists$flight.$\top$ $\sqsubseteq$ $\exists$nearby.Airport to state that individuals with an outgoing flight must have some airport nearby. Noting that the symbol $\sqcup$ can be used in DL to define that a class is the union of other classes, we can further define, for example, that Airport $\sqsubseteq$ DomesticAirport $\sqcup$ InternationalAirport, i.e., that an airport is either a domestic airport or an international airport (or both).

A DL knowledge base consists of an A-Box, a T-Box, and an R-Box.

DL knowledge base
A DL knowledge base K is defined as a tuple (A, T, R), where A is the A-Box: a set of assertional axioms; T is the T-Box: a set of class (aka concept/terminological) axioms; and R is the R-Box: a set of relation (aka property/role) axioms.
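
[A Python sketch of the (A, T, R) tuple, assuming the T-Box and R-Box contain only simple sub-class and sub-property axioms between named terms. This covers a tiny fragment of any real DL; `KnowledgeBase` and `closure` are hypothetical names, and the axioms reuse the examples quoted above.]

```python
from typing import FrozenSet, NamedTuple, Tuple


class KnowledgeBase(NamedTuple):
    abox: FrozenSet[Tuple[str, ...]]  # City(Santiago) as ("City", "Santiago")
    tbox: FrozenSet[Tuple[str, str]]  # (sub-class, super-class) axioms
    rbox: FrozenSet[Tuple[str, str]]  # (sub-property, super-property) axioms


def closure(kb):
    """Assertions entailed by sub-class and sub-property axioms alone."""
    abox = set(kb.abox)
    changed = True
    while changed:
        changed = False
        for axiom in list(abox):
            name, args = axiom[0], axiom[1:]
            # Unary assertions use the T-Box, binary ones use the R-Box.
            box = kb.tbox if len(args) == 1 else kb.rbox
            for sub, sup in box:
                derived = (sup, *args)
                if sub == name and derived not in abox:
                    abox.add(derived)
                    changed = True
    return abox


kb = KnowledgeBase(
    abox=frozenset({("City", "Santiago"), ("flight", "Santiago", "Arica")}),
    tbox=frozenset({("City", "Place")}),
    rbox=frozenset({("flight", "connectsTo")}),
)
entailed = closure(kb)
print(sorted(entailed))
```

[Keeping the three boxes separate makes it explicit which part holds the data and which parts hold the terminology used to reason over it.]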
