Pular para o conteúdo principal

Querying in the Age of Graph Databases and Knowledge Graphs - Leitura de Artigo

Marcelo Arenas, Claudio Gutierrez, and Juan F. Sequeda. 2021. Querying in the Age of Graph Databases and Knowledge Graphs. Proceedings of the 2021 International Conference on Management of Data. Association for Computing Machinery, New York, NY, USA, 2821–2828. DOI:https://doi.org/10.1145/3448016.3457545

SIGMO21

ABSTRACT
Graphs have become the best way we know of representing knowledge. ... Graph databases and knowledge graphs surface as the most successful solutions to this program. ...

*DBLP 2015

2 DATABASES, GRAPHS AND KNOWLEDGE GRAPHS

Knowledge, as understood e.g. in “knowledge representation”, “knowledge base” or “knowledge graph”, is a notion coming from the tradition of (formal) reasoning, and encompasses both, objects that statically represent knowledge (books, maps,charts, theorems, scientific laws, etc.), and mechanisms to dynamically obtain, deduce, or infer new knowledge from known premises or inputs (deductive systems, reasoners, neural networks, etc.). In fact, it is developed in [xx] the idea that a great part of the developments in data and knowledge since the 50s can be thought of as a way of giving digital support to the representations (e.g. tables, graphs, etc.) and mechanisms of traditional reasoning (e.g. reasoners, expert systems, etc.).

[Conhecimento não seria somente as entidades e relacionamentos representados, inclui também as regras de interpretação e geração de mais conhecimento]

Finally, information (which is not our concern here) speaks about semantics, meaning and interpretation for the subject that will use it. A database as such does not mean anything. It needs an interpretative apparatus, which is usually the schema and, more generally, some metadata. A digital text is a sequence of symbols. Again, needs a context, an interpretation, to get information from it. A similar phenomena occurs with a sequence of pixels: means nothing if not interpreted by a human eye or displayed by a visualization software.

[O suporte para interpretação da informação é a semântica, significado, contexto que meta-informação fornece]

2.2 Graph data and querying

Graphs have a simple data structure consisting of nodes and edges, which has the nice operational properties of expressing relations, presenting data in a rather holistic way (neither ordered nor sequentially), and last but not least, having a flexible structure that permits growing and shrinking (adding/deleting nodes and edges) and integration (of different graphs) in a natural way.

[Flexibilidade do modelo de grafo]

** Tipos de queries **

(i) local properties (nodes and neighborhoods), where pattern matching and its extensions play a key role, being usually approached with logical methods

[BGP/CGP]

(ii) connectivity, where paths and more complicated structures need to be extracted, usually by means of a mixture of regular expressions and logical methods; and

[RPG/NGP]
 
(iii) global properties, where the entire structure of a graph is considered, and that essentially need different methods and approaches than those of (i) and (ii). Such methods are usually put under the graph analytics umbrella

[Algoritmos: clusterização, PageRank, ...]

2.3 Some paradigmatic examples
 
To get a flavor of how these structures and ideas work in practice, let us briefly review three paradigmatic examples.

Semantic networks. One of the first attempt to represent knowledge in the form of graphs was the notion of Semantic Network... First born as a purely diagrammatic representation, soon researchers formalized those graphs to give them a formal semantics using logical methods. Graphs (called networks in this field) are used because they are historically very good objects to represent knowledge (in the static sense). Besides being a bridge to visualization of knowledge, semantic networks have the feature (shared with graphs in general) of highlighting and facilitating the discovery and representation of relationships.

[Semelhança entre KG e KB do tipo semantic network]

Graph Databases. Classical relational databases are flexible enough to represent graphs, e.g. using a two attribute relation representing edges in a graph. Moreover, one could consider relational schemata as generalization of binary relations (𝑚-ary relations). In this representation, nodes are entries and paths are constructed by successive joins. Why then do we need graph databases? There are at least two reasons: joins are expensive and thus, reasoning about paths –that are one the main motive to use graphs– becomes very costly. But furthermore, global properties are not easy to compute with classical queries based on logical methods.     

[Junções em BDs relacionais são custosas ... mas eu não tô focada em eficiência e sim em eficácia. Conseguiria representar afirmações com suas justificativas em tabelas? SIM ... mas Justificativas também são Afirmações logo seria necessário em auto-relacionamento aqui]

So, it comes as no surprise that people have tried since the early days of databases to develop graph databases. First at the hardware level with the hierarchical and network models of the sixties; then at a more logical level in the eighties where we found the “golden era” of graph databases. However, many constraints, among them the limitations of hardware and software, did not permit the popularization of these systems.

[Bancos de dados em grafo não são novos, década de 60]

Semantic Web. The Web is the first comprehensive system for representing, integrating and “producing” knowledge at big scale, whose key idea is taking advantage of the structure and features of the graph model (particularly the idea of integrating and extending knowledge and data by linking).

[Web Semântica para dados em grafo em larga escala]

2.4 Knowledge Graphs

The cases of Semantic Networks, Graph databases and the Semantic Web point to systems or software infrastructure in which graphs supports the representation, integration and production of knowledge. This rather fuzzy idea is what is at the core of the notion of Knowledge Graph. So we can define a Knowledge Graph as a software object (artifact) that represents (codifies), integrates and produces knowledge. In order to perform these functionalities, it relies on the model of graphs. In this regard, the notion of knowledge graph encompasses many previous developments. Knowledge representation is incorporated mainly through standardized languages, metadata and ontologies on one hand, and semantic networks and other non-formal language representations on the other.

[Nem todo KG usa os padrão da Web Semântica. Nova definição de KG]

Finally, producing new knowledge is probably the main “added-value” as opposed to classical repositories of knowledge. Knowledge graphs have the capabilities of: deducing, e.g. by means of logical reasoners or neural networks; linking, that is, relating different pieces of knowledge beforehand isolated; learning, through new data and learning algorithms; and of course, generating new knowledge by human intervention through refined ways of querying it. This new knowledge is not only user-oriented, but is utilized in parallel to “complete” and enrich the knowledge already present.

[A regras para geração da mais conhecimento fazem parte do KG]

3 GRAPH DATA MODELS

The following are two basic ingredients to define graph data models.

Assume that Const is a set of constants, or strings, that can be used for different purposes, for example as node identifiers, edge identifiers, labels, property names or actual values (such as integers, real values or dates).

[Interpretação Const]

Moreover, define a multigraph as a graph where multiple edges can connect two nodes, that is, a tuple (𝑁,𝐸,𝜌) where 𝑁 ⊆ Const is a set of nodes, 𝐸 ⊆ Const is a set of edges and 𝜌 : 𝐸 → 𝑁 × 𝑁 is a function indicating the starting and ending node of each edge.

[Definição de multigrafo e de aresta]

As a first data model, we consider labeled graphs, which are a popular and simple way to represent semi-structured data. Formally, a labeled graph is a tuple L = (𝑁,𝐸,𝜌,𝜆) where (𝑁,𝐸,𝜌) is a multigraph and 𝜆 : (𝑁 ∪ 𝐸) → Const is a function indicating the label of each node and edge. Such graphs have been called heterogeneous graphs in the literature, as opposed to edge-labeled graphs where labels are only associated to edges. But here we prefer the simple term labeled graph to indicate that both nodes and edges are labeled.

[Grafo rotulados, nós e arestas rotuladas, heterogêneos pq tem mais de um tipo de rótulo para nós]

A first characteristic that distinguish RDF graphs is that edges are replaced by triples, and they are not assigned identifiers. Formally, an RDF graph is a set of triples (𝑠,𝑝,𝑜) such that 𝑠,𝑝,𝑜 ∈ Const, so that (𝑠,𝑝,𝑜) represents an edge from 𝑠 to 𝑜 with label 𝑝. A second important feature of RDF graphs is that Const is considered as a set of Uniform Resource Identifiers (URIs), that can be used to identify any resource used by Web technologies. In this way, RDF graphs have a universal interpretation: if 𝑐 ∈ Const is used in two different RDF graphs, then 𝑐 is considered to represent the same element

[RDF não tem identificador nas arestas]

As a second model we consider property graphs, which are widely used in graph databases. Property graphs are defined as the extension of labeled graphs where nodes and edges can have values for some properties. Formally, a property graph is a tuple P = (𝑁,𝐸,𝜌,𝜆,𝜎) where (𝑁,𝐸,𝜌,𝜆) is a labeled graph, and 𝜎 : (𝑁 ∪ 𝐸) × Const → Const is a partial function such that if 𝜎 (𝑜, 𝑝) = 𝑣, then 𝑣 is said to be the value of property 𝑝 for object 𝑜. Besides, it is assumed that each node or edge in P has values for a finite number of properties.

[Modelo LPG, não reforça que não tem identificador nas arestas também mas estabele que o par chave-valor das propriedade não são arestas e nós]

Formally, a vector-labeled graph of dimension 𝑑, with 𝑑 ≥ 1, is a tuple V = (𝑁,𝐸,𝜌,𝜆) where (𝑁,𝐸,𝜌) is a multigraph and 𝜆 : (𝑁 ∪ 𝐸) → Const𝑑 is a function that assigns a vector of values to each node and edge in the graph, which is called a vector of features of dimension 𝑑. Hence, labels and properties are replaced by vectors of values from Const in vector-labeled graphs,

[Representar nós e arestas como vetores de dimensão d, representação sub simbólica com embeddings]

4 QUERYING: SOME NEW CHALLENGES AND TECHNIQUES

4.1 A necessary terminology

As mentioned before, extracting nodes and paths is a fundamental task when retrieving knowledge from graphs. Regular expressions form the core of such an extraction task, so we need to fix a notation for regular expressions before going into the details of the results shown in this section.

[Não tem uma linguagem abstrata, logo todos precisam estabelecer uma notação]

4.2 Local properties

The task of matching a pattern against a graph is fundamental when extracting knowledge. In the previous section, we considered this problem for regular expressions, but such patterns can be specified in other frameworks, ranging from logic-based declarative languages to more procedural frameworks such as graph neural networks. The goal of this part of the document is to show a recently established tight connection between these apparently different frameworks, which has interesting corollaries in terms of the use of declarative formalisms to specify patterns, versus the use of procedural formalisms to efficiently evaluate them.

This regular expression can be specified in first-order logic:
𝜑 (𝑥) = person(𝑥) ∧ ∃𝑦∃𝑧 (rides(𝑥, 𝑦) ∧ bus(𝑦) ∧ rides(𝑧, 𝑦) ∧ infected(𝑧)),
considering node labels as unary predicates and edge labels as binary predicates. Notice that only unary and binary predicates are used in it, and they are placed in a sequence in which values of variables can be forgotten, allowing them to be reused.

[Usou lógica de primeira ordem]

...

4.3 Connectivity properties

Computing the complete set of answers to a graph query can be prohibitively expensive. As a way to overcome this limitation, the idea of enumerating the answers to a query with a small delay has recently attracted much attention. More specifically, the computation of the answers is divided into a preprocessing phase, where a data structure is built to accelerate the process of computing answers, and then in an enumeration phase, the answers are produced with a polynomial-time delay between them.... However, how can we know how complete is the set of answers calculated by such algorithms?

[Identificar caminhos, problema de eficiência]

...

4.4 Global properties

Graph analytic makes reference to a series of techniques to analyze the structure and content of a graph as a whole. Typical applications include clustering [73], computation of connected components and the diameter of a graph, computation of shortest paths between pairs of nodes, calculation of centrality measures [61], such as betweenness centrality [34 ] and PageRank [ 23], and community detection, such as finding the subgraph of a graph with the largest density [36 , 54 ], to identify groups with a rich interaction in a network [48] or groups with suspicious behaviour [47, 63].

How should knowledge be included in such techniques?

[Além de nós e arestas, que tipo de regra poderia ser agregada a essas operações? ]

...

As a final remark, it is an interesting open problem how to define a declarative language for model interpretability, which not only should be expressive enough to represent the aforementioned interpretation queries, but also should be adequate for efficient implementation. Notice that this latter problem is directly related with the representation used for classification models, and the structural properties of this representation. For instance, the labeled graph representing a classification model can be a decision tree, in which case the aforementioned interpretation queries can be solved efficiently [19].

[Não entendi, o Daniel falou em árvore de decisão]

5 TAKEAWAY MESSAGES

Graphs have become ubiquitous in data and knowledge management. We argued that one of the main drivers of this blooming is the dual character of graphs: on one hand, being a simple, flexible and extensible data structure; and on the other, being one of the most deep-rooted form of representing human knowledge.

[Flexível e com capacidade de raciocíneo]

Graphs (as representation) unveil social aspects that are relatively far from being the main concerns of our area. Traditionally, we divided our labor between designers, organizing the conceptual boxes (through schemata and metadata) that contain data; and we, data people, dealing with preserving and transforming such data. Knowledge graphs mixed both worlds making difficult to trace a clear frontier between them.

[Dados e Metadados juntos, Informação e Meta-informação juntos]

Comentários

Postagens mais visitadas deste blog

Aula 12: WordNet | Introdução à Linguagem de Programação Python *** com NLTK

 Fonte -> https://youtu.be/0OCq31jQ9E4 A WordNet do Brasil -> http://www.nilc.icmc.usp.br/wordnetbr/ NLTK  synsets = dada uma palavra acha todos os significados, pode informar a língua e a classe gramatical da palavra (substantivo, verbo, advérbio) from nltk.corpus import wordnet as wn wordnet.synset(xxxxxx).definition() = descrição do significado É possível extrair hipernimia, hiponimia, antonimos e os lemas (diferentes palavras/expressões com o mesmo significado) formando uma REDE LEXICAL. Com isso é possível calcular a distância entre 2 synset dentro do grafo.  Veja trecho de código abaixo: texto = 'útil' print('NOUN:', wordnet.synsets(texto, lang='por', pos=wordnet.NOUN)) texto = 'útil' print('ADJ:', wordnet.synsets(texto, lang='por', pos=wordnet.ADJ)) print(wordnet.synset('handy.s.01').definition()) texto = 'computador' for synset in wn.synsets(texto, lang='por', pos=wn.NOUN):     print('DEF:',s...

truth makers AND truth bearers - Palestra Giancarlo no SBBD

Dando uma googada https://iep.utm.edu/truth/ There are two commonly accepted constraints on truth and falsehood:     Every proposition is true or false.         [Law of the Excluded Middle.]     No proposition is both true and false.         [Law of Non-contradiction.] What is the difference between a truth-maker and a truth bearer? Truth-bearers are either true or false; truth-makers are not since, not being representations, they cannot be said to be true, nor can they be said to be false . That's a second difference. Truth-bearers are 'bipolar,' either true or false; truth-makers are 'unipolar': all of them obtain. What are considered truth bearers?   A variety of truth bearers are considered – statements, beliefs, claims, assumptions, hypotheses, propositions, sentences, and utterances . When I speak of a fact . . . I mean the kind of thing that makes a proposition true or false. (Russe...

DGL-KE : Deep Graph Library (DGL)

Fonte: https://towardsdatascience.com/introduction-to-knowledge-graph-embedding-with-dgl-ke-77ace6fb60ef Amazon recently launched DGL-KE, a software package that simplifies this process with simple command-line scripts. With DGL-KE , users can generate embeddings for very large graphs 2–5x faster than competing techniques. DGL-KE provides users the flexibility to select models used to generate embeddings and optimize performance by configuring hardware, data sampling parameters, and the loss function. To use this package effectively, however, it is important to understand how embeddings work and the optimizations available to compute them. This two-part blog series is designed to provide this information and get you ready to start taking advantage of DGL-KE . Finally, another class of graphs that is especially important for knowledge graphs are multigraphs . These are graphs that can have multiple (directed) edges between the same pair of nodes and can also contain loops. The...