Pular para o conteúdo principal

LIquid: The soul of a new graph database, Part 2

Nesse primeiro post está a parte 1

A parte 2 está disponível em https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2

Acessado em 06/01/2021

Como o grafo LIquid é representado na forma relacional?

In fact, representing a graph as an adjacency list of edges in a tabular relational system is easy:

Uma tabela de vértices com id e literal e uma tabela de arestas na forma sujeito, predicado, objeto (onde cada elemento é um vértice) ... mais parecido com LPG do que RDF

Vantagens: não requer tratamento de NULOS, modelo relacional é bem estudado e fundamentado, não requer normalização e fácil portabilidade entre SGBDs.

Desvantagens: Self-Joins e problema de impedância entre o SQL e OO

If the table is implemented by a sorted structure, some form of B-tree is nearly universal in tabular relational systems, random probing is O(log(N)), and N and the constant factors involved are large.

No artigo abaixo os autores descrevem a estrutura de indexação baseada em HASH usada pelo LIquid

Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. 2019.
Nanosecond Indexing of Graph Data With Hash Maps and VLists.
In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19).
Association for Computing Machinery, New York, NY, USA, 623–635.
DOI:https://doi.org/10.1145/3299869.3314044



Qual é a diferença efetiva que o uso da estrutura de indexação em HASHmaps (key-value) traz?

The fact that social graph data is frequently skewed, with the largest fan-outs being many orders of magnitude larger than the average, exacerbates the performance difference between sorted and hash storage. In a sorted index, finding the size of a result set without actually materializing it will require a second binary search. In practice, it is difficult to decide when the second binary search is worth doing. In a hash-based index, the size of a set can be stored in the top-level hash table, available in 1 L3 miss. Detecting skew in advance of experiencing its adverse effects is cheap and easy.


E na questão da linguagem de consulta, pq usar Datalog?

No texto tem um exemplo de uma query SQL escrita em 25 linhas cuja query correspondente em Datalog tem 5 linhas! 

This is Datalog, a subset of the logic programming language Prolog. The above query is the same as the previous SQL query. Unlike SQL, Datalog composes easily:

While Datalog has a dearth of industrial-strength implementations, it is extremely well studied, commonly used in the study of relational databases. To adapt Datalog for our purposes, we need only require that Edge-relations be the only form which can be asserted. All rules must ultimately reduce to Edges.

O esquema do banco de dados também está definido a priori. Que tipo de problemas isso evita? 

Relative to object-graphs or SQL, we lack schema. All values are just strings. Worse, we have no way to insist that certain relationships be unique—for example, that an entity should have just one name. Without enforced uniqueness, there’s no way to map a graph into a normalized table.  

O LIquid é um hipergrafo? 

While many relationships are naturally represented as a subject-predicate-object triple: mother, name, author, there are also many which are not, like an employment tenure (person, company, start date), or an actor’s film performance (actor, role, film). For ternary relationships, we could force them into the existing Edge by making one of the vertices do double duty as a predicate: “Harrison Ford Han-Solo’d in Star Wars,” but this is a one-off hack that doesn't generalize to n-ary relationships.

A graph data model should allow arbitrary n-ary “compound edges” to behave just like simple edges, as unique relationships which either exist or do not.

Em relação a RDF, SPARQL, LPG e outras tecnologias de graph databases, qual seriam as vantagens dessa implementação?

RDF is quite close to LIquid’s Edge in spirit, but has introduced a great deal of complexity. Triples are asymmetric. ... SPARQL seems to have inherited most of SQL’s ailments.

Property graphs introduce a second schema for properties and force the user to choose between representing data as edges or as properties of nodes or edges. .... The obvious encoding of a property graph as a table of vertices (with properties) and a table of edges (with properties) will force two lookups for any query involving properties of vertices and edges.

While many graph database implementations are effectively “in RAM,” none explicitly optimize for random access with hash-based indexing, or exploit fast random access to set sizes in query evaluation. Nothing targets complex queries or supports the query composition necessary to do so easily.

Em resumo, quais são os quatro pilares do LIquid?

  1. All relations are equal and first-class. ... Tuples expressed in a single physical table are vastly more performant than those expressed by joining two or more tables.

  2. Edge traversal is fast and constant-time. If everything is stored as edges, we’re going to be doing a lot of joining, specifically self-joins. Every query with small intermediate results will almost certainly devolve into random access.

  3. Query results are a subgraph. Application models are complex and applications need to be able to retrieve arbitrarily complex models in a single query with a single consistency guarantee.

  4. Schema-evolution is constant-time. If schema-evolution (really, the addition of new predicates) takes non-constant time, then the database implementation must be pretending to operate with edges while maintaining structs or tables under the covers.










Comentários

  1. Comentei com o Sérgio por email sobre essa iniciativa por ser um grafo em Banco de Dados Relacional e principalmente pelo uso do Datalog.

    ResponderExcluir
  2. LIquid é um novo banco de dados baseado em grafo construído pelo LinkedIn para consulta/atualização do LinkedIn Economic Graph
    É uma implementação em um modelo relacional que suporta a definição e indexação de relacionamentos n-ários complexos (hipergrafos) e grafos de propriedades (LPG). A interface do LIquid é uma linguagem de consulta declarativa baseada no Datalog e possui uma estrutura e indexação baseada em Hash.
    Totalmente descrito por um esquema, as arestas são triplas de strings (sujeito, predicado, objeto) e grafos compostos podem ser construídos a partir de arestas para representar relacionamentos n-ários com atributos arbitrários.
    Nem o BD e nem o grafo estão disponíveis para download.

    ResponderExcluir
  3. Uma tabela de vértices com id e literal e uma tabela de arestas na forma sujeito, predicado, objeto (onde cada elemento é um vértice) .... se cada componente pode ser um vértice então é possível representar hipergrafos.

    ResponderExcluir

Postar um comentário

Sinta-se a vontade para comentar. Críticas construtivas são sempre bem vindas.

Postagens mais visitadas deste blog

Connected Papers: Uma abordagem alternativa para revisão da literatura

Durante um projeto de pesquisa podemos encontrar um artigo que nos identificamos em termos de problema de pesquisa e também de solução. Então surge a vontade de saber como essa área de pesquisa se desenvolveu até chegar a esse ponto ou quais desdobramentos ocorreram a partir dessa solução proposta para identificar o estado da arte nesse tema. Podemos seguir duas abordagens:  realizar uma revisão sistemática usando palavras chaves que melhor caracterizam o tema em bibliotecas digitais de referência para encontrar artigos relacionados ou realizar snowballing ancorado nesse artigo que identificamos previamente, explorando os artigos citados (backward) ou os artigos que o citam (forward)  Mas a ferramenta Connected Papers propõe uma abordagem alternativa para essa busca. O problema inicial é dado um artigo de interesse, precisamos encontrar outros artigos relacionados de "certa forma". Find different methods and approaches to the same subject Track down the state of the art rese...

Knowledge Graph Embedding with Triple Context - Leitura de Abstract

  Jun Shi, Huan Gao, Guilin Qi, and Zhangquan Zhou. 2017. Knowledge Graph Embedding with Triple Context. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17). Association for Computing Machinery, New York, NY, USA, 2299–2302. https://doi.org/10.1145/3132847.3133119 ABSTRACT Knowledge graph embedding, which aims to represent entities and relations in vector spaces, has shown outstanding performance on a few knowledge graph completion tasks. Most existing methods are based on the assumption that a knowledge graph is a set of separate triples, ignoring rich graph features, i.e., structural information in the graph. In this paper, we take advantages of structures in knowledge graphs, especially local structures around a triple, which we refer to as triple context. We then propose a Triple-Context-based knowledge Embedding model (TCE). For each triple, two kinds of structure information are considered as its context in the graph; one is the out...

KnOD 2021

Beyond Facts: Online Discourse and Knowledge Graphs A preface to the proceedings of the 1st International Workshop on Knowledge Graphs for Online Discourse Analysis (KnOD 2021, co-located with TheWebConf’21) https://ceur-ws.org/Vol-2877/preface.pdf https://knod2021.wordpress.com/   ABSTRACT Expressing opinions and interacting with others on the Web has led to the production of an abundance of online discourse data, such as claims and viewpoints on controversial topics, their sources and contexts . This data constitutes a valuable source of insights for studies into misinformation spread, bias reinforcement, echo chambers or political agenda setting. While knowledge graphs promise to provide the key to a Web of structured information, they are mainly focused on facts without keeping track of the diversity, connection or temporal evolution of online discourse data. As opposed to facts, claims are inherently more complex. Their interpretation strongly depends on the context and a vari...