LIquid: The soul of a new graph database, Part 2

Nesse primeiro post está a parte 1

A parte 2 está disponível em https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2

Acessado em 06/01/2021

Como o grafo LIquid é representado na forma relacional?

In fact, representing a graph as an adjacency list of edges in a tabular relational system is easy:

Uma tabela de vértices com id e literal e uma tabela de arestas na forma sujeito, predicado, objeto (onde cada elemento é um vértice) ... mais parecido com LPG do que RDF

Vantagens: não requer tratamento de NULOS, modelo relacional é bem estudado e fundamentado, não requer normalização e fácil portabilidade entre SGBDs.

Desvantagens: Self-Joins e problema de impedância entre o SQL e OO

If the table is implemented by a sorted structure, some form of B-tree is nearly universal in tabular relational systems, random probing is O(log(N)), and N and the constant factors involved are large.

No artigo abaixo os autores descrevem a estrutura de indexação baseada em HASH usada pelo LIquid

Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. 2019.
Nanosecond Indexing of Graph Data With Hash Maps and VLists.
In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19).
Association for Computing Machinery, New York, NY, USA, 623–635.
DOI:https://doi.org/10.1145/3299869.3314044

Qual é a diferença efetiva que o uso da estrutura de indexação em HASHmaps (key-value) traz?

The fact that social graph data is frequently skewed, with the largest fan-outs being many orders of magnitude larger than the average, exacerbates the performance difference between sorted and hash storage. In a sorted index, finding the size of a result set without actually materializing it will require a second binary search. In practice, it is difficult to decide when the second binary search is worth doing. In a hash-based index, the size of a set can be stored in the top-level hash table, available in 1 L3 miss. Detecting skew in advance of experiencing its adverse effects is cheap and easy.

E na questão da linguagem de consulta, pq usar Datalog?

No texto tem um exemplo de uma query SQL escrita em 25 linhas cuja query correspondente em Datalog tem 5 linhas!

This is Datalog, a subset of the logic programming language Prolog. The above query is the same as the previous SQL query. Unlike SQL, Datalog composes easily:

While Datalog has a dearth of industrial-strength implementations, it is extremely well studied, commonly used in the study of relational databases. To adapt Datalog for our purposes, we need only require that Edge-relations be the only form which can be asserted. All rules must ultimately reduce to Edges.

O esquema do banco de dados também está definido a priori. Que tipo de problemas isso evita?

Relative to object-graphs or SQL, we lack schema. All values are just strings. Worse, we have no way to insist that certain relationships be unique—for example, that an entity should have just one name. Without enforced uniqueness, there’s no way to map a graph into a normalized table.

O LIquid é um hipergrafo?

While many relationships are naturally represented as a subject-predicate-object triple: mother, name, author, there are also many which are not, like an employment tenure (person, company, start date), or an actor’s film performance (actor, role, film). For ternary relationships, we could force them into the existing Edge by making one of the vertices do double duty as a predicate: “Harrison Ford Han-Solo’d in Star Wars,” but this is a one-off hack that doesn't generalize to n-ary relationships.

A graph data model should allow arbitrary n-ary “compound edges” to behave just like simple edges, as unique relationships which either exist or do not.

Em relação a RDF, SPARQL, LPG e outras tecnologias de graph databases, qual seriam as vantagens dessa implementação?

RDF is quite close to LIquid’s Edge in spirit, but has introduced a great deal of complexity. Triples are asymmetric. ... SPARQL seems to have inherited most of SQL’s ailments.

Property graphs introduce a second schema for properties and force the user to choose between representing data as edges or as properties of nodes or edges. .... The obvious encoding of a property graph as a table of vertices (with properties) and a table of edges (with properties) will force two lookups for any query involving properties of vertices and edges.

While many graph database implementations are effectively “in RAM,” none explicitly optimize for random access with hash-based indexing, or exploit fast random access to set sizes in query evaluation. Nothing targets complex queries or supports the query composition necessary to do so easily.

Em resumo, quais são os quatro pilares do LIquid?

All relations are equal and first-class. ... Tuples expressed in a single physical table are vastly more performant than those expressed by joining two or more tables.
Edge traversal is fast and constant-time. If everything is stored as edges, we’re going to be doing a lot of joining, specifically self-joins. Every query with small intermediate results will almost certainly devolve into random access.
Query results are a subgraph. Application models are complex and applications need to be able to retrieve arbitrarily complex models in a single query with a single consistency guarantee.
Schema-evolution is constant-time. If schema-evolution (really, the addition of new predicates) takes non-constant time, then the database implementation must be pretending to operate with edges while maintaining structs or tables under the covers.

Comentários

Veronica dos Santos29 de abril de 2021 às 12:08
Comentei com o Sérgio por email sobre essa iniciativa por ser um grafo em Banco de Dados Relacional e principalmente pelo uso do Datalog.
ResponderExcluir
Respostas
Veronica dos Santos29 de abril de 2021 às 12:29
LIquid é um novo banco de dados baseado em grafo construído pelo LinkedIn para consulta/atualização do LinkedIn Economic Graph
É uma implementação em um modelo relacional que suporta a definição e indexação de relacionamentos n-ários complexos (hipergrafos) e grafos de propriedades (LPG). A interface do LIquid é uma linguagem de consulta declarativa baseada no Datalog e possui uma estrutura e indexação baseada em Hash.
Totalmente descrito por um esquema, as arestas são triplas de strings (sujeito, predicado, objeto) e grafos compostos podem ser construídos a partir de arestas para representar relacionamentos n-ários com atributos arbitrários.
Nem o BD e nem o grafo estão disponíveis para download.
ResponderExcluir
Respostas
Veronica dos Santos1 de julho de 2021 às 12:36
Uma tabela de vértices com id e literal e uma tabela de arestas na forma sujeito, predicado, objeto (onde cada elemento é um vértice) .... se cada componente pode ser um vértice então é possível representar hipergrafos.
ResponderExcluir
Respostas

Adicionar comentário

Postar um comentário

Sinta-se a vontade para comentar. Críticas construtivas são sempre bem vindas.

Pesquisa de Doutorado da Veronica

Pesquisar este blog