
LIquid: The soul of a new graph database, Part 1

Some notes on the text



Available at: https://engineering.linkedin.com/blog/2020/liquid-the-soul-of-a-new-graph-database-part-1
Accessed on: 26/07/2020

What is LIquid?

LIquid is a new graph database built by LinkedIn to support human real-time querying of the economic graph. It is a complete implementation of the relational model that supports fast, constant-time traversal of graph edges with a relational graph data model that is simple and self-describing, yet still manages to support the definition and indexing of complex n-ary relationships and property graphs. LIquid’s interface is a declarative query language based on Datalog. LIquid’s query processing engine achieves high performance by using dynamic query planning on wait-free shared-memory index structures.

Most social networks use graphs to model the interactions/relationships between profiles. What challenge motivated LinkedIn to develop its own graph database based on the relational model?

the value of an economic graph for the average member lies mostly in their second degree network. These are the connections of your connections, such as the colleague of an old school friend, or the new boss of a co-worker from a prior job.
...
the set of second degree connections is typically at least 50,000 entities (e.g., people, schools, employers). It is large enough to be diverse, but at the same time, actionable: you are one new connection away from something that can happen. In fact, your next job is probably in your second degree network.

And what is an economic graph in this context?

LinkedIn’s Economic Graph is a digital representation of the global economy.
Source: https://economicgraph.linkedin.com/

Although LinkedIn had already been using graph-based systems, what sets LIquid apart from the others?

LIquid is the most recent of these systems, and the first one which is general purpose: a database that implements the relational model, fully described by a schema and accessed by a declarative query language with functionality comparable to SQL. In LIquid, edges are triples of strings, (subject, predicate, object), and compounds can be built out of edges to represent n-ary relationships with arbitrary attributes.

A declarative language like SQL: Datalog
A triple-based graph model: RDF
A relational database with a schema (bye-bye, NoSQL)

What problems were found when using OODBs (object-oriented databases) for this kind of application?

Query language: There is no standard declarative OO query language, hence no ability for the database to optimize unconstrained by imperative logic. 

Snapshot consistency: A graph of objects in memory is a snapshot of the state of the world at a single point in time. A consistent snapshot is an extremely expensive illusion to maintain—traditionally via multi-version concurrency control, utterly intractable as soon as the graph is larger than a single machine.

Schema: While the OO premise of modeling objects directly is attractive, it comes with intractable schema development problems. Obvious uses of inheritance lead rather quickly to insurmountable difficulties. Given the ease of creating flawed schema, the fact that OO languages have no answer for schema evolution just compounds the problem. The de-facto answer to the schema evolution problem is to create new classes, ship them in a new executable, and somehow manage to read old serialized data (application files). Sometimes this works.

Relational + ORM?: In practice, there were few situations in which OODBs were substantially better than a relational database combined with an object-relational mapping layer that maintained the graph of objects as in-memory cache.

The transition from a rigid schema to a flexible schema, and finally to a generic schema (the graph)

To express a durable model of the world, this rigid encoding simply doesn’t work. In practice, object-oriented programs work around this problem by creating a series of executables, each encoding some specific schema and storing the data elsewhere (relational tables, documents) using a more flexible schema.

However, adding the type of the relationship to the inverted index is a step in the right direction. Since the use of fixed offsets and structure extension is actually the cause of schema problems, let’s generalize the notion of “type of relationship” to “predicate” in a generic edge:

#include <vector>

struct identity;   // forward declaration: edge refers to identity

struct edge {
  identity* subject;
  identity* predicate;
  identity* object;
};

struct identity {
  std::vector<edge> edges;   // every edge that mentions this identity
};

Notice that all objects are now structurally identical instances of identity, vectors of edges that refer to the object with one of subject, predicate, or object. There’s no schema evolution problem because there’s only one schema. The user can make up instances of identity that are predicates and start using them at any time. The cost of this flexibility is type-specific getter methods that compute their results by filtering edges for those with a matching predicate. 

How is graph data retrieved? How is the impedance mismatch between the relational world and the OO world handled?

... we’re not going to be running application code in the database. Instead, we will be running queries (expressed in a declarative query language like SQL) in the database and shipping the results back to the application code that can use whatever classes it wants to represent the result graph.

How are data consistency and good performance maintained? How was transaction processing handled in a distributed environment?

Thus far, we’re still relying on snapshot consistency. For a query such as FindPredicates to get consistent results, nobody else can change any of the objects that it reads. Simple strategies like locking objects one at a time as the query encounters them are hopelessly slow, deadlock-prone, and wrong. What we need are transactions that behave as if they obtained read or write locks on an entire set of objects in an instant. While transaction processing is extremely well-studied, the sad truth is that relative to a joyful romp through a snapshot in RAM, transactions perform poorly, distributed transactions, extremely poorly. If we’re going to compete with pointer-chasing, we need access to the indexes at pointer-speed, with basically no coordination whatsoever, not even a compare-and-swap.

If we restrict ourselves to a single writer, there is a small vocabulary of data structures that have wait-free implementations: logs, linked lists, hash tables. This is just barely enough for our purposes. To start, the writer will append edges to a single “graph log.” Each edge can then be referred to by its offset in the log and the log offset functions as a virtual time which allows us to define consistency as “answering a query ‘as of’ a certain log offset.” The index will contain copies of edges from the graph log. Each identity/object in the index is essentially a special-purpose mini-log containing just edges that refer to that object. In order to use the indexes to compute a consistent result, we decorate each edge with its offset from the graph log. 

To generate a consistent result, we choose a log offset, probably the current end of the graph_log, and then scan the index log backwards discarding edges until the offset is less than our chosen log offset. Thereafter, any edges we find may be part of the answer to our query.

Thus, relative to a snapshot, we have to do a little more work. But the work is uniformly distributed–there’s no possibility of lock contention or deadlock—and it can take place at the maximum speed that the processor is capable of. In theory, the single writer is a bottleneck, but in practice, we have no trouble appending edges to the log at a rate which exceeds the write rate of a large social graph by many orders of magnitude. An ever-growing log would be a liability for edges which frequently change, but most human-curated data does not have this property. Additions occur at human speed, and deletions, which we handle by appending a tombstone edge, are relatively infrequent.

What kinds of indexes are used, and how are they built?

While heavily optimized for both space and speed, these index structures are the basic index data structures that LIquid uses. They deliver fast, constant-time performance that allows graph edges to be used blithely, much as one would use pointers. While we still have much to describe, the outline of LIquid is clear: we build a log-structured, in-memory inverted index of edges, triples of strings. Internally, the strings are converted to integers, pointers to structures in the object-oriented formulation. We process queries using this index and return the results to the application program as a set of edges, a subgraph. The application program converts the edges into whatever structural idiom is convenient, a graph of structures or a DOM tree, and displays the results to the user. The ingredients for this conversion are simple: mappings from identity string to address or structure offset and possibly a topological sort of result edges so that the target hierarchy can be built bottom-up.

AllegroGraph also converts strings to integers to build its indexes.

Waiting for Part II to see how they handled this problem, which would otherwise be a contraindication to using a relational database.

In the next part, we’ll describe how to express a graph in the relational model, why conventional relational databases don’t work very well for this purpose, and how LIquid can process relational graph data efficiently.

It is worth mentioning that there was another distributed graph database, FlockDB, which used MySQL as its storage engine and was built and maintained by Twitter from 2010 to 2012.

FlockDB is a distributed graph database for storing adjacency lists...
FlockDB is much simpler than other graph databases such as neo4j because it tries to solve fewer problems. It scales horizontally and is designed for on-line, low-latency, high throughput environments such as web-sites.
Twitter uses FlockDB to store social graphs (who follows whom, who blocks whom) and secondary indices. As of April 2010, the Twitter FlockDB cluster stores 13+ billion edges and sustains peak traffic of 20k writes/second and 100k reads/second.

Sources: https://github.com/twitter-archive/flockdb
https://blog.twitter.com/engineering/en_us/a/2010/introducing-flockdb.html
