MillenniumDB: A Persistent, Open-Source, Graph Database

MillenniumDB: A Persistent, Open-Source, Graph Database - Leitura de Artigo

Abstract: In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.

1 INTRODUCTION

Within this graph database landscape, however, we foresee the need for yet another alternative: an open source graph database that implements state-of-the-art techniques; offers performance, scale and reliability competitive with (or ideally better than) enterprise systems; supports a variety of graph data models and query languages; and is extensible. To the best of our knowledge, no existing open source graph database system combines these features.

These foundations include the proposal of a data model – called domain graphs – that provides a simple abstraction capable of supporting multiple graph models and query languages, a novel query syntax adapted for this model, and a query engine that includes state-of-the-art join and path algorithms.

Various benchmarks over the Wikidata knowledge graph show that this first release of MillenniumDB generally outperforms prominent graph database systems – namely Blazegraph, Neo4j, Jena and Virtuoso – in terms of query performance.

2 DATA MODEL

... it (domain graph model) generalizes existing graph data models such as RDF and property graphs. We also show its utility in concisely modeling real-world knowledge graphs that contain higher-arity relations, such as Wikidata [43].

*** WD não tem aridade > 2, não tem hiper arestas, é hiper relacional ***

Formally, assume a universe Obj of objects (ids, strings, numbers, IRIs, etc.). We define domain graphs as follows:

Definition 2.1. A domain graph 𝐺 =(𝑂,𝛾) consists of a finite set of objects 𝑂 ⊆ Obj and a partial mapping 𝛾 : 𝑂 →𝑂 ×𝑂 ×𝑂.

*** Ou seja com o Id da aresta (que é único) é possível obter a relação Node1 x Label da Aresta x Node2 ***

Intuitively, 𝑂 is the set of objects that appear in our graph database, and 𝛾 models edges between objects. If 𝛾(𝑒) =(𝑛1,𝑡,𝑛2), this states that the edge (𝑛1,𝑡,𝑛2) has id 𝑒, type 𝑡, and links the source
node 𝑛1 to the target node 𝑛2

We can analogously define our model as a relation ... where eid (edge id) is a primary key of the relation

DomainGraph(source,type,target,eid)

The domain graph model of MillenniumDB already subsumes the RDF graph model

*** Basta esconder o id da aresta ou tratar como named graph - quad ***

Domain graphs can directly capture property graphs, ... However, given a legacy property graph, there are some potential “incompatibilities” with the resulting domain graph; for example, strings like "male", labels like human, etc., now become nodes in the graph, generating new paths through them that may affect query results.

*** no LPG as propriedades dos nós e arestas são na forma chave-valor onde valor só pode ser um literal e não um nó ***

But it is also heavily inspired by the needs of real-world knowledge graphs like Wikidata

A number of graph models have been proposed to capture higher-arity relations more concisely, including property graphs [13] and RDF* [16].

*** está considerando higher-arity as arestas entre arestas e não uma hiper aresta ***

On the other hand, RDF* allows an edge to be a node. ...

*** na realidade permite que uma tripla seja um nó. Se uma tripla é um grafo composto por dois nós e uma aresta, está mais próximo de um hiper nó e de grafo aninnhados do que hiper aresta e um hipergrafo ***

However, we can only represent (using RDF*) one of the statements (without reification), as we can only have one distinct node per edge;

Domain graphs are inspired by named graphs in RDF/SPARQL (where named graphs have one triple/edge each). Both domain graphs and named graphs can be represented as quads. However,
the edge ids of domain graphs identify each quad, which, as we will discuss in Section 4, simplifies indexing.

Named graphs were proposed to represent multiple RDF graphs for publishing and querying.
SPARQL thus uses syntax – such as FROM, FROM NAMED, GRAPH – that is unintuitive when querying singleton named graphs representing higher-arity relations.

*** Mas é possível acessar sem a cláusula FROM ***

Moreover, SPARQL does not support querying paths that span named graphs; in order to support
path queries over singleton named graphs, all edges would need to be duplicated (virtually or physically) into a single graph [18].

*** Achei estranho, preciso confirmar ***

Named graphs (with multiple edges) could be supported in domain graphs using a reserved term graph, and edges of the form 𝛾(𝑒3) =(𝑒1,graph,𝑔1), 𝛾(𝑒4) =(𝑒2,graph,𝑔1); optionally, named domain graphs could be considered in the future to support multiple domain graphs with quins.

*** Criar arestas adicionais para indicar quais arestas pertenem a quais grafos nomeados traz o mesmo problema da reificação ***

In parallel with our work, recently a data model analogous to domain graphs has been independently proposed for use in Amazon Neptune, which the authors call 1G [23]. Their proposal does not discuss a formal definition for the model, nor a query language, storage and indexing, implementation, etc., but the reasoning and justification that they put forward for the model is similar to ours.

To the best of our knowledge, our work is the first to describe a query language, storage and indexing schemes, query planner – and ultimately a fully-fledged graph database engine – built specifically for this model.

Wikidata requires Edge as nodes.... Only named graphs, domain graphs and property domain graphs can model such examples without reserved terms. Comparing named graphs and domain graphs, the latter sacrifices the “Graph as node” feature without reserved vocabulary to reduce indexing permu-
tations (discussed in Section 4). Such a feature is not needed by Wikidata.

*** Graph as node somente quando tem mais de uma tripla ??? ***

3 QUERY LANGUAGE

While the syntax of these query languages vary widely, they share a large common core in terms of querying primitives. Specifically, all such query languages support different types of graph patterns that transform a graph into a relation (aka. a table) of results [3]. The simplest form of graph pattern is a basic graph pattern, which in its most general form, is a graph pattern structured like the data, but that also allows variables to replace constants. Next we have navigational graph patterns, which allow for specifying path expressions that can match arbitrary-length paths in the graphs. We also have relational graph patterns which, noting that a graph pattern converts a graph into a relation, allows the use of relational operators to transform and/or combine the results of one or more graph patterns into a single relation. Finally, query languages may support other features, such as solution modifiers that order or limit results, convert results back into a graph, etc

We have thus implemented a base query language, called DGQL, which closely resembles Cypher ...

Querying edges. In order to query over edges, we can write the following query, which returns ??, i.e., the relation DomainGraph:

SELECT * MATCH (?x)-[?e TYPE(?t)]->(?y)

We can also restrict which edges are matched. The following query in DGQL will return ids for edges of type child, where both nodes have the same last name, and the child is not the oldest; it also returns the order of the child: SELECT ?e, ?e.order MATCH (?x)-[?e child]->(?y) WHERE (?x.lastname == ?y.lastname) AND (?e.order > "1")
This returns {{?e ↦ e2, ?e.order ↦ 2}} when evaluated over the graph in Figure 1. If we need to query for the second oldest (an equality condition), we could use the syntax (?x)-[?e child {order : "2"}]->(?y) on the edge (or replace (?e.order > "1") with (?e.order == "1")).

** Filtro pelos atributos nas arestas **

Querying known objects.

Path queries .. A key feature of graph databases is their ability to explore paths of arbitrary length. Like SPARQL [15] (but not Cypher), DGQL supports two-way regular path queries (2RPQs), which specify regular expressions over edge types, including concatenation (/), disjunction (|), inverses (^), optional (?), Kleene star (*) and Kleene plus (+). ... Like Cypher (but not SPARQL), DGQL can return a single shortest path witnessing the query result ...

Basic and navigational graph patterns. Basic graph patterns lie at the core of many graph query languages, including DGQL.

Taking advantage ofthe domain graph model. The DGQL query language allows us take full advantage of domain graphs, by allowing joins between edges, types, etc. To illustrate this, consider the following query, evaluated over the domain graph of Figure 4:

SELECT ?x, ?d
MATCH (Michelle Bachelet)-[?e position held]-> (President of Chile),
(?e)-[replaces]->(?x),
(?e)-[start date]->(?d)

The variable ?e invokes a join between an edge and a node, return- ing two solutions: {{?x ↦ Ricardo Lagos, ?d ↦ 2006-03-11}, {?x ↦ Sebastián Piñera, ?d ↦ 2014-03-11}}. Hence, this query is used to return the list ofpresidents that were replaced by Michelle Bachelet, and the dates they were replaced by her.

** Jução com atributos nas arestas para recuperar o valor dos qualificadores **

Optional matches. DGQL also supports optional graph patterns, which behave akin to left outer joins.

Limits and ordering. Some additional operators that MillenniumDB supports are LIMIT and ORDER BY. These allow us to limit the number of output mappings, and sort the obtained results

4 SYSTEM ARCHITECTURE

Storage. Internally, all objects are represented as 8 byte identifiers. To optimize query execution, identifiers are divided into classes and the first byte of the identifier specifies a class it belongs to.

*** 3 classes: nós, arestas e valores (literais) ***

All records stored in MillenniumDB are composed of these identifiers.

To store property domain graphs, MillenniumDB deploys B+ trees [27]. For this purpose, we built a B+ tree template for fixed sized records, which store all classes of identifiers.

All the B+ trees are created through a bulk-import phase, which loads multiple tuples of sorted data, instead of creating the trees by inserting records one by one. In order to enable fast lookups by edge identifier, we use the fact that this attribute is the key for the relation.

Therefore, we also store a table called EdgeTable, which contains triples of the form (source, type, target), such that the position in the table equals to the identifier of the object 𝑒 such that 𝛾(𝑒) =(source, type, target). This implies that edge identifiers must be assigned consecutive ids starting from zero, and they are generated in this way by MillenniumDB (they are not specified by the user). In total, we use ten B+ trees for storing the data.

To transform external strings and values (longer than 7 bytes) to database object ids and values, we have a single binary file called ObjectFile, which contains all such strings concatenated together. The
internal id of an external value is then equal to the position where it is written in the ObjectFile, thus allowing efficient lookups of a value via its id.

The identifiers are generated upon loading, and an additional hash table is kept to map a string to its identifier; we use this to ensure that no value is inserted twice, and to transform explicit values given in a query to their internal ids.

Only (long strings) are currently supported, but the implementation interface allows for adding different value types in a simple manner.

Access methods. All of the stored relations are accessed through linear iterators which provide access to one tuple at a time. All of the data is stored on pages of fixed (but parametrized) size (currently
4kB). The data from disk is loaded into a shared main memory buffer, whose size can be specified upon initializing the MillenniumDB server. The buffer uses the standard clock page replacement policy [27]. Additionally, for improved performance, upon initializing the server, it can be specified that the ObjectFile be loaded into main memory in order to quickly convert internal identifiers to string and integer values that do not fit into 7 bytes.

Evaluating a query. The query execution pipeline follows the standard database template:
(1) The string of the query is parsed into an Abstract Syntax Tree (AST), and checked to be syntactically correct.
(2) A logical plan is generated using a depth-first traversal of the AST, instantiating nodes of the tree as logical operators.
(3) Using a visitor pattern, the logical tree is parsed in order to apply some basic simplifications (e.g. pushing constants from filters into the search pattern).
(4) Finally, another pass of the logical tree using the visitor-pattern is made in order to create a physical plan.

*** Técnicas de BD relacional sendo aplicadas ***

5 BENCHMARKING

The experiments focus on two fundamental query features: (i) basic graph patterns (BGPs); and (ii) path queries.

We compare the performance of MillenniumDB with five persistent graph query engines. First, we include three popular RDF engines: Jena TDB version 4.1.0 [34], Blazegraph (BlazeG for short) version 2.1.6 [41], and Virtuoso version 7.2.6 [11]. We further include a property graph engine: Neo4J community edition 4.3.5 [44]. Finally, we also tested with Jena Leapfrog (Jena LF, for short) – a version of Jena TDB implementing a worst-case optimal leapfrog algorithm [21] – in order to compare the performance of our worst case optimal algorithm with a persistent engine iplementing a similar algorithm for evaluating BGPs. We further include an internal baseline, where we test MillenniumDB with (MillDB LF) and without (MillDB NL) the worst-case optimal join algorithm enabled; nested-loop joins are used in the latter case.

** Não separou o que é volume de dados do tamanho dos índices. MLDB é o dobro do Jena e Neo4J e quase 3x Virtuoso e BlazeGraph **

The data. We use two Wikidata datasets in our experiments. First, in order to compare query performance across various engines, we used the truthy dump version 20210623-truthy-BETA [12], keeping only triples in which (i) the subject position is a Wikidata entity, and (ii) the predicate is a direct property. We call this dataset Wikidata Truthy. The size of the dataset after this process was 1,257,169,959 triples

However, MillenniumDB was designed with the complete version of Wikidata – including qualifiers, references, etc. – in mind. We thus specifically test MillenniumDB on a second dataset, called Wikidata Complete, which comprises the entire Wikidata knowledge graph, including the aforementioned features.

*** Dois datasets, um sem qualificadores e outro com qualificadores. Não fez uso de reificação ao carregar nos demais sistemas pq só carregou o dataset sem qualificadores ***

To store the data, default indices were used on Jena TDB, Blazegraph and Virtuoso. Jena LF stores three additional permutations of the stored triples to efficiently support the leapfrog algorithm for any join query. Neo4j by default creates an index for edge types (as ofversion 4.3.5). To support faster searches for particular entities and properties, we also created an index linking a Wikidata identifier (such as, e.g., Q510) to its internal id in Neo4j. We also tried to index literal values in Neo4j, but the process failed (the literals are still stored). MillenniumDB uses extra disk space, because of the additional indices needed to support the domain graph model. Similarly, Jena LF indexes more permutations than Jena TDB in order to support worst-case optimal joins, hence using more space.

For scale-up experiments, we subsequently loaded Wikidata Complete into MillenniumDB, which further exploits the domain graph model. In this version, we model qualifiers (i.e. edges on
edges), put labels on objects, and assign them properties with values. We use properties to store the language value of each string in Wikidata, and also to model elements of complex data values ...
Qualifiers were loaded in their totality (i.e., not only the preferred qualifiers, but all specified ones). The only elements excluded from the data dump (for now) were sitelinks and references. This full
version of Wikidata resulted in a knowledge graph with roughly 300 million objects, participating in 4.3 ×109 edges. The total size on disk of this data was 827GB in MillenniumDB, i.e., more than
four times larger than Wikidata Truthy.

** Com qualificadores o tamanho ficou 4x maior **

How we ran the queries. We detail the query sets used for the experiments in their respective subsections. To simulate a realistic database load, we do not split queries into cold/hot run segments. Rather, we run them in succession, one after another, after a cold start of each system (and after cleaning the OS cache). This simulates the fact that query performance can vary significantly based on the state of the system buffer, or even on the state of the hard drive, or the state of OS’s virtual memory. For each system, queries were run in the same order. We record the execution time of each individual query, which includes iterating over all results. We set a limit of 100,000 distinct results for each query, again in order to enable comparability as some engines showed instability when returning larger results. Jena and Blazegraph were assigned 64GB or RAM, and Virtuoso was set up with 64GB or more of RAM as is recommended. Neo4J was run with default settings, while MillenniumDB had access to 32GB for main-memory buffer, and it uses an additional 10GB for in-memory dictionaries.

Handling timeouts. We defined a timeout of 10 minutes per query for each system. As we will see, MillenniumDB was the only system not to time out on any query.

Troublesome BGPs. The Wikidata SPARQL query log contains millions of queries ... a smaller log of queries that timed-out on the Wikidata public endpoint [24]. From these queries we extracted their BGPs, removing duplicates (modulo isomorphism on query variables). We further removed queries that give zero results over Wikidata Truthy. We distinguish queries consisting of a single triple pattern (Single) from those containing more than one triple pattern (Multiple). The former set tests the triple matching capabilities of the systems, whereas queries in the latter set test join performance. Single contains 399 queries, whereas Multiple has 436 queries.

MillenniumDB sharply outperforms the alternatives on the Single queries. The next best performers on average, Blazegraph and Virtuoso, are more than 30 times slower for average runtimes. In terms of medians, Blazegraph comes closest, but is still almost twice as slow as MillenniumDB.

MillenniumDB also outperforms the other systems on queries of the Multiple set. Its medians are an order of magnitude faster than those of Blazegraph, the next best contender. Indeed, the median of Blazegraph is an order of magnitude higher than all the queries in MillenniumDB (excluding outliers).

Complex BGPs. This is a benchmark used to test the performance of worst-case optimal joins [21]. Here, 17 different complex join patterns were selected, and 50 different queries generated for each pattern, resulting in a total of 850 queries. .... In this case, the difference between the two evaluation strategies of MillenniumDB is more clear. The worst-case-optimal version (MillDB LF) is not only considerably more stable than the nested-loop version (MillDB NL), but also twice as fast in the median. After MillDB LF, the next- best competitor is Jena LF, showing the benefits of worse-case optimal joins. Virtuoso follows not far behind, while MillDB NL, Jena, Blazegraph and Neo4j are considerably slower.

Property Paths To test the performance of path queries, we extracted 2RPQ expressions from a log of queries that timed out on the Wikidata endpoint. The original log has 2110 queries. After removing queries that yield no results (due to missing data), we ended up with 1683 queries. These were run in succession, each restricted to return at most 100,000 results. Each system was started after cleaning the system cache, and with a timeout of 10 minutes. Since these are originally SPARQL queries, not all of them were sup- ported by Neo4J given the restricted regular-expression syntax it supports. We remark that MillenniumDB and Neo4J were the only systems able to handle timeouts without being restarted.9 In this comparison we did not include MillDB NL, nor Jena LF, since these deploy precisely the same execution strategy for property paths as do MillDB LF, and Jena, respectively.

We can observe that, again, MillenniumDB is the fastest and has the most stable performance, being able to run all the queries. Its average is near a second, i.e., five times faster than the next best contender (Virtuoso). Its median, below 0.1 seconds, is half the next one (Jena’s). Even after removing the queries that timed-out on the other systems, they are considerably slower than MillenniumDB.

Wikidata Complete To show the scalability of MillenniumDB, and to leverage its domain graph, we ran experiments with Wikidata Complete to mea- sure query performance. We ran the same queries from the four benchmarks above (Single, Multiple, and Complex BGPs, as well as property paths). The number of outputs on the two versions of the data, while not the same, was within the same order of magnitude averaged over all the queries. ... As we can observe, MillenniumDB shows no deterioration in performance when a larger database is considered for similar queries.

** Só executou no MLDB, não comparou com outros e nem incluiu consultas que explorassem os qualificadores **

6 CONCLUSIONS AND LOOKING AHEAD

Domain graphs adopt the natural idea of adding edge ids to directed labeled edges in order to concisely model higher-arity relations in graphs, as needed in Wikidata, without the need for reserved vocabulary or reification.

While the idea of using edge ids as a hook for modeling higher-arity relations in graphs is far from new (see, e.g., [18, 22, 23]), it is an idea that is garnering increased attention as a more flexible and concise alternative to reification.

*** Falta ler a 23 do Amazon Neptune ***

[18] Daniel Hernández, Aidan Hogan, and Markus Krötzsch. 2015. Reifying RDF: What
Works Well With Wikidata?. In Proceedings of the 11th International Workshop on
Scalable Semantic Web Knowledge Base Systems co-located with 14th International
SemanticWeb Conference (ISWC 2015), Bethlehem, PA, USA, October 11, 2015 (CEUR
Workshop Proceedings), Thorsten Liebig and Achille Fokoue (Eds.), Vol. 1457.
CEUR-WS.org, 32–47. http://ceur-ws.org/Vol-1457/SSWS2015_paper3.pdf

[22] Filip Ilievski, Daniel Garijo, Hans Chalupsky, Naren Teja Divvala, Yixiang Yao,
Craig Milo Rogers, Ronpeng Li, Jun Liu, Amandeep Singh, Daniel Schwabe,
and Pedro A. Szekely. 2020. KGTK: A Toolkit for Large Knowledge Graph
Manipulation and Analysis. In International Semantic Web Conference (ISWC).
Springer, 278–293.

[23] Ora Lassila, Michael Schmidt, Brad Bebee, Dave Bechberger, Willem Broekema,
Ankesh Khandelwal, Kelvin Lawrence, Ronak Sharda, and Bryan B. Thompson.
2021. Graph? Yes! Which one? Help! CoRR abs/2110.13348 (2021).
arXiv:2110.13348 https://arxiv.org/abs/2110.13348

To the best of our knowledge, our work is the first to propose a formal data model that incorporates edge ids, a query language that can take advantage of them, and a fully-fledged graph database engine that supports them by design.

With respect to the query language, we have proposed a new query syntax inspired by Cypher, but that additionally enables users to take full advantage of the domain graph model by (optionally) referencing edge ids in their queries, and performing joins on any element of the domain graph.

*** Kypher também foi criada como uma Cypher modificada para permitir referenciar o id da aresta ***

In the implementation of MillenniumDB, we combine both tried-and-trusted techniques that have been successfully used in relational database pipelines for decades [27] (e.g., B+ trees, buffer managers, etc.

Regarding more practical features, we aim to add support for full transactions, compact index structures, keyword search, a graph update language, existing graph query languages, and more besides.

More importantly, given that MillenniumDB is published as an open source engine, we hope that the research community can view the MillenniumDB code base as a sort of a sandbox for incorporating their novel algorithms and ideas into a modern graph database, without the need to remake storage, indexing, access methods, or query parsers. We also wish to explore the deployment of MillenniumDB for key use-cases; for example, we plan to provide and host an alternative query service for Wikidata, which may help to prioritize the addition of novel features and optimizations as needed in practice.

GitHub do GraphDB -> https://github.com/MillenniumDB/MillenniumDB

Quad Model

Currently QuadModel is the only one implemented. The only restriction to the generic model presented before is that every connection must have one type. Thus connections can be saved as a tuple of 4 elements: <ConnectionID, FromID, ToID, TypeID>.

GitHub do Benchmark -> https://github.com/MillenniumDB/benchmark

Comentários

Veronica dos Santos20 de dezembro de 2022 às 18:17
“Truthy statements represent statements that have the best non-deprecated rank for a given property. Namely, if there is a preferred statement for a property P, then only preferred statements for P will be considered truthy. Otherwise, all normal-rank statements for P are considered truthy.”
ResponderExcluir
Respostas

Adicionar comentário

Postar um comentário

Sinta-se a vontade para comentar. Críticas construtivas são sempre bem vindas.

Pesquisa de Doutorado da Veronica

Pesquisar este blog