Hernández, D., Hogan, A., Riveros, C., Rojas, C., & Zerega, E. (2016). Querying Wikidata: Comparing SPARQL, Relational and Graph Databases. In P. Groth, E. Simperl, A. J. G. Gray, M. Sabou, M. Krötzsch, F. Lécué, F. Flöck, & Y. Gil (Eds.), The Semantic Web - {ISWC} 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part {II} (Vol. 9982, pp. 88–103). https://doi.org/10.1007/978-3-319-46547-0_10
Abstract
... querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers.
We take two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can be represented in the models of each engine.
Introduction
... factual information in different locations will often be inconsistent. The aim of Wikidata is to instead keep such factual knowledge in one place: facts need only be edited once and can be drawn upon from multiple locations.
.... (WDQS) runs over an RDF representation of Wikidata that is indexed in the Blazegraph SPARQL store (formerly known as BigData) [20].
In Wikidata, items are connected either to related items or datatype values by directed, named relations. There is thus a close correspondence between Wikidata and RDF. However, relations in Wikidata can be annotated with attribute–value pairs, such as qualifiers and references, to specify a further context for the relation (e.g., validity, cause, degree, etc.). This complicates the representation of Wikidata in RDF somewhat, requiring some form of reification [11] to capture the full knowledge-base in RDF.
A follow-up question arising from our initial work was how the performance of these SPARQL engines would compare with that of other technologies. In this paper, we thus extend our previous work by comparing a selection of SPARQL, relational and graph databases for the purposes of querying Wikidata. We select these families of databases since they correspond with the inherent structure of Wikidata and they offer support for comprehensively-featured declarative query languages as needed for the public query service (which rules out, for example, key–value stores, document stores and column-family stores). To conduct these comparisons, we design and apply a range of novel experiments.
Our first set of experiments is based on the idea of performing sets of lookups for “atomic patterns” with exhaustive combinations of constants and variables; these results give an initial idea of the low-level performance of each configuration.
Next we perform experiments on sets of instances of basic graph patterns that generalise the graph motifs we found to commonly occur in the use-case queries listed at the public query service.
We also discuss how well the various engines support the query features commonly used for these use-case queries.
The Wikidata Model
Finally, it is important to note that Wikidata can contain multiple distinct statements with the same binary relation: for example, Grover Cleveland was the U.S. President for two non-consecutive terms (i.e., with different start and end times, different predecessors and successors). In Wikidata, this is represented as two separate statements whose primary relations are identical, but whose qualifiers (start time, end time, follows, followed by) differ.
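The Cleveland example can be sketched in a few lines of Python. Note this is an illustrative reconstruction, not code from the paper, and the entity/property IDs used here are assumptions rather than verified Wikidata identifiers.

```python
# Two Wikidata statements sharing the same primary (s, p, o) relation,
# distinguished only by their qualifiers. IDs are illustrative assumptions.
statements = [
    {
        "id": "stmt1",
        "subject": "Q35171",   # Grover Cleveland (assumed ID)
        "property": "P39",     # position held (assumed ID)
        "object": "Q11696",    # President of the United States (assumed ID)
        "qualifiers": {"start time": "1885", "end time": "1889"},
    },
    {
        "id": "stmt2",
        "subject": "Q35171",
        "property": "P39",
        "object": "Q11696",
        "qualifiers": {"start time": "1893", "end time": "1897"},
    },
]

# Both statements collapse to a single primary relation...
primary = {(s["subject"], s["property"], s["object"]) for s in statements}
assert len(primary) == 1
# ...so only the qualifiers keep the two terms apart.
assert statements[0]["qualifiers"] != statements[1]["qualifiers"]
```

This is precisely the case that plain triples cannot capture, motivating the reification schemes discussed later.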
Wikidata Representations
In the context of querying Wikidata, our goal is to compare the performance of a selection of database engines from three different families: SPARQL, relational and graph. Each family is associated with a different data model: named (RDF) graphs, relations, and property graphs, respectively. Each data model may permit multiple possible representations of Wikidata, for example, different RDF reification schemes, different relational schema, etc.
RDF/Named Graph Representations
Relational representations
... the relational representation we use for Wikidata, which involves three tables: Statement stores the primary relation and a statement id; Qualifier associates one or more qualifiers with a statement id; and Label associates each entity and property with one or more multilingual labels.
One complication with the relational model is that certain values – in particular the object (o) and qualifier value (v) – can have multiple types. Aside from entities, properties and logical values (e.g., exists, not exists), Wikidata contains four data-types – numeric, time, coordinates and strings – where as previously mentioned, each datatype can contain meta-information such as a calendar, precision, etc. One option is to store all values in one column as strings (e.g., JSON objects); however, this precludes the possibility of using such datatype values in filters, doing range queries, ordering, etc. Another possibility is to create separate columns, as exemplified in Figure 3 with odate and vdate; however, we would need at least five such columns to support all Wikidata datatypes, leading to a lot of nulls in the table, and we still cannot store the meta-information. A third option would be to create separate tables for different datatype values, but this increases the complexity of joins. We currently use JSON strings to serialise datatype values, along with their meta-information, and store different types in different columns (i.e., as per Figure 3 with, e.g., odate and vdate).
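The chosen hybrid (typed columns holding JSON-serialised values plus meta-information) can be sketched as follows. This is a minimal reconstruction under our assumptions – column names beyond the s, p, o, odate, q, v, vdate mentioned in the text, and the exact JSON layout, are illustrative, not the paper's schema.

```python
import json

def encode_value(value, value_type, **meta):
    """Serialise a datatype value together with its meta-information
    (calendar, precision, ...) as a JSON string."""
    return json.dumps({"type": value_type, "value": value, **meta})

# One qualified statement becomes one Statement row plus one Qualifier
# row per qualifier, linked by the statement id.
statement_row = {
    "id": 1,
    "s": "Q42", "p": "P69", "o": "Q691283",  # primary relation (assumed IDs)
    "odate": None,                           # typed column unused here
}
qualifier_rows = [
    {"id": 1, "q": "P580",                   # start time (assumed ID)
     "v": None,
     "vdate": encode_value("1972-01-01", "time",
                           calendar="gregorian", precision="year")},
]

# The meta-information survives the round trip, unlike a bare date column.
decoded = json.loads(qualifier_rows[0]["vdate"])
```

Storing the JSON in a typed column (vdate) rather than a single catch-all string column is what preserves the ability to filter and order on dates while still carrying the calendar and precision.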
Property graph representations
One may argue that property graphs offer a natural fit for Wikidata, where
we first tried to represent Wikidata in Neo4J analogous to the “direct representation” given in Figure 4, but we encountered a number of fundamental issues. Firstly, Neo4J only allows one value per attribute property: this could be circumvented by using list values or special predicates such as label_en. Secondly and more generally, Neo4J does not currently support queries involving joins or lookups on any information associated with edges, including edge ids, edge types or edge attributes.
Experimental Setting
We now describe the setting for our experiments. Note that further details, including scripts, queries, raw data, configuration settings, instructions, etc., are provided at https://dx.doi.org/10.6084/m9.figshare.3219217.
Data: We use the Wikidata dump in JSON format from 2016/01/04
Machine: 2× Intel Xeon Quad Core E5-2609 V3 CPUs, 32GB of RAM, and 2× 2TB Seagate SATA hard-disks (32MB cache) in a RAID-1 configuration.
Engines: Blazegraph 2.1.0, Virtuoso 7.2.3-dev.3215-pthreads, PostgreSQL 9.1.20, Neo4J-community-2.3.1
Although Blazegraph supports a setting called RDF* to support reification [10], this model is non-standard and cannot directly encode multiple Wikidata statements with the same primary relation: in RDF*, each RDF triple must be unique. Though RDF* could be used to emulate named graphs, we instead use “Quad Mode” with standard SPARQL settings.
<<But Blazegraph supports the current WDQS>>
We used the default indexes labelled PSOG, POGS, SP, OP and GS, where S, P and O correspond to the positions of a triple and G to the name of a graph. Each such index is sorted and allows for prefix lookups, ...
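How a sorted quad index supports prefix lookups can be sketched in a few lines. This is a toy model of the idea, not Virtuoso's implementation: with an index sorted in PSOG order, binding a prefix such as (p) or (p, s) narrows the matches to one contiguous run, found with two binary searches.

```python
import bisect

# Toy PSOG index: quads sorted as (predicate, subject, object, graph).
index_psog = sorted([
    ("P31", "Q1", "Q5", "g1"),
    ("P31", "Q2", "Q5", "g2"),
    ("P39", "Q1", "Q11696", "g3"),
])

def prefix_lookup(index, prefix):
    """Return all quads whose leading components match the bound prefix."""
    lo = bisect.bisect_left(index, prefix)
    # A sentinel larger than any real term bounds the run from above.
    hi = bisect.bisect_right(index, prefix + ("\U0010FFFF",))
    return index[lo:hi]

all_p31 = prefix_lookup(index_psog, ("P31",))        # pattern (?, P31, ?, ?)
one_quad = prefix_lookup(index_psog, ("P31", "Q2"))  # pattern (Q2, P31, ?, ?)
```

The same trick applies to each of the other sort orders (POGS, SP, OP, GS), which is why a pattern whose bound terms form a prefix of some index can be answered in a single range scan.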
We used PostgreSQL 9.1.20 set with maintenance_work_mem = 1920MB and shared_buffers = 7680MB. A secondary index (i.e. B-tree) was set for each attribute that stores either entities, properties or data values (e.g. dates) from Wikidata. Specifically, in the statement table (Figure 3), each attribute s, p, o, and odate has an index, as do the attributes q, v, and vdate in the qualifier table. In all other tables, only foreign-key attributes were indexed, such as, for example, attribute e in the Label table.
We used Neo4J-community-2.3.1 setting a 20GB heap. We used indexes to map from entity ids (e.g., Q42) and property ids (e.g., P1432) to their respective nodes. By default, Neo4J indexes nodes in the graph and their adjacent nodes in a linked-list style structure, such that it can navigate from a node to its neighbour(s) along a specific edge without having to refer back to the index.
Experimental Results
Atomic lookups: In our first experiments, we wish to test the performance of atomic lookups over Wikidata statements, where we focus in particular on qualified statements that require more complex representations to encode.
To begin, we generated a set of 1.5 million (data) quins from Wikidata, shuffled them, and used them to randomly generate 300 unique instances for each of the 31 patterns (we exclude the open pattern (?, ?, ?, ?, ?), which only has one instance). We select 300 since there are only 341 instances of (?, p, ?, ?, ?) (i.e., there are 341 unique properties used on qualified statements); hence 300 allows us to have the same number of instances per pattern. In total, we have 9,300 instances, which we translate into concrete queries for our six experimental representations. Since instances of patterns such as (?, p, ?, ?, ?) would generate millions of results, we set a limit of 10,000 results on all queries.
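The pattern-generation step can be sketched as follows: each quin (s, p, o, q, v) yields 2^5 = 32 masks of constants and variables, and excluding the fully open mask leaves the 31 patterns tested. This is an illustrative reconstruction, not the paper's scripts, and the sample quin uses assumed IDs.

```python
from itertools import product

# One illustrative quin: (subject, property, object, qualifier, value).
quin = ("Q298", "P39", "Q11696", "P580", "1990")

# Every constant/variable mask except the all-variable pattern (?,?,?,?,?).
patterns = [
    tuple(term if keep else "?" for term, keep in zip(quin, mask))
    for mask in product([True, False], repeat=5)
    if any(mask)
]
```

Each of the 31 abstract patterns is then instantiated 300 times from different shuffled quins and translated into a concrete query per representation.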
Figure legend: Mean response times for the 31 patterns, and mean response times for All patterns, where N, B, V and P refer to Neo4J, Blazegraph, Virtuoso and Postgres, resp.; w/Lab and wo/Lab refer to Neo4J representations with and without label information, resp.; NR, SR, SP, and NG refer to n-ary relations, standard reification, singleton properties and named graphs, resp.; one, two and three (red) dots on a bar indicate that ]0–33%], ]33–66%] and ]66–100%] of queries timed out, resp.; timeouts are counted as 60 seconds; the x-axis labels indicate the pattern and the mean number of complete results for the instances
We see that PostgreSQL performs best for all but three patterns, and is often an order of magnitude faster than the next closest result, most often in the lower range of query times from 1–100 ms, where we believe that the choice of client may have an effect: the direct client that PostgreSQL offers does not incur the overhead of REST incurred by other engines, which may be significant in this range. In addition, PostgreSQL can answer instances of some patterns with a mean in the 1–10 ms range, suggesting that PostgreSQL is perhaps not touching the hard-disk in such cases (a SATA hard-disk takes on the order of 10 ms to seek), and is making more efficient use of RAM than other engines.
For example, in the named graphs representation, which is most similar to the relational version, a quin is encoded with a quad (s, p, o, i) (a triple in named graph i) and a triple (i, q, v) (in the default graph). There is no special key position, but Virtuoso, with its default indexes – PSOG, POGS, SP, OP, GS – is best suited to cases where predicates are bound along with a subject or object value. The pattern (?, p, ?, ?, v) in named graphs then illustrates a nasty case for Virtuoso: if the engine starts with the much more selective v constant (starting with p would generate many more initial results), its only choice is to use the OP index (the only index where O is key) to get q from (i, q, v) using v, then POGS to get i from the same triple (i, q, v) using (q, v), then GS to get s from (s, p, o, i) using i, then SP to get p again from (s, p, o, i) using s, then PSOG to get o again from (s, p, o, i) using (s, p), and so we end up joining across all five indexes in Virtuoso instead of two in the case of PostgreSQL. Similar cases can be found for other patterns and representations, where the lack of explicit keys complicates performing joins.
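The named-graphs encoding described above can be sketched in a few lines. This is a minimal reconstruction under our assumptions: each quin (s, p, o, q, v) becomes a quad (s, p, o, i) in a fresh named graph i plus a triple (i, q, v) in the default graph, with i acting as the join key between the two.

```python
def encode_named_graphs(quins):
    """Encode quins as (quad in named graph i, triple in default graph)."""
    quads, default_graph = [], []
    for n, (s, p, o, q, v) in enumerate(quins):
        i = f"stmt{n}"                # fresh statement/graph identifier
        quads.append((s, p, o, i))    # primary relation, graph name i
        default_graph.append((i, q, v))  # qualifier attached via i
    return quads, default_graph

quads, dg = encode_named_graphs([("Q42", "P69", "Q691283", "P580", "1971")])
```

Because i is not a declared key, the engine cannot assume a quad and its qualifier triple join one-to-one, which is why resolving a pattern like (?, p, ?, ?, v) can cascade across all five indexes rather than two table lookups.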
On the other hand, we see from the mean of All queries that the gap between Virtuoso and PostgreSQL closes somewhat: this is because Virtuoso is often competitive with PostgreSQL for more expensive patterns, even outperforming it for the least selective pattern (?, p, ?, ?, ?). These patterns with higher runtimes have a greater overall effect on the mean over all queries.
The atomic-lookup experiments show a clear distinction between two pairs of engines: Neo4J–Blazegraph and Virtuoso–PostgreSQL. Although PostgreSQL generally outperformed Virtuoso, these results may not hold for other types of queries.
Basic graph patterns: ... we studied the 94 example queries present on the official Wikidata query service – queries submitted by the Wikidata community as exemplary use-cases of the service – to see what are the most common types of joins that appear. From this study, we identified that queries most commonly feature “s–s” joins – joins from subject to subject – and “s–o” joins – joins from subject to object, specifically on the primary relation. ... We thus designed a set of experiments to generate a large number of instances of basic graph patterns (bgps) that follow the same abstract structure as most commonly observed in the use-case queries.
The first query pattern, which we refer to as depth-1 “snowflake”, starts with
a subject variable and adds between 1 and 6 edges to it. The predicate of each edge is bound, and each object is made a constant or a variable with equal probability. The subject variable is projected; object variables are projected or not with equal probability.
The second query pattern, which we call depth-2 “snowflake”, adds a second layer to the query. More specifically, we start with a depth-1 query and then select an object variable at random to be used as the subject variable for a nested depth-1 bgp ...
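The depth-1 generation step can be sketched as follows. This is our reconstruction of the procedure described above, not the paper's generator; the predicate and constant pools are illustrative.

```python
import random

def depth1_snowflake(rng, predicates, constants):
    """Generate a depth-1 'snowflake' BGP: a subject variable with 1-6
    bound-predicate edges; each object is a constant or a variable with
    equal probability, and object variables are projected with prob. 1/2."""
    n_edges = rng.randint(1, 6)
    triples, projected = [], {"?s"}   # the subject variable is always projected
    for i in range(n_edges):
        p = rng.choice(predicates)    # predicates are always bound
        if rng.random() < 0.5:
            o = rng.choice(constants)  # bound object
        else:
            o = f"?o{i}"               # object variable
            if rng.random() < 0.5:
                projected.add(o)
        triples.append(("?s", p, o))
    return triples, projected

rng = random.Random(0)
bgp, proj = depth1_snowflake(rng, ["P31", "P39"], ["Q5", "Q11696"])
```

A depth-2 snowflake would then pick one of the object variables of such a BGP at random and grow a second depth-1 BGP from it.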
For each setting, we run each batch of 300 queries sequentially and separately, clearing the cache and resetting the engine before each such batch. As before, each query is set to return a limit of 10,000 results and each engine is assigned an internal timeout of 60 seconds ...
We see that this time, Virtuoso performs best overall, where in particular the n-ary relations and named graph representations experience zero timeouts and outperform PostgreSQL (with a 15–19% timeout rate) by an order of magnitude in both experiments.
Virtuoso’s indexing scheme – which in a relational sense considers predicates as a form of key for complete quads – is well-tuned for this morphology of query and excels for these settings. In more depth, Virtuoso can perform prefix lookups, meaning that a pattern (s, p, ?) or (?, p, o) can be performed in one lookup on its PSOG/POGS indexes respectively.
Query features: The aforementioned 94 use-case queries (taken from the public query service) use a mix of query features, where from SPARQL 1.0, 45/94 queries use ORDER BY, 38/94 use OPTIONAL, 21/94 use LIMIT, and 10/94 use UNION; and from SPARQL 1.1, 27/94 use GROUP BY, 18/94 use recursive property paths (i.e., * or +, mainly for traversing type hierarchies), 10/94 use BIND, and 5/94 use negation in the form of NOT EXISTS. Likewise, queries often use FILTER and occasionally sub-queries, though usually relating to handling labels. In any case, we see that a broad range of SPARQL (1.1) features are often used.
The first issue was with recursive property paths. With respect to SPARQL, these queries cannot be expressed for SR and SP due to how the RDF is structured, and are only possible in NG assuming the default graph contains the union (or merge) of all named graphs.
Related Work
Conclusion
Our experiments revealed a number of strengths and weaknesses of the tested engines and representations in the context of Wikidata. In terms of where systems could be improved, we recommend that Blazegraph and Neo4J should better isolate queries such that one poorly performing query does not cause a domino effect.
Likewise, our results show that Neo4J would benefit from better query planning statistics and algorithms, as well as the provision of more customisable indices, particularly on edge information.
With respect to Virtuoso, it falls behind PostgreSQL in our first experiments due to the default indexing scheme chosen, but performs better in our second experiments based on real-world queries where predicates are bound and joins follow more “typical” patterns.
On the other hand, PostgreSQL could benefit from better support for semi-structured information, where JSONB is indeed a step in the right direction.
Regarding representations, we found that standard reification and named graphs performed best, with n-ary relations following in third, and singleton properties not being well-supported.
... but based on our results for query performance, the best configuration for a Wikidata query service would probably (currently) use Virtuoso and named graphs: a combination that performed well in experiments and that supports the various query features needed, including property paths.