
Querying Wikidata: Comparing SPARQL, Relational and Graph Databases - Article Reading

Hernández, D., Hogan, A., Riveros, C., Rojas, C., & Zerega, E. (2016). Querying Wikidata: Comparing SPARQL, Relational and Graph Databases. In P. Groth, E. Simperl, A. J. G. Gray, M. Sabou, M. Krötzsch, F. Lécué, F. Flöck, & Y. Gil (Eds.), The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part II (Vol. 9982, pp. 88–103). https://doi.org/10.1007/978-3-319-46547-0_10

Abstract

... querying the Wikidata knowledge-base, which can be conceptualised as a directed edge-labelled graph where edges can be annotated with meta-information called qualifiers. 

We take two popular SPARQL databases (Virtuoso, Blazegraph), a popular relational database (PostgreSQL), and a popular graph database (Neo4J) for comparison and discuss various options as to how Wikidata can be represented in the models of each engine. 

Introduction

... factual information in different locations will often be inconsistent. The aim of Wikidata is to instead keep such factual knowledge in one place: facts need be edited only once and can be drawn upon from multiple locations.

.... (WDQS) runs over an RDF representation of Wikidata that is indexed in the Blazegraph SPARQL store (formerly known as BigData) [20].

In Wikidata, items are connected either to related items or datatype values by directed, named relations. There is thus a close correspondence between Wikidata and RDF. However, relations in Wikidata can be annotated with attribute–value pairs, such as qualifiers and references, to specify a further context for the relation (e.g., validity, cause, degree, etc.). This complicates the representation of Wikidata in RDF somewhat, requiring some form of reification [11] to capture the full knowledge-base in RDF. 

A follow-up question arising from our initial work was how the performance of these SPARQL engines would compare with that of other technologies. In this paper, we thus extend our previous work by comparing a selection of SPARQL, relational and graph databases for the purposes of querying Wikidata. We select these families of databases since they correspond with the inherent structure of Wikidata and they offer support for comprehensively-featured declarative query languages as needed for the public query service (which rules out, for example, key–value stores, document stores and column-family stores). To conduct these comparisons, we design and apply a range of novel experiments.

Our first set of experiments is based on the idea of performing sets of lookups for “atomic patterns” with exhaustive combinations of constants and variables; these results give an initial idea of the low-level performance of each configuration. 

Next we perform experiments on sets of instances of basic graph patterns that generalise the graph motifs we found to commonly occur in the use-case queries listed at the public query service. 

We also discuss how well the various engines support the query features commonly used for these use-case queries.

The Wikidata Model

Finally, it is important to note that Wikidata can contain multiple distinct statements with the same binary relation: for example, Grover Cleveland was the U.S. President for two non-consecutive terms (i.e., with different start and end times, different predecessors and successors). In Wikidata, this is represented as two separate statements whose primary relations are both identical, but where the qualifiers (start time, end time, follows, followed by) differ. 
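The example can be made concrete with a minimal sketch of the statement model. The identifiers are real Wikidata IDs to the best of my knowledge (Q35171 = Grover Cleveland, P39 = position held, Q11696 = President of the United States, P580/P582 = start/end time); the Python structure itself is purely illustrative:

```python
# Minimal sketch of the Wikidata statement model: a primary relation
# (subject, property, value) plus a set of qualifier (property, value)
# pairs. The dataclass layout is illustrative, not from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    subject: str          # item ID, e.g. Q35171 (Grover Cleveland)
    property: str         # property ID, e.g. P39 (position held)
    value: str            # item ID, e.g. Q11696 (U.S. President)
    qualifiers: tuple = ()  # ((qualifier property, value), ...)

# Two distinct statements with an identical primary relation, kept
# distinct only by their qualifiers (here start/end time as year strings).
term1 = Statement("Q35171", "P39", "Q11696", (("P580", "1885"), ("P582", "1889")))
term2 = Statement("Q35171", "P39", "Q11696", (("P580", "1893"), ("P582", "1897")))

assert (term1.subject, term1.property, term1.value) == \
       (term2.subject, term2.property, term2.value)
assert term1 != term2  # the qualifiers distinguish the two terms
```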

Wikidata Representations

In the context of querying Wikidata, our goal is to compare the performance of a selection of database engines from three different families: SPARQL, relational and graph. Each family is associated with a different data model: named (RDF) graphs, relations, and property graphs, respectively. Each data model may permit multiple possible representations of Wikidata, for example, different RDF reification schemes, different relational schema, etc.

RDF/Named Graph Representations

Relational representations 

... the relational representation we use for Wikidata, which involves three tables: Statement stores the primary relation and a statement id; Qualifier associates one or more qualifiers with a statement id; and Label associates each entity and property with one or more multilingual labels. 

One complication with the relational model is that certain values – in particular the object (o) and qualifier value (v) – can have multiple types. Aside from entities, properties and logical values (e.g., exists, not exists), Wikidata contains four data-types – numeric, time, coordinates and strings – where as previously mentioned, each datatype can contain meta-information such as a calendar, precision, etc. One option is to store all values in one column as strings (e.g., JSON objects); however, this precludes the possibility of using such datatype values in filters, doing range queries, ordering, etc. Another possibility is to create separate columns, such as exemplified in Figure 3 with odate and vdate; however, we would need at least five such columns to support all Wikidata datatypes, leading to a lot of nulls in the table, and we still cannot store the meta-information. A third option would be to create separate tables for different datatype values, but this increases the complexity of joins. We currently use JSON strings to serialise datatype values, along with their meta-information, and store different types in different columns (i.e., as per Figure 3 with, e.g., odate and vdate).
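The three-table layout can be sketched with SQLite for illustration (the paper uses PostgreSQL). The column names s, p, o, odate, q, v, vdate and e are taken from the text; the statement-id column name ("id") and the exact JSON layout of the datatype values are assumptions:

```python
# Sketch of the three-table relational representation: Statement holds the
# primary relation with a statement id; Qualifier attaches qualifiers to a
# statement id; Label maps entities/properties to multilingual labels.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Statement (id TEXT, s TEXT, p TEXT, o TEXT, odate TEXT);
CREATE TABLE Qualifier (id TEXT, q TEXT, v TEXT, vdate TEXT);
CREATE TABLE Label     (e TEXT, lang TEXT, label TEXT);
""")

# A qualified statement: the qualifier's date value is serialised as a JSON
# string that also carries its meta-information (calendar, precision, ...).
start = json.dumps({"time": "1885-03-04", "precision": 11, "calendar": "gregorian"})
db.execute("INSERT INTO Statement VALUES ('st1', 'Q35171', 'P39', 'Q11696', NULL)")
db.execute("INSERT INTO Qualifier VALUES ('st1', 'P580', NULL, ?)", (start,))
db.execute("INSERT INTO Label VALUES ('Q35171', 'en', 'Grover Cleveland')")

# Statements and their qualifiers join on the statement id:
row = db.execute("""
    SELECT st.s, st.p, st.o, qu.q, qu.vdate
    FROM Statement st JOIN Qualifier qu ON st.id = qu.id
""").fetchone()
assert row[:4] == ("Q35171", "P39", "Q11696", "P580")
assert json.loads(row[4])["precision"] == 11
```

Storing the date as JSON keeps the meta-information, but, as the excerpt notes, the separate vdate column is what makes filters and range queries over dates feasible.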


Property graph representations

One may argue that property graphs offer a natural fit for Wikidata, where we first tried to represent Wikidata in Neo4J analogous to the “direct representation” given in Figure 4, but we encountered a number of fundamental issues. Firstly, Neo4J only allows one value per attribute property: this could be circumvented by using list values or special predicates such as label_en. Secondly and more generally, Neo4J does not currently support queries involving joins or lookups on any information associated with edges, including edge ids, edge types or edge attributes.

Experimental Setting

We now describe the setting for our experiments. Note that further details, including scripts, queries, raw data, configuration settings, instructions, etc., are provided at https://dx.doi.org/10.6084/m9.figshare.3219217. 

Data: We use the Wikidata dump in JSON format from 2016/01/04

Machine: 2× Intel Xeon Quad Core E5-2609 V3 CPUs, 32MB Cache, 32GB of RAM, and 2× 2TB Seagate SATA hard-disks in a RAID-1 configuration.

Engines: Blazegraph 2.1.0, Virtuoso 7.2.3-dev.3215-pthreads, PostgreSQL 9.1.20, Neo4J-community-2.3.1 

Although Blazegraph supports a setting called RDF* to support reification [10], this model is non-standard and cannot directly encode multiple Wikidata statements with the same primary relation: in RDF*, each RDF triple must be unique. Though RDF* could be used to emulate named graphs, we instead use “Quad Mode” with standard SPARQL settings. 

<<But Blazegraph powers the current WDQS>>

We used the default indexes labelled PSOG, POGS, SP, OP and GS, where S, P and O correspond to the positions of a triple and G to the name of a graph. Each such index is sorted and allows for prefix lookups, ...
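The prefix-lookup idea behind these sorted indexes can be sketched as follows: quads reordered and sorted by, say, (P, S, O, G) let any query binding a prefix of that order (just P, or P and S, etc.) be answered with two binary searches. The toy data below is illustrative:

```python
# Sketch of prefix lookups on a sorted quad index such as PSOG.
from bisect import bisect_left, bisect_right

# Toy quads reordered as (p, s, o, g); a real index stores encoded IDs.
psog = sorted([
    ("P39", "Q35171", "Q11696", "st1"),
    ("P39", "Q23",    "Q11696", "st2"),
    ("P31", "Q42",    "Q5",     "st3"),
])

def prefix_lookup(index, prefix):
    """Return all tuples whose first len(prefix) fields equal prefix."""
    lo = bisect_left(index, prefix)
    # A sentinel larger than any ID bounds the matching range from above.
    hi = bisect_right(index, prefix + (chr(0x10FFFF),))
    return index[lo:hi]

matches_p = prefix_lookup(psog, ("P39",))          # all quads with P39
matches_ps = prefix_lookup(psog, ("P39", "Q23"))   # P and S both bound
assert len(matches_p) == 2 and len(matches_ps) == 1
```

A pattern binding S and O but not P, by contrast, matches no prefix of PSOG or POGS, which is why the choice of default index orders matters for the atomic-pattern experiments below.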

We used PostgreSQL 9.1.20 set with maintenance_work_mem = 1920MB and shared_buffers = 7680MB. A secondary index (i.e. B-tree) was set for each attribute that stores either entities, properties or data values (e.g. dates) from Wikidata. Specifically, in the statement table (Figure 3), each of the attributes s, p, o, and odate has an index, as do the attributes q, v, and vdate in the qualifier table. In all other tables, only foreign-key attributes were indexed, for example attribute e in the Label table.  

We used Neo4J-community-2.3.1 setting a 20GB heap. We used indexes to map from entity ids (e.g., Q42) and property ids (e.g., P1432) to their respective nodes. By default, Neo4J indexes nodes in the graph and their adjacent nodes in a linked-list style structure, such that it can navigate from a node to its neighbour(s) along a specific edge without having to refer back to the index.

Experimental Results

Atomic lookups: In our first experiments, we wish to test the performance of atomic lookups over Wikidata statements, where we focus in particular on qualified statements that require more complex representations to encode.

To begin, we generated a set of 1.5 million (data) quins from Wikidata, shuffled them, and used them to randomly generate 300 unique instances for each of the 31 patterns (we exclude the open pattern (?, ?, ?, ?, ?), which only has one instance). We select 300 since there are only 341 instances of (?, p, ?, ?, ?) (i.e., there are 341 unique properties used on qualified statements); hence 300 allows us to have the same number of instances per pattern. In total, we have 9,300 instances, which we translate into concrete queries for our six experimental representations. Since instances of patterns such as (?, p, ?, ?, ?) would generate millions of results, we set a limit of 10,000 results on all queries.
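The pattern-enumeration step can be sketched directly: a quin (s, p, o, q, v) with each position either bound to a constant or left as a variable gives 2^5 = 32 combinations, and excluding the fully open pattern leaves the 31 used here. The example quin below is illustrative:

```python
# Enumerate the 31 atomic patterns over a single data quin: each of the
# five positions is either kept as a constant or replaced by "?", and the
# all-variable pattern (?, ?, ?, ?, ?) is excluded.
from itertools import product

quin = ("Q35171", "P39", "Q11696", "P580", "1885")  # example data quin

patterns = []
for mask in product((True, False), repeat=5):        # True = keep constant
    if not any(mask):
        continue                                     # skip (?, ?, ?, ?, ?)
    patterns.append(tuple(c if keep else "?" for c, keep in zip(quin, mask)))

assert len(patterns) == 31
```

In the experiments, each of the 31 patterns is instantiated 300 times from shuffled data quins, yielding the 9,300 concrete queries.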

Figure caption: Mean response times for the 31 patterns, and mean response times for All patterns, where N, B, V and P refer to Neo4J, Blazegraph, Virtuoso and Postgres, resp.; w/Lab and wo/Lab refer to Neo4J representations with and without label information, resp.; NR, SR, SP, and NG refer to n-ary relations, standard reification, singleton properties and named graphs, resp.; one, two and three (red) dots on a bar indicate that ]0–33%], ]33–66%] and ]66–100%] of queries timed out, resp.; timeouts are counted as 60 seconds; the x-axis labels indicate the pattern and the mean number of complete results for the instances

We see that PostgreSQL performs best for all but three patterns, and is often an order of magnitude faster than the next closest result, most often in the lower range of query times from 1–100 ms, where we believe that the choice of client may have an effect: the direct client that PostgreSQL offers does not incur the overhead of REST incurred by other engines, which may be significant in this range. In addition, PostgreSQL can answer instances of some patterns with a mean in the 1–10 ms range, suggesting that PostgreSQL is perhaps not touching the hard-disk in such cases (an SATA hard-disk takes in the order of 10 ms to seek), and is making more efficient use of RAM than other engines.

For example, in the named graphs representation, which is most similar to the relational version, a quin is encoded with a quad (s, p, o, i) (a triple in named graph i) and a triple (i, q, v) (in the default graph). There is no special key position, but Virtuoso, with its default indexes – PSOG, POGS, SP, OP, GS – is best suited to cases where predicates are bound along with a subject or object value. The pattern (?, p, ?, ?, v) in named graphs then illustrates a nasty case for Virtuoso: if the engine starts with the much more selective v constant (starting with p would generate many more initial results), its only choice is to use the OP index (the only index where O is key) to get q from (i, q, v) using v, then POGS to get i from the same triple (i, q, v) using (q, v), then GS to get s from (s, p, o, i) using i, then SP to get p again from (s, p, o, i) using s, then PSOG to get o again from (s, p, o, i) using (s, p), and so we end up joining across all five indexes in Virtuoso instead of two in the case of PostgreSQL. Similar cases can be found for other patterns and representations, where the lack of explicit keys complicates performing joins. 

On the other hand, we see from the mean of All queries that the gap between Virtuoso and PostgreSQL closes somewhat: this is because Virtuoso is often competitive with PostgreSQL for more expensive patterns, even outperforming it for the least selective pattern (?, p, ?, ?, ?). These patterns with higher runtimes have a greater overall effect on the mean over all queries.

The atomic-lookup experiments show a clear distinction between two pairs of engines: Neo4J–Blazegraph and Virtuoso–PostgreSQL. Although PostgreSQL generally outperformed Virtuoso, these results may not hold for other types of queries. 

Basic graph patterns: ... we studied the 94 example queries present on the official Wikidata query service – queries submitted by the Wikidata community as exemplary use-cases of the service – to see what are the most common types of joins that appear. From this study, we identified that queries most commonly feature “s–s” joins – joins from subject to subject – and “s–o” joins – joins from subject to object, specifically on the primary relation. ... We thus designed a set of experiments to generate a large number of instances of basic graph patterns (bgps) that follow the same abstract structure as most commonly observed in the use-case queries.

The first query pattern, which we refer to as depth-1 “snowflake”, starts with a subject variable and adds between 1 and 6 edges to it. The predicate of each edge is bound, and each object is made a constant or a variable with equal probability. The subject variable is projected; object variables are projected or not with equal probability. 
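The generation procedure just described can be sketched in Python; the property and constant pools and the function name are illustrative, not from the paper:

```python
# Sketch of the depth-1 "snowflake" generator: a subject variable with
# 1-6 edges, bound predicates, and objects that are constants or variables
# with equal probability; object variables are projected with probability 1/2.
import random

def depth1_snowflake(properties, constants, rng=random):
    n_edges = rng.randint(1, 6)
    triples, projected = [], ["?s"]          # the subject variable is projected
    for k in range(n_edges):
        p = rng.choice(properties)           # predicates are always bound
        if rng.random() < 0.5:
            o = rng.choice(constants)        # constant object
        else:
            o = f"?o{k}"                     # variable object ...
            if rng.random() < 0.5:           # ... projected half the time
                projected.append(o)
        triples.append(("?s", p, o))
    return projected, triples

proj, bgp = depth1_snowflake(["P31", "P39"], ["Q5", "Q11696"])
assert 1 <= len(bgp) <= 6 and all(t[0] == "?s" for t in bgp)
```

The depth-2 variant described next simply picks one object variable of such a query and reuses it as the subject of a second, nested depth-1 pattern.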

The second query pattern, which we call depth-2 “snowflake”, adds a second layer to the query. More specifically, we start with a depth-1 query and then select an object variable at random to be used as the subject variable for a nested depth-1 bgp ...

For each setting, we run each batch of 300 queries sequentially and separately, clearing the cache and resetting the engine before each such batch. As before, each query is set to return a limit of 10,000 results and each engine is assigned an internal timeout of 60 seconds ...

We see that this time, Virtuoso performs best overall, where in particular the n-ary relations and named graph representations experience zero timeouts and outperform PostgreSQL (with a 15–19% timeout rate) by an order of magnitude in both experiments.

Virtuoso’s indexing scheme – which in a relational sense considers predicates as a form of key for complete quads – is well-tuned for this morphology of query and excels for these settings. In more depth, Virtuoso can perform prefix lookups, meaning that a pattern (s, p, ?) or (?, p, o) can be performed in one lookup on its PSOG/POGS indexes respectively.

Query features: The aforementioned 94 use-case queries (taken from the public query service) use a mix of query features, where from SPARQL 1.0, 45/94 queries use ORDER BY, 38/94 use OPTIONAL, 21/94 use LIMIT, and 10/94 use UNION; and from SPARQL 1.1, 27/94 use GROUP BY, 18/94 use recursive property paths (i.e., * or +, mainly for traversing type hierarchies), 10/94 use BIND, and 5/94 use negation in the form of NOT EXISTS. Likewise, queries often use FILTER and occasionally sub-queries, though usually relating to handling labels. In any case, we see that a broad range of SPARQL (1.1) features are often used.

The first issue was with recursive property paths. With respect to SPARQL, these queries cannot be expressed for SR and SP due to how the RDF is structured, and are only possible in NG assuming the default graph contains the union (or merge) of all named graphs.

Related Work

Conclusion

Our experiments revealed a number of strengths and weaknesses for the tested engines and representations in the context of Wikidata. In terms of where systems could be improved, we recommend that Blazegraph and Neo4J should better isolate queries such that one poorly performing query does not cause a domino effect. 

Likewise, our results show that Neo4J would benefit from better query-planning statistics and algorithms, as well as the provision of more customisable indexes, particularly on edge information. 

With respect to Virtuoso, it falls behind PostgreSQL in our first experiments due to the default indexing scheme chosen, but performs better in our second experiments based on real-world queries where predicates are bound and joins follow more “typical” patterns. 

On the other hand, PostgreSQL could benefit from better support for semi-structured information, where JSONB is indeed a step in the right direction. 

Regarding representations, we found that standard reification and named graphs performed best, with n-ary relations in third place, and singleton properties not being well supported.

... but based on our results for query performance, the best configuration for a Wikidata query service would probably (currently) use Virtuoso and named graphs: a combination that performed well in experiments and that supports the various query features needed, including property paths.

