Hernández, D., Hogan, A., & Krötzsch, M. (2015). Reifying RDF: What Works Well With Wikidata? In T. Liebig & A. Fokoue (Eds.), Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems co-located with the 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA, October 11, 2015 (Vol. 1457, pp. 32–47). CEUR-WS.org. http://ceur-ws.org/Vol-1457/SSWS2015_paper3.pdf
Abstract
We are motivated by the goal of representing Wikidata as RDF, which would allow legacy Semantic Web languages, techniques and tools – for example, SPARQL engines – to be used for Wikidata. However, Wikidata annotates statements with qualifiers and references, which require some notion of reification to model in RDF.
We thus investigate four such options: (1) standard reification, (2) n-ary relations, (3) singleton properties, and (4) named graphs.
We discuss the advantages and disadvantages of each option and empirically compare their load and query performance over five popular SPARQL implementations: 4store, BlazeGraph, GraphDB, Jena TDB and Virtuoso.
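The four options can be contrasted on a single toy statement. The sketch below, in plain Python tuples, shows one qualified statement under each scheme; the IRIs and predicate-naming conventions (e.g. the `s`/`v` suffixes for n-ary relations, `#1` for the singleton predicate) are illustrative, not the paper's exact vocabulary:

```python
# One primary relation plus one qualifier (all identifiers illustrative).
s, p, o = ":Q1", ":P6", ":Q2"   # primary relation (subject, predicate, object)
q, v = ":P580", '"2014"'        # qualifier property and value
i = ":stmt1"                    # statement identifier

# (1) Standard reification: a statement node described by four triples.
standard = [
    (i, "rdf:type", "rdf:Statement"),
    (i, "rdf:subject", s),
    (i, "rdf:predicate", p),
    (i, "rdf:object", o),
    (i, q, v),                  # qualifier attached to the statement node
]

# (2) n-ary relations: the object is reached via an intermediate node.
nary = [
    (s, p + "s", i),            # subject -> statement node
    (i, p + "v", o),            # statement node -> value
    (i, q, v),
]

# (3) Singleton properties: a fresh predicate unique to this statement.
singleton = [
    (s, p + "#1", o),
    (p + "#1", ":singletonPropertyOf", p),
    (p + "#1", q, v),           # qualifier attached to the singleton predicate
]

# (4) Named graphs: the triple sits in a graph named by the statement id.
named_graph = [
    (s, p, o, i),               # quad: fourth element is the graph name
    (i, q, v),                  # qualifier attached to the graph name
]
```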
Introduction
Indeed, the factual information of Wikidata corresponds quite closely with the RDF data model, where the main data item (entity) can be viewed as the subject of a triple and the attribute–value pairs associated with that item can be mapped naturally to predicates and objects associated with the subject. However, Wikidata also allows editors to annotate attribute–value pairs with additional information, such as qualifiers and references. Qualifiers provide context for the validity of the statement in question, for example providing a time period during which the statement was true. References point to authoritative sources from which the statement can be verified. About half of the statements in Wikidata (32.5 million) already provide a reference, and it is an important goal of the project to further increase this number.
... Named Graphs represents an extension of the traditional triple model, adding a fourth element; however, the notion of Named Graphs is well-supported in the SPARQL standard [10], and as “RDF Datasets” in RDF 1.1 [5].
Wikidata Data-model
However, the statement is also associated with some qualifiers and their values. Qualifiers are property terms such as start time, follows, etc., whose values may scope the validity of the statement and/or provide additional context. Additionally, statements are often associated with one or more references that support the claims and with a rank that marks the most important statements for a given property. The details are not relevant to our research: we can treat references and ranks as special types of qualifiers. We use the term statement to refer to a primary relation and its associated qualification ...
Conceptually, one could view Wikidata as a “Property Graph”: a directed labelled graph where edges themselves can have attributes and values [11,6]. A related idea would be to consider Wikidata as consisting of quins of the form (s, p, o, q, v), where (s, p, o) refers to the primary relation, q is a qualifier property, and v is a qualifier value ... All quins with a common primary relation would constitute a statement. However, quins of this form are not a suitable format for Wikidata since a given primary relation may be associated with different groupings of qualifiers.
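The grouping problem can be made concrete with invented data: two distinct statements that share a primary relation but carry different qualifier sets collapse into one ambiguous quin set.

```python
# Illustrative data, not from Wikidata: two statements share the primary
# relation ("s", "p", "o") but have different qualifier groupings.
stmt_a = (("s", "p", "o"), {("start", "1990"), ("end", "1995")})
stmt_b = (("s", "p", "o"), {("start", "2000")})

# Flatten each statement into quins of the form (s, p, o, q, v).
quins = {(*rel, q, v) for rel, quals in (stmt_a, stmt_b) for q, v in quals}

# The quin set no longer records whether "end: 1995" belongs with
# "start: 1990" or with "start: 2000": groupings with the same primary
# relation merge, so the two original statements cannot be recovered.
```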
.... For this reason, reification schemes based conceptually on quins – such as RDF* [12,11] – may not be directly suitable for Wikidata.
... we propose to view Wikidata conceptually in terms of two tables: one containing quads of the form (s, p, o, i), where (s, p, o) is a primary relation and i is an identifier for that statement; the other a triple table storing (i) primary relations that can never be qualified (e.g., labels) and thus do not need to be identified, (ii) triples of the form (i, q, v) that specify the qualifiers associated with a statement, and (iii) triples of the form (v, x, y) that further describe the properties of qualifier values.
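A minimal sketch of this quad/triple decomposition, with invented identifiers: qualifiable statements go to the quad table under a fresh id, while never-qualified relations and the qualifiers themselves go to the triple table.

```python
# Illustrative input: (primary relation, qualifier list or None).
# None marks a relation that can never be qualified (e.g., a label).
statements = [
    (("Q1", "P6", "Q2"), [("P580", "2014")]),  # qualifiable statement
    (("Q1", "label", "Crimea"), None),         # never-qualified relation
]

quads, triples = [], []
for n, (rel, quals) in enumerate(statements):
    if quals is None:
        triples.append(rel)                    # plain triple, no id needed
    else:
        i = f"stmt{n}"                         # fresh statement identifier
        quads.append((*rel, i))                # quad table row (s, p, o, i)
        triples.extend((i, q, v) for q, v in quals)  # qualifier rows (i, q, v)
```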
Compared to sextuples, the quad/triple schema only costs one additional tuple per statement, will lead to dense instances (even if some qualifiable primary relations are not currently qualified), and will not repeat the primary relation for each qualifier; conversely, the quad/triple schema may require more joins for certain query patterns (e.g., find primary relations with a follows qualifier)
From Higher Arity Data to RDF (and back)
The transformation from Wikidata to RDF can be seen as an instance of schema translation, where these questions then directly relate to the area of relative information capacity in the database community [14,17], which studies how one can translate from one database schema to another, and what sorts of guarantees can be made based on such a translation.
We require that any instance of Wikidata can be mapped to an RDF graph, and that any conjunctive query (CQ; a select-project-join query in SQL terms) over the Wikidata instance can be translated to a conjunctive query over the RDF graph that returns the same answers. We call such a translation query dominating.
With this scheme, we require 4k triples to encode k quads. This encoding can
be generalised to encode database tables of arbitrary arity, which is essentially the approach taken by the Direct Mapping of relational databases to RDF.
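The arithmetic can be sketched directly: reifying a row of arity n as a fresh node with one triple per column plus a type triple yields n + 1 triples, so a quad's three positions (subject, predicate, object) under a statement id give 4 triples, and k quads give 4k. The vocabulary below is illustrative, not the W3C Direct Mapping verbatim:

```python
# Sketch: reify one table row, Direct-Mapping style (illustrative terms).
def reify_row(row_id, columns, values, table=":Row"):
    """A row of arity n becomes n + 1 triples about a fresh row node."""
    triples = [(row_id, "rdf:type", table)]            # 1 type triple
    triples += [(row_id, c, v) for c, v in zip(columns, values)]  # n more
    return triples

# A quad (s, p, o) identified by :stmt1 yields 4 triples, so k quads -> 4k.
t = reify_row(":stmt1",
              ["rdf:subject", "rdf:predicate", "rdf:object"],
              [":Q1", ":P6", ":Q2"],
              table="rdf:Statement")
```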
Existing Reification Approaches
SPARQL Querying Experiments
To start with, we took the RDF export of Wikidata from Erxleben et al. [8] (2015-02-23), which was natively in an n-ary relation style format, and built the equivalent data for all four datasets. The number of triples common to all formats was 237.6 million.
Loading data: We selected five RDF engines for experiments: 4store, BlazeGraph, GraphDB, Jena and Virtuoso. The first step was to load the four datasets for the four models into each engine. We immediately started encountering problems with some of the engines. To quantify these issues, we created three collections of 100,000, 200,000, and 400,000 raw statements and converted them into the four models. We then tried to load these twelve files into each engine. ... where we see that: (i) even though different models lead to different triple counts, index sizes were often nearly identical: we believe that since the entropy of the data is quite similar, compression manages to factor out the redundant repetitions in the models; (ii) some of the indexes start with some space allocated; in fact, for BlazeGraph, the initial allocation of disk space (200 MB) was not affected by the data loads; (iii) 4store and GraphDB both ran into problems when loading singleton properties, where it seems the indexing schemes used assume a low number of unique predicates. With respect to point (iii), given that even small samples lead to blow-ups in index sizes, we decided not to proceed with indexing the singleton-properties dataset in 4store or GraphDB.
Benchmark queries: From two online lists of test-case SPARQL queries, we selected a total of 14 benchmark queries ... Since we need to create four versions of each query, one per reification model, we use an abstract quad syntax where necessary, which is expanded in a model-specific way such that the queries return the same answers over each model.
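The paper does not show its expansion rules verbatim, but one way such a model-specific expansion could look is sketched below: an abstract quad pattern (s, p, o, i) is rewritten into the triple or graph patterns each scheme needs, with the same illustrative naming conventions as before (`s`/`v` suffixes, a singleton-property link):

```python
# Sketch: expand an abstract quad pattern into per-model SPARQL patterns.
# Model names and predicate conventions are illustrative assumptions.
def expand(s, p, o, i, model):
    if model == "stdre":   # standard reification: statement node i
        return [f"{i} rdf:subject {s} .",
                f"{i} rdf:predicate {p} .",
                f"{i} rdf:object {o} ."]
    if model == "nary":    # n-ary relations: two hops through node i
        return [f"{s} {p}s {i} .", f"{i} {p}v {o} ."]
    if model == "sp":      # singleton properties: i is the predicate itself
        return [f"{s} {i} {o} .", f"{i} :singletonPropertyOf {p} ."]
    if model == "ng":      # named graphs: wrap the triple in GRAPH i
        return [f"GRAPH {i} {{ {s} {p} {o} }}"]
    raise ValueError(f"unknown model: {model}")
```

For example, `expand("?s", ":P6", "?o", "?i", "ng")` yields a single `GRAPH` pattern, while the same abstract quad under standard reification becomes three triple patterns, which hints at why join counts differ across models.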
Query results: For each engine and each model, we ran the queries sequentially (Q1–14) five times on a cold index. Since the engines had varying degrees of caching behaviour after the first run – which is not the focus of this paper – we present times for the first “cold” run of each query. Since we aim to run 14 × 4 × 5 × 5 = 1,400 query executions, to keep the experiment duration manageable, all engines were configured with a timeout of 60 seconds. Since different engines interpret timeouts differently (e.g., connection timeouts, overall timeouts, etc.), we considered any query taking longer than 60 seconds to run as a timeout.
For a given engine and query, we look at which model performed best, counting the number of firsts, seconds, thirds, fourths, failures (fa) and cases where the query could not be run (nr). ... The total column then adds the positions for all engines. From this, we see that, looking across all five engines, named graphs is probably the best supported (fastest in 17/70 cases), with standard reification and n-ary relations not far behind (fastest in 16/70 cases). All engines aside from Virtuoso seem to struggle with singleton properties; presumably these engines make some (arguably naive) assumption that the number of unique predicates in the indexed data is low.
Virtuoso provides the most reliable and performant results across all models. Jena failed to return answers for singleton properties, timing out on all queries. ... We see that both BlazeGraph and GraphDB managed to process most of the queries for the indexes we could build, but with longer runtimes than Virtuoso. In general, 4store struggled with the benchmark and returned few valid responses in the allowed time.