
Reifying RDF: What Works Well With Wikidata? - Article Reading

Hernández, D., Hogan, A., & Krötzsch, M. (2015). Reifying RDF: What Works Well With Wikidata? In T. Liebig & A. Fokoue (Eds.), Proceedings of the 11th International Workshop on Scalable Semantic Web Knowledge Base Systems co-located with the 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA, October 11, 2015 (Vol. 1457, pp. 32–47). CEUR-WS.org. http://ceur-ws.org/Vol-1457/SSWS2015_paper3.pdf

Abstract

We are motivated by the goal of representing Wikidata as RDF, which would allow legacy Semantic Web languages, techniques and tools – for example, SPARQL engines – to be used for Wikidata. However, Wikidata annotates statements with qualifiers and references, which require some notion of reification to model in RDF.

We thus investigate four such options: (1) standard reification, (2) n-ary relations, (3) singleton properties, and (4) named graphs. 
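To make the four options concrete, here is one qualified statement – say, ex:Obama held position ex:President with a start time qualifier of 2009 – under each representation, in Turtle/TriG syntax. This is our own sketch, not the paper's; all IRIs live under a hypothetical ex: namespace and the property names are illustrative.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/> .

    # (1) Standard reification: a resource describes the triple part by part
    ex:st1 a rdf:Statement ;
           rdf:subject   ex:Obama ;
           rdf:predicate ex:positionHeld ;
           rdf:object    ex:President ;
           ex:startTime  "2009" .

    # (2) n-ary relation: an intermediate node takes the place of the object
    ex:Obama ex:positionHeld ex:st1 .
    ex:st1   ex:value     ex:President ;
             ex:startTime "2009" .

    # (3) Singleton property: a fresh predicate used by exactly one triple
    ex:Obama ex:positionHeld-1 ex:President .
    ex:positionHeld-1 rdf:singletonPropertyOf ex:positionHeld ;
                      ex:startTime "2009" .

    # (4) Named graph (TriG): the triple sits in its own graph, and the
    #     qualifier is attached to the graph name in the default graph
    ex:g1 { ex:Obama ex:positionHeld ex:President . }
    ex:g1 ex:startTime "2009" .

In all four cases the same trick is at work: some term (ex:st1, ex:positionHeld-1, ex:g1) stands in for the statement so that qualifier triples have something to attach to.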

The representations are then benchmarked over five popular SPARQL implementations: 4store, BlazeGraph, GraphDB, Jena TDB and Virtuoso.

Introduction

Indeed, the factual information of Wikidata corresponds quite closely with the RDF data model, where the main data item (entity) can be viewed as the subject of a triple and the attribute–value pairs associated with that item can be mapped naturally to predicates and objects associated with the subject. However, Wikidata also allows editors to annotate attribute–value pairs with additional information, such as qualifiers and references. Qualifiers provide context for the validity of the statement in question, for example providing a time period during which the statement was true. References point to authoritative sources from which the statement can be verified. About half of the statements in Wikidata (32.5 million) already provide a reference, and it is an important goal of the project to further increase this number.

... Named Graphs represent an extension of the traditional triple model, adding a fourth element; however, the notion of Named Graphs is well supported in the SPARQL standard [10], and as “RDF Datasets” in RDF 1.1 [5].

Wikidata Data Model

However, the statement is also associated with some qualifiers and their values. Qualifiers are property terms such as start time, follows, etc., whose values may scope the validity of the statement and/or provide additional context. Additionally, statements are often associated with one or more references that support the claims and with a rank that marks the most important statements for a given property. The details are not relevant to our research: we can treat references and ranks as special types of qualifiers. We use the term statement to refer to a primary relation and its associated qualification ...
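Schematically, a qualified Wikidata statement looks something like the following (our sketch of the often-cited Douglas Adams example from the Wikidata documentation; the reference is abbreviated):

    statement: (Douglas Adams, educated at, St John's College)
      qualifier: start time = 1971
      qualifier: end time   = 1974
      reference: stated in  = (some source item)
      rank:      normal

The primary relation is the (subject, property, value) triple on the first line; everything indented annotates that statement as a whole.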

Conceptually, one could view Wikidata as a “Property Graph”: a directed labelled graph where edges themselves can have attributes and values [11,6]. A related idea would be to consider Wikidata as consisting of quins of the form (s, p, o, q, v), where (s, p, o) refers to the primary relation, q is a qualifier property, and v is a qualifier value ... All quins with a common primary relation would constitute a statement. However, quins of this form are not a suitable format for Wikidata since a given primary relation may be associated with different groupings of qualifiers.

... For this reason, reification schemes based conceptually on quins – such as RDF* [12,11] – may not be directly suitable for Wikidata.
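A small example of ours shows the problem. Grover Cleveland held the US presidency twice, so the same primary relation appears in two statements with different start/end qualifiers. Flattened into quins, the grouping is lost:

    (Cleveland, positionHeld, President, startTime, 1885)
    (Cleveland, positionHeld, President, endTime,   1889)
    (Cleveland, positionHeld, President, startTime, 1893)
    (Cleveland, positionHeld, President, endTime,   1897)

Nothing in the five positions records which start time pairs with which end time; an explicit statement identifier, as in the quad scheme below, is what restores the grouping.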

... we propose to view Wikidata conceptually in terms of two tables: one containing quads of the form (s, p, o, i) where (s, p, o) is a primary relation and i is an identifier for that statement; the other a triple table storing (i) primary relations that can never be qualified (e.g., labels) and thus do not need to be identified, (ii) triples of the form (i, q, v) that specify the qualifiers associated to a statement, and (iii) triples of the form (v, x, y) that further describe the properties of qualifier values.

Compared to sextuples, the quad/triple schema only costs one additional tuple per statement, will lead to dense instances (even if some qualifiable primary relations are not currently qualified), and will not repeat the primary relation for each qualifier; conversely, the quad/triple schema may require more joins for certain query patterns (e.g., find primary relations with a follows qualifier)
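Under the quad/triple scheme, the Cleveland example above becomes (with illustrative statement identifiers s1 and s2):

    quad table:
    (Cleveland, positionHeld, President, s1)
    (Cleveland, positionHeld, President, s2)

    triple table:
    (s1, startTime, 1885)
    (s1, endTime,   1889)
    (s2, startTime, 1893)
    (s2, endTime,   1897)

The two terms of office are now cleanly separated, and the primary relation is written once per statement rather than once per qualifier as in a sextuple encoding.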

From Higher Arity Data to RDF (and back)

The transformation from Wikidata to RDF can be seen as an instance of schema translation, where these questions then directly relate to the area of relative information capacity in the database community [14,17], which studies how one can translate from one database schema to another, and what sorts of guarantees can be made based on such a translation.

We require that any instance of Wikidata can be mapped to an RDF graph, and that any conjunctive query (CQ; select-project-join query in SQL) over the Wikidata instance can be translated to a conjunctive query over the RDF graph that returns the same answers. We call such a translation query dominating.

With this scheme, we require 4k triples to encode k quads. This encoding can be generalised to encode database tables of arbitrary arity, which is essentially the approach taken by the Direct Mapping of relational databases to RDF.
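A sketch of that generic encoding, on our reading (predicate names hypothetical): each quad is materialised as a fresh node with one triple per position, just as the W3C Direct Mapping gives each table row a node with one triple per column.

    @prefix ex: <http://example.org/> .

    # the quad (Cleveland, positionHeld, President, s1) as four triples
    _:q1 ex:subject   ex:Cleveland ;
         ex:predicate ex:positionHeld ;
         ex:object    ex:President ;
         ex:id        ex:s1 .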

Existing Reification Approaches

SPARQL Querying Experiments

To start with, we took the RDF export of Wikidata from Erxleben et al. [8] (2015-02-23), which was natively in an n-ary relation style format, and built the equivalent data for all four models. The number of triples common to all formats was 237.6 million.

Loading data: We selected five RDF engines for experiments: 4store, BlazeGraph, GraphDB, Jena and Virtuoso. The first step was to load the four datasets for the four models into each engine. We immediately started encountering problems with some of the engines. To quantify these issues, we created three collections of 100,000, 200,000, and 400,000 raw statements and converted them into the four models. We then tried to load these twelve files into each engine. ... where we see that: (i) even though different models lead to different triple counts, index sizes were often nearly identical: we believe that since the entropy of the data is quite similar, compression manages to factor out the redundant repetitions in the models; (ii) some of the indexes start with some space allocated, where in fact for BlazeGraph, the initial allocation of disk space (200 MB) was not affected by the data loads; (iii) 4store and GraphDB both ran into problems when loading singleton properties, where it seems the indexing schemes used assume a low number of unique predicates. With respect to point (iii), given that even small samples lead to blow-ups in index sizes, we decided not to proceed with indexing the singleton properties dataset in 4store or GraphDB.


Benchmark queries: From two online lists of test-case SPARQL queries, we selected a total of 14 benchmark queries ...


since we need to create four versions of each query for each reification model, we use an abstract quad syntax where necessary, which will be expanded in a model-specific way such that the queries will return the same answers over each model.
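As an illustration (our sketch; the paper's concrete abstract-quad syntax may differ), a quad pattern (ex:Cleveland, ex:positionHeld, ?o, ?i) joined with a qualifier pattern (?i, ex:startTime, ?t) would expand into the four models roughly as follows (prefixes shared across the four queries for brevity):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/>

    # (1) Standard reification
    SELECT ?o ?t WHERE {
      ?i rdf:subject ex:Cleveland ; rdf:predicate ex:positionHeld ;
         rdf:object ?o ; ex:startTime ?t .
    }

    # (2) n-ary relations
    SELECT ?o ?t WHERE {
      ex:Cleveland ex:positionHeld ?i .
      ?i ex:value ?o ; ex:startTime ?t .
    }

    # (3) Singleton properties: the singleton predicate ?p plays the role of ?i
    SELECT ?o ?t WHERE {
      ex:Cleveland ?p ?o .
      ?p rdf:singletonPropertyOf ex:positionHeld ; ex:startTime ?t .
    }

    # (4) Named graphs: qualifiers assumed to live in the default graph
    SELECT ?o ?t WHERE {
      GRAPH ?i { ex:Cleveland ex:positionHeld ?o . }
      ?i ex:startTime ?t .
    }

All four queries are conjunctive and should return the same (?o, ?t) bindings over the corresponding dataset, which is exactly the same-answers guarantee the authors ask of a translation.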

 

Query results: For each engine and each model, we ran the queries sequentially (Q1–14) five times on a cold index. Since the engines had varying degrees of caching behaviour after the first run – which is not the focus of this paper – we present times for the first “cold” run of each query. Since we aim to run 14 × 4 × 5 × 5 = 1,400 query executions, to keep the experiment duration manageable, all engines were configured for a timeout of 60 seconds. Since different engines interpret timeouts differently (e.g., connection timeouts, overall timeouts, etc.), we considered any query taking longer than 60 seconds to run as a timeout.

For a given engine and query, we look at which model performed best, counting the number of firsts, seconds, thirds, fourths, failures (fa) and cases where the query could not be run (nr). ... The total column then adds the positions for all engines. From this, we see that looking across all five engines, named graphs is probably the best supported (fastest in 17/70 cases), with standard reification and n-ary relations not far behind (fastest in 16/70 cases). All engines aside from Virtuoso seem to struggle with singleton properties; presumably these engines make some (arguably naive) assumptions that the number of unique predicates in the indexed data is low.

Virtuoso provides the most reliable/performant results across all models. Jena failed to return answers for singleton properties, timing out on all queries. ... We see that both BlazeGraph and GraphDB managed to process most of the queries for the indexes we could build, but with longer runtimes than Virtuoso. In general, 4store struggled with the benchmark and returned few valid responses in the allowed time.


