These works range from the publication of metadata vocabularies (e.g., PROV-O, the Dublin Core Metadata Initiative, and the Data Catalog Vocabulary) and their application in datasets, to the development of different metadata representation models (MRMs), metadata support in graph backends, and much more.
[Ontologies and vocabularies for triple metadata]
In the context of this paper, we focus on metadata representation models (MRM) for knowledge graphs and how data and metadata are connected in the same RDF store.
[KGs in RDF]
As a metadata representation model (MRM), we define a strategy for splitting an RDF triple t and its set of key-value metadata facts m into several triples or quads, such that metadata can be stored and queried, for each triple individually, in an RDF store. Handling data and metadata alongside each other is a challenging task: since more data has to be processed, stored, and indexed, overall system performance may suffer.
[Several reification solutions]
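To make the definition above concrete, the input to every MRM is a data triple t and a set of key-value metadata facts m. A minimal sketch (the identifiers and metadata keys are illustrative, borrowed from the Grover Cleveland presidency example used later in the paper):

```python
# A data triple t and its key-value metadata facts m, as every MRM
# receives them. Names and metadata keys are hypothetical placeholders.

t = (":GroverCleveland", ":positionHeld", ":President")
m = {":startDate": '"1885"', ":endDate": '"1889"'}

# An MRM is then essentially a function mapping (t, m) to a set of
# triples or quads that an RDF store can index and query.
print(t, m)
```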
SPARQL & RDF store benchmarks: LUBM [4], BSBM [18], and SP2Bench [43], ... Linked Data Benchmark Council (LDBC) [RDF benchmarks do not consider reification]
However, to the best of our knowledge, there is no benchmark that generates metadata-rich datasets or queries over fine grained meta information.
Metadata handling & extensions in RDF: [reification proposals]
MRM studies: [comparative evaluations of reification proposals]
3.1.1. Metadata usage analysis
We were not able to find any dataset using the Singleton Property MRM. As for RDF reification, we did not find any dataset that uses rdf:Statement. Searching for the standard reification predicates allowed us to identify 15 datasets in which the numbers of occurrences of rdf:subject, rdf:predicate, and rdf:object differed, which indicates incorrect usage of standard reification or inconsistencies in the datasets or indexes.
[Rarely used in practice]
3.1.2. Metadata datasets analysis
Yago 3: a prominent knowledge base extracted from Wikipedia and other sources [28]. It stores metadata and provenance information per triple using a non-standard way of assigning triple ids via Turtle comments. The ids are associated with metadata in the same way as in the other MRMs. While a source URL and extraction technique are recorded for almost every triple, metadata from other dimensions (e.g., geolocation, time) is only available for a very small subset of triples.
Wikidata: Since Wikidata is used as one of the evaluation datasets, a detailed explanation is given in Section 5. For the sake of completeness of the metadata analysis, we note that Wikidata is characterized by diverse metadata for individual facts, where more than one third of the metadata is dedicated to the temporal dimension.
3.2.2. Granularity of metadata
Dataset/Graph level
Entity/Resource level
Triple/Statement level
[Can context also be specified at these levels? So far I have only covered triple/statement and entity/resource. How to include it at the dataset, graph, or subgraph level?]
3.2.3. Query characteristics
... Therefore, an MRM evaluation should consider both data-only queries and also mixed (data-metadata) queries.
3.3.1. Criteria for metadata representations
- Storage cost - evaluates the size of the MRM and its factorization support; measured by triple count, as well as serialized file size and overall database size in bytes.
- Data-only query overhead/impact - measures query time in ms for data-only queries under each MRM, compared to a baseline query/dataset without MRM-specific triples.
- Mixed (metadata and data) query execution time - for a set of query templates over data and metadata, we compare the execution times (in ms).
- Usability - we compare the number of variables, triple patterns, and additional SPARQL elements necessary to query a single triple with a metadata fact, as indicators of the query usability/complexity of an MRM.
3.3.2. Criteria for metadata extension and SPARQL implementations
- Bulk load - evaluates (meta)data bulk-loading capabilities of the stores; measured in milliseconds.
- SPARQL integration/conformance - evaluates the integration of store-specific metadata extensions into SPARQL and validates whether stores handle all SPARQL queries correctly.
3.3.3. Additional criteria
- Backward compatibility of queries - evaluates whether data-only queries still work after the addition of metadata, without the need to rewrite them.
[Impact on data queries depending on how the metadata is modelled]
4. Metadata Representation Models
4.1. RDF compliant models
4.1.1. Named graphs (ngraphs)
The Named Graph feature, supported by every SPARQL 1.1 compliant graph backend, allows one IRI to be assigned to one or more triples as a graph id. The same IRI can then be used as the subject of a metadata entity, which stores the metadata about the associated triple(s) as predicates and objects. The ngraphs MRM is easy to understand, since it simply builds on top of the existing triple format. Hence, it is possible to reuse existing data queries.
The major drawback of the ngraphs model is that it uses the named graph IRI as the URI of a metadata resource, which itself stores a set of key-value metadata facts. If the original dataset already uses named graphs to store the data facts, the ngraphs MRM cannot be applied in a backward-compatible way.
[Using named graphs for triple-level reification prevents using them for other purposes]
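A minimal sketch of the ngraphs idea, with triples modeled as plain Python tuples (the graph IRI `:g1` and the metadata keys are hypothetical, not from the paper):

```python
# Illustrative sketch of the ngraphs MRM: the data triple is stored as a
# quad inside a named graph, and the key-value metadata facts are attached
# to the graph IRI, which doubles as the metadata resource.

def ngraphs(triple, meta, graph_iri):
    """Return (quads, metadata_triples) for one data triple."""
    s, p, o = triple
    quads = [(s, p, o, graph_iri)]                  # data stays one quad
    meta_triples = [(graph_iri, k, v) for k, v in meta.items()]
    return quads, meta_triples

quads, meta = ngraphs(
    (":GroverCleveland", ":positionHeld", ":President"),
    {":startDate": '"1885"', ":endDate": '"1889"'},
    ":g1",  # hypothetical graph IRI acting as the metadata resource
)
print(quads)  # [(':GroverCleveland', ':positionHeld', ':President', ':g1')]
```

Note how the data part remains a single (quad) pattern, which is why existing data queries can be reused; the cost is that the named graph slot is now occupied by the reification mechanism.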
4.1.2. RDF standard reification (stdreif)
As specified in the RDF standard, it is possible to create a resource which describes a triple and its subject, predicate, and object. The resource IRI can then be used to connect provenance or meta information with the triple.
Compared to the ngraphs MRM, it is not possible to reuse existing data queries out of the box.
This not only increases the dataset size but also adds more triple patterns to queries. All four components of a reified resource have to be used as triple patterns in order to find the correct reified triple in the dataset.
[Triple volume and query complexity]
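The four-triples-per-statement overhead can be sketched as follows (statement IRI and metadata keys are illustrative):

```python
# Illustrative sketch of RDF standard reification: one data triple expands
# into four triples describing a statement resource, plus the metadata
# facts attached to that resource.

RDF = "rdf:"

def stdreif(triple, meta, stmt_iri):
    s, p, o = triple
    triples = [
        (stmt_iri, RDF + "type", RDF + "Statement"),
        (stmt_iri, RDF + "subject", s),
        (stmt_iri, RDF + "predicate", p),
        (stmt_iri, RDF + "object", o),
    ]
    triples += [(stmt_iri, k, v) for k, v in meta.items()]
    return triples

out = stdreif(
    (":GroverCleveland", ":positionHeld", ":President"),
    {":startDate": '"1885"'},
    ":st1",  # hypothetical statement IRI
)
print(len(out))  # 5: four reification triples + one metadata fact
```

This mirrors the point above: locating one reified triple requires matching the rdf:subject, rdf:predicate, and rdf:object patterns, so queries grow along with the dataset.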
4.1.3. N-ary relation (naryrel)
In this MRM, a relationship instance is created as a resource for the subject-predicate pair and takes the place of the object of the triple.
Similar to standard reification, an IRI can be used to access data and metadata; in the naryrel MRM, this is the relation resource IRI. Due to the introduction of the relation resource IRI, one additional statement per triple value has to be added. Existing data queries cannot be reused, but compared to standard reification, fewer triple patterns are required to access either data values or metadata. Furthermore, it is possible to support datasets with named graphs.
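A sketch of the n-ary relation layout, again with tuples (the relation IRI `:rel1` and the `:value` predicate are hypothetical names chosen for illustration):

```python
# Illustrative sketch of the n-ary relation MRM: the object of the data
# triple is replaced by a relation resource, which then carries the
# original value and the metadata facts.

def naryrel(triple, meta, rel_iri, value_pred=":value"):
    # value_pred is a hypothetical predicate linking the relation
    # resource to the original object.
    s, p, o = triple
    triples = [(s, p, rel_iri), (rel_iri, value_pred, o)]
    triples += [(rel_iri, k, v) for k, v in meta.items()]
    return triples

out = naryrel(
    (":GroverCleveland", ":positionHeld", ":President"),
    {":startDate": '"1885"'},
    ":rel1",  # hypothetical relation resource IRI
)
print(len(out))  # 3: two data-path triples + one metadata fact
```

Compared with the four fixed reification triples of stdreif, only two triple patterns are needed to reach either the value or the metadata, which matches the lower query complexity noted above.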
4.1.4. Singleton Property (sgprop)
The singleton property [34] scheme uses a unique property for every triple with associated metadata.
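The per-triple property minting can be sketched as follows (the `#n` suffix convention is illustrative; the linking predicate follows the singleton property proposal [34]):

```python
# Illustrative sketch of the singleton property MRM: a unique property is
# minted for each metadata-carrying triple and linked back to the
# original predicate; metadata is attached to the singleton property.

def sgprop(triple, meta, n, singleton_of="rdf:singletonPropertyOf"):
    s, p, o = triple
    p_n = f"{p}#{n}"  # hypothetical scheme for a globally unique property
    triples = [(s, p_n, o), (p_n, singleton_of, p)]
    triples += [(p_n, k, v) for k, v in meta.items()]
    return triples

out = sgprop(
    (":GroverCleveland", ":positionHeld", ":President"),
    {":startDate": '"1885"'},
    1,
)
print(out[0])  # (':GroverCleveland', ':positionHeld#1', ':President')
```

Since every annotated statement gets its own property, the number of distinct properties grows linearly with the data, which is exactly the distribution problem discussed for the companion properties model below.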
4.1.5. Companion Properties (cpprop)
As was shown in [22], the singleton property representation model suffers from the fact that it creates a new property for every statement in order to obtain globally unique properties. This results in a highly unusual, near-uniform distribution and a large number of properties, and therefore in increased query times. To limit the creation of new properties and to reduce the influence on the dataset's property (frequency) distribution, we propose a novel MRM: companion properties.
4.2. Vendor Specific models
4.2.1. Blazegraph
The Java-based graph store Blazegraph offers a feature called Reification Done Right. It provides an implementation of SPARQL* and RDF*.
Multiple reifications of the same triple are translated into a single standard W3C reification rdf:Statement; therefore, it is no longer possible to distinguish grouped annotations (e.g., when a confidence score and the tool that produced the data and the score are stored as individual values, the confidence values only make sense in the scope of that tool).
RDR Implementation: The reified statement is embedded directly into the representation of each statement about that reified statement. This is achieved by using indices with variable lengths and recursively embedded encodings of the subject and object of a statement.
4.2.2. Virtuoso
In order to use Virtuoso for provenance and metadata scenarios, the RDF-compatible MRMs have to be utilized.
4.2.3. Others
Other RDF store providers have created extensions for storing and retrieving metadata more efficiently. AllegroGraph supports the handling of metadata with its Direct Reification feature, which uses statement identifiers. Stardog allows for the support of metadata by introducing a statement identifier, which is also used to support property graphs. Both systems provide a proprietary SPARQL extension to query a statement identifier.
[Stardog has edge ids]
5. Evaluation Datasets with Metadata
Wikidata: Wikidata is not a native RDF dataset; instead, it uses its own Wikidata Statement Model [10]. Claims (which are similar to triples/statements in RDF) can be described with so-called qualifiers consisting of keys and values (analogous to statement-level metadata in RDF). Qualifiers are used to provide context or scope for a claim (e.g., how long the marriedTo relation between two persons is valid). In contrast to this factual metadata, there is also the concept of references, which record provenance for claims.
[WD's hyper-relational model]
Since metadata is modelled at the statement level, a statement id is kept for every claim (data triple), but only 1.5 million claims have qualifiers. The 2.1 million qualifiers are based on 953 distinct qualifier keys .... In the excerpt below, a shortened example of two different claims with metadata about the presidency of Grover Cleveland is given.
Wikipedia history and DBpedia: Since DBpedia data does not come with diverse metadata, we decided to apply the Wikipedia revision history on top of a company-focused dataset.
6.3. Wikidata Scenario
For this work, we extended the scenario described in [22] by evaluating the Blazegraph feature RDR, which had not yet been studied for the Wikidata use case. In order to circumvent the RDR limitation that every triple must be unique, we do not attach the metadata directly to the data triple. Instead, we use (multiple) statement identifiers as metadata, which are then linked to the actual metadata. This is necessary to model the Grover Cleveland example shown in Section 5.
The technique is illustrated in the following example:
<< :s :p :o >> :hasMeta :id1 ; :hasMeta :id2 .
[The problem with RDF-star is that it is not a multigraph]
6.3.2. Wikidata Templates (Quins Experiment)
A quin represents a data-metadata look-up query in which, for a data triple pattern s,p,o, the attached metadata key k and its corresponding values v are queried. For this quin pattern (s,p,o,k,v), the authors defined 31 templates based on 31 binary masks of length 5 (from (0,0,0,0,1) to (1,1,1,1,1)), which define whether the corresponding position of the quin acts as a constant or as a variable in the query.
[Can I use this pattern to test the context queries? But would there be more positions, or do k and p derive from the contextual dimension?]
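The template generation described above can be sketched directly; the paper only says each mask bit decides constant vs. variable, so treating 1 as "variable" is an assumption here:

```python
# Sketch of quin template generation: 31 binary masks of length 5 over
# the positions (s, p, o, k, v) decide which positions stay constants
# and which become query variables. Assumption: bit 1 means variable.

from itertools import product

POSITIONS = ("s", "p", "o", "k", "v")

def quin_templates():
    templates = []
    for mask in product((0, 1), repeat=5):
        if mask == (0, 0, 0, 0, 0):
            continue  # the all-constant mask is excluded, leaving 31
        pattern = tuple(
            f"?{name}" if bit else f":{name}"
            for name, bit in zip(POSITIONS, mask)
        )
        templates.append(pattern)
    return templates

print(len(quin_templates()))  # 31
```

The first mask (0,0,0,0,1) then yields a template where only the metadata value v is a variable, and (1,1,1,1,1) yields the fully unbound look-up.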
7.1. Factorization study and qualitative MRM comparison
Factorization: As already highlighted, the MRMs ngraphs and cpprop are capable of efficiently storing all granularity levels. However, the remaining MRMs do not support factorization, since they are intended for statement level granularity.
Query Complexity/Usability: As outlined in Section 4, the data layout and the query structure differ between MRMs; hence the query complexity has to be analyzed for each MRM. First, the number of triple patterns and the number of extra SPARQL elements are considered. This gives an indication of how easy it is to read, understand, and create a search query, and hence it may influence how readily an MRM is adopted by practitioners.
8. Conclusion and Future Work
Additionally, we introduced a novel metadata representation model called Companion Properties, which proved to be a good alternative to existing triple-based MRMs for DBQ queries, even outperforming ngraphs in Stardog.
The results clearly show that ngraphs outperforms the other MRMs for challenging mixed queries, which confirms the results presented in [22] for templates more complex than the quin queries. As long as the use case or source dataset does not require the usage of quads, ngraphs is the most suitable solution.
[The other problem with named graphs is SPARQL property path queries]