Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs - Paper Reading Notes
Sikos, L.F., Philp, D. Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs. Data Sci. Eng. 5, 293–316 (2020). https://doi.org/10.1007/s41019-020-00118-0
Abstract
... However, the standard used for representing these triples, RDF, inherently lacks the mechanism to attach provenance data, which would be crucial to make automatically generated and/or processed data authoritative.
[Provenance is a type of contextual meta-information, because the represented fact depends on its source to be interpreted as valid.]
This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of standard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This can be used to advance existing solutions and help implementers to select the most suitable approach (or a combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs.
[Whatever is proposed for provenance in RDF can also be evaluated for other kinds of context.]
Introduction to RDF Provenance
These benefits make RDF appealing for a wide range of applications; however, RDF has shortcomings when it comes to encapsulating metadata to statements. With the proliferation of heterogeneous structured data sources, such as triplestores and LOD datasets, capturing data provenance, i.e., the origin or source of data, and the technique used to extract it, is becoming more and more important, because it enables the verification of data, the assessment of reliability, the analysis of the processes that generated the data, decision support for areas such as cybersecurity, cyberthreat intelligence, and cyber-situational awareness, and helps express trustworthiness, uncertainty, and data quality.
[Motivation: hyper-relational KGs for integrating multiple data sources.]
Yet, the RDF data model does not have a built-in mechanism to attach provenance to triples or elements of triples. Consequently, representing provenance data with RDF triples is a long-standing, non-trivial problem. While the World Wide Web Consortium (W3C) suggested RDF extensions in 2010 to support provenance in the upcoming version of the standard, none of these was implemented in the next release, RDF 1.1, in 2014.
[Reification, the W3C standard, other proposals, RDF*.]
Formal Representation of RDF Data Provenance
RDF reification refers to making an RDF statement about another RDF statement by instantiating the rdf:Statement class and using the rdf:subject, rdf:predicate, and rdf:object properties of the standard RDF vocabulary to identify the elements of the triple. It is the only standard syntax to capture RDF provenance and the only syntax with which all RDF systems are compatible.
However, reification has no formal semantics, and leads to a high increase in the number of triples, hence, it does not scale well. After all, reification requires a statement about the subject, another statement about the predicate, and a third statement about the object of the triple, plus at least one more statement that captures provenance, i.e., the number of statements in the dataset will increase at least four times. This “triple bloat” is one of the main reasons for the unpopularity of reification.
[Reification and the problem of the increased number of triples.]
Moreover, reification has another shortcoming: writing queries to retrieve statement-level provenance data is cumbersome, because an additional subexpression has to complement the provenance-related subexpressions in queries to be able to match the reification triples. For these reasons, some proposed that reification be deprecated.
[Reification and the problem of SPARQL query complexity.]
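To make both shortcomings concrete, here is a minimal sketch in plain Python, with triples modeled as tuples. The `ex:` names, the `ex:source` provenance property, and the helper functions are assumptions made for illustration; they do not come from the paper. Counting the `rdf:type` triple, one fact expands to five triples, and looking up its source requires matching the reification triples first.

```python
# Hypothetical prefixes and resource names, used only for illustration.
RDF = "rdf:"
EX = "ex:"

def reify(triple, stmt_id, source):
    """Return the reification triples for `triple` plus one provenance triple."""
    s, p, o = triple
    return [
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
        (stmt_id, EX + "source", source),  # assumed provenance property
    ]

def sources_of(triples, fact):
    """Find the sources of `fact` by matching its reification triples first."""
    s, p, o = fact
    known = set(triples)
    sources = []
    for stmt_id, pred, obj in triples:
        if pred != RDF + "type" or obj != RDF + "Statement":
            continue
        # The extra matching step that a direct triple lookup would not need.
        needed = {
            (stmt_id, RDF + "subject", s),
            (stmt_id, RDF + "predicate", p),
            (stmt_id, RDF + "object", o),
        }
        if needed <= known:
            sources += [v for (i, q, v) in triples
                        if i == stmt_id and q == EX + "source"]
    return sorted(sources)

fact = (EX + "Alice", EX + "worksFor", EX + "ACME")
reified = reify(fact, EX + "stmt1", EX + "hrDatabase")
print(len(reified), sources_of(reified, fact))
```

The same matching step is what makes the equivalent SPARQL query longer: the pattern must bind the statement node through `rdf:subject`, `rdf:predicate`, and `rdf:object` before it can reach the provenance property.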
The other approach suggested by the W3C for attaching additional attributes, including provenance, to RDF triples is called n-ary relations, which provides a mechanism to describe the relationship of an individual with more than one other individual or data type value ...
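A rough sketch of the n-ary relation pattern, again with triples as Python tuples: the relation becomes an intermediate node, and the extra attributes hang off that node. The `ex:` namespace, the `ex:value` property, and the attribute names are hypothetical and chosen only for this example.

```python
EX = "ex:"  # hypothetical namespace prefix

def nary_relation(subject, predicate, value, relation_node, **attributes):
    """Model `subject predicate value` through an intermediate relation node,
    so that extra attributes (such as provenance) can be attached to it."""
    triples = [
        (subject, predicate, relation_node),
        (relation_node, EX + "value", value),
    ]
    for name, attr_value in sorted(attributes.items()):
        triples.append((relation_node, EX + name, attr_value))
    return triples

triples = nary_relation(
    EX + "Alice", EX + "employment", EX + "ACME",
    relation_node="_:rel1",          # a blank node standing for the relation
    source=EX + "hrDatabase",        # assumed provenance attribute
    recordedOn="2020-01-15",         # assumed temporal attribute
)
for t in triples:
    print(t)
```

Note that the original binary triple no longer appears directly in the data; consumers have to know the pattern to recover it.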
Provenance Granularity
The following six levels of provenance granularity can be differentiated from coarse-grained to fine-grained, depending on the smallest set of represented information for which provenance can be defined:
1. Dataset-level provenance: the provenance of Linked (Open) Data datasets. Every statement is globally dereferenceable.
2. RDF document-level provenance: the provenance of RDF statements stored in the same file.
3. Graph-level provenance: statements are made to capture the provenance of named graphs, whose URIs are utilized in quadruples to declare coarse provenance information. ...
4. Molecule-level provenance: RDF molecules introduce a granularity level finer than named graphs but coarser than triples, constituting the finest components of lossless RDF graph decomposition for provenance tracking situations when graph-level provenance would result in low recall and triple-level provenance in low precision. ...
5. Triple-level provenance: provenance information is provided for RDF triples. This is the most common provenance level for RDF data, because it can represent the provenance of statements, which is adequate for a number of applications.
6. Element-level provenance: fine-grained provenance that makes it possible to track how individual elements of RDF triples have been derived from other RDF triple elements. Many mechanisms to capture provenance cannot assign provenance to arbitrary statement elements, i.e., subjects, predicates, and objects of RDF triples, only to one of them. Statement-element-level provenance is useful for representing various claims of disputed or uncertain information from diverse sources.
[Different levels of granularity for attaching provenance data. I am only dealing with the "triple" / fact level; perhaps hyper-relational nesting could handle the coarser levels above it.]
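The difference between graph-level and triple-level provenance can be sketched with quadruples in plain Python. The graph URIs, sources, and dates below are invented for illustration. Provenance is stated once about the graph URI, so every triple in that graph shares it; that is why this level is coarse.

```python
from collections import Counter

# Quads: (subject, predicate, object, graph). All names are hypothetical.
quads = [
    ("ex:Alice", "ex:worksFor", "ex:ACME", "ex:graphHR"),
    ("ex:Alice", "ex:livesIn", "ex:Lisbon", "ex:graphHR"),
    ("ex:Bob", "ex:worksFor", "ex:ACME", "ex:graphWeb"),
]

# Graph-level provenance: statements about the graph URI itself.
graph_provenance = {
    "ex:graphHR": {"source": "ex:hrDatabase", "retrieved": "2020-01-15"},
    "ex:graphWeb": {"source": "ex:webCrawl", "retrieved": "2020-02-01"},
}

def provenance_of(triple):
    """Return the provenance record of every graph that contains `triple`."""
    return [graph_provenance[g] for (s, p, o, g) in quads if (s, p, o) == triple]

def triples_per_graph():
    """Count how many triples share each graph-level provenance record."""
    return dict(Counter(g for (_, _, _, g) in quads))

print(provenance_of(("ex:Bob", "ex:worksFor", "ex:ACME")))
print(triples_per_graph())
```

Placing each triple in its own graph would turn this into triple-level provenance, at the cost of one graph URI per statement.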
Knowledge Organization Systems for Provenance
Knowledge organization systems designed for working with RDF data provenance include purpose-built and related controlled vocabularies and ontologies.
[Qualifiers mapped to URIs of properties defined in provenance ontologies. A KG does not always have a modeled or reused ontology.]
Provenance Vocabularies and Ontologies
Various vocabularies and ontologies are available for representing specific types and aspects of provenance information, such as attributes, characteristics, licensing, versioning [79], proof [80], and entailment. These include upper ontologies, which can be used across knowledge domains, domain ontologies, which provide provenance terms for specific knowledge domains, and provenance-related ontologies, which define terms often captured together with provenance, such as trust and licensing information.
Upper Ontologies for Provenance
There is a wide range of domain-agnostic ontologies to represent generic provenance data. The core data model for provenance, PROV, was standardized in 2013 by the W3C ...
Domain Ontologies for Provenance
Domain-aware provenance ontologies can be used to represent provenance for specific knowledge domains, e.g., broadcasting, workflows, and scientific processes.
Provenance-Related Ontologies
Dublin Core (DC), standardized in ISO 15836-1:2017, is a set of 25 elements that can be broadly classified as provenance-related, including one generic term, namely, provenance, and terms of three specific provenance categories: terms that capture who effected a change (contributor, creator, publisher, rightsHolder), terms to answer questions about when a change was effected (available, created, date, dateAccepted, dateCopyrighted, dateSubmitted, issued, modified, valid), and terms that can be used to describe how a change was effected (isVersionOf, hasVersion, isFormatOf, hasFormat, license, references, isReferencedBy, replaces, isReplacedBy, rights, source). The DC Terms can partially be mapped to PROV-O terms [95]. Creative Commons is an RDFS ontology for describing licensing information, some of which is provenance-related.
Provenance-Aware RDF Data in Graph Databases
AllegroGraph ... support for additional fields at the triple level, making it possible to define permissions, trust, and provenance data for source tracking, quality evaluation, and access control. AllegroGraph supports a format called Extended N-Quad, or NQX for short, which extends the standard N-Quads format to allow the specification of optional attributes for each triple or quad in JSON format. NQX allows an arbitrary number of attributes and an arbitrary number of attribute values, with a maximum attribute size limited only by the amount of available virtual memory and address space (theoretically up to approximately 1TB).
[RDF and NG (named graph) models can have key-value pairs associated with triples / quads.]
Neo4j is a graph database that employs a property graph model. This model allows the definition of properties for both nodes and relationships, and labels to assign roles or types to nodes. These features are suitable for, among others, storing provenance data, as evidenced by implementations such as the CAPS framework [103] and MITRE’s provenance management software, PLUS.
[LPG allows properties on edges.]
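As a rough illustration of why a labeled property graph suits provenance, the sketch below models nodes and relationships as plain Python dataclasses. This is not Neo4j's API; the class names, labels, and property keys are assumptions for the example. The point is that provenance sits directly on the relationship, with no auxiliary statements.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    labels: set = field(default_factory=set)          # roles or types of the node
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: str
    rel_type: str
    end: str
    properties: dict = field(default_factory=dict)    # provenance can live here

alice = Node("n1", {"Person"}, {"name": "Alice"})
acme = Node("n2", {"Company"}, {"name": "ACME"})

# The fact and its provenance are a single edge with key-value properties.
works_for = Relationship(
    alice.node_id, "WORKS_FOR", acme.node_id,
    {"source": "hrDatabase", "recorded": "2020-01-15"},
)

print(works_for.rel_type, works_for.properties["source"])
```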
OpenLink Virtuoso supports additional metadata to be stored with RDF triples, which can be used for representing provenance data [104]. However, adding provenance data to triples in Virtuoso is not trivial, because it requires a mechanism that extends the standard SPARQL query syntax [105].
Provenance-Aware LOD Datasets
The provenance information provided by named graphs indicates the current location of data, or the data source described by provenance graphs [111], but does not hold information about the behavior of processes that interact with Linked Data, which can be captured using additional syntax and semantics only [112].
Tracking data provenance may require both generic and domain-specific provenance data to support future reuse via querying, and provenance traces from diverse resources often require preservation and interconnection to support future aggregation and comparison [118]. Provenance-aware Linked Data querying consists of a workload query and a provenance query [119], which can be executed with various strategies, such as the following [120]:
- Post-filtering: the independent execution of the workload and provenance query;
- Query rewriting: the execution of the provenance query precedes the workload query, making it possible to utilize context values returned by the provenance query to filter out those tuples that do not conform to the provenance results;
- Full materialization: the provenance query is executed on the entire dataset or any relevant subset of it, and materializes all tuples whose context values satisfy the provenance query;
- Pre-filtering: a dedicated provenance index collocates, for each context value, the identifiers of the tuples that belong to that context;
- Adaptive partial materialization: introduces a trade-off between the performance of the provenance query and that of the workload query.
[How to handle the stored provenance metadata.]
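Three of the strategies listed above can be sketched over an in-memory list of quads. This is a simplified reading of the strategies, not the implementation evaluated in [120]; the data, the `context_source` mapping, and all function names are assumptions for illustration. The context value here is the fourth element of each quad.

```python
# Tuples carry a context value as their fourth element. Names are hypothetical.
quads = [
    ("ex:Alice", "ex:worksFor", "ex:ACME", "ex:g1"),
    ("ex:Bob", "ex:worksFor", "ex:ACME", "ex:g2"),
    ("ex:Carol", "ex:worksFor", "ex:Initech", "ex:g3"),
]
context_source = {"ex:g1": "ex:hrDatabase", "ex:g2": "ex:webCrawl", "ex:g3": "ex:hrDatabase"}

def provenance_query(source):
    """Return the context values whose provenance matches `source`."""
    return {ctx for ctx, src in context_source.items() if src == source}

def workload_query(data, predicate):
    """Return the tuples that match the workload, ignoring provenance."""
    return [q for q in data if q[1] == predicate]

def post_filtering(predicate, source):
    # Run both queries independently, then keep results with an allowed context.
    results = workload_query(quads, predicate)
    allowed = provenance_query(source)
    return [q for q in results if q[3] in allowed]

def query_rewriting(predicate, source):
    # Run the provenance query first; its contexts become a filter in the workload scan.
    allowed = provenance_query(source)
    return [q for q in quads if q[3] in allowed and q[1] == predicate]

def pre_filtering(predicate, source):
    # Build an index from context value to tuple positions, then read only those tuples.
    index = {}
    for position, quad in enumerate(quads):
        index.setdefault(quad[3], []).append(position)
    positions = sorted(i for ctx in provenance_query(source) for i in index.get(ctx, []))
    return [quads[i] for i in positions if quads[i][1] == predicate]

print(post_filtering("ex:worksFor", "ex:hrDatabase"))
```

All three return the same tuples; they differ in when the provenance constraint is applied, which is what drives the performance differences on large datasets.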
Querying RDF Provenance
Software Tools for Manipulating RDF Provenance
Provenance Applications
Performance Comparison of RDF Data Provenance Approaches
The syntactic differences between the presented approaches and techniques are not always accompanied by semantic differences. ... possible to convert provenance-aware RDF data between them without losing semantics, as long as the datatype and value range are not stricter for one than the other (e.g., N-Quads and named graphs).
Comparing the performance of RDF data provenance approaches is not trivial ....
Experiments
The following types of domain-independent provenance queries were formed to provide a simple quantitative comparison:
Query 1: select all provenance statements [BGP lookup]
Query 2: select all triples for a given data source [BGP with a filter on the qualifier]
Query 3: select all data sources for a given subject [BGP with a filter on the subject, retrieving the value of a qualifier]
Query 4: select all triples for a specific predicate ordered by time [CGP with ORDER BY on the qualifier and a filter on the predicate]
Query 5: select all triples derived from a specific location at specific time [BGP with filters on two qualifiers]
[A benchmark for evaluating the performance of queries that handle context in any domain. It has no NGP query (property path, variable-length path).]
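To show what the five query types compute, here is a small sketch over statements that carry three provenance qualifiers. It is plain Python rather than SPARQL, and the data model (a `Statement` record with `source`, `time`, and `location` fields), the sample data, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str
    time: str       # ISO 8601 date, so string order equals chronological order
    location: str

data = [
    Statement("ex:Alice", "ex:worksFor", "ex:ACME", "ex:hrDatabase", "2020-01-15", "ex:Lisbon"),
    Statement("ex:Bob", "ex:worksFor", "ex:ACME", "ex:webCrawl", "2019-06-01", "ex:Rio"),
    Statement("ex:Alice", "ex:livesIn", "ex:Lisbon", "ex:webCrawl", "2020-02-01", "ex:Rio"),
]

def triple(st):
    return (st.subject, st.predicate, st.obj)

def q1_all_provenance():
    """Query 1: every provenance statement."""
    return [(triple(st), st.source, st.time, st.location) for st in data]

def q2_triples_from_source(source):
    """Query 2: all triples for a given data source."""
    return [triple(st) for st in data if st.source == source]

def q3_sources_for_subject(subject):
    """Query 3: all data sources for a given subject."""
    return sorted({st.source for st in data if st.subject == subject})

def q4_by_predicate_ordered(predicate):
    """Query 4: all triples for a predicate, ordered by time."""
    ordered = sorted(data, key=lambda st: st.time)
    return [triple(st) for st in ordered if st.predicate == predicate]

def q5_location_and_time(location, time):
    """Query 5: all triples derived from a location at a specific time."""
    return [triple(st) for st in data if st.location == location and st.time == time]

print(q4_by_predicate_ordered("ex:worksFor"))
```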
Conclusions
At a higher level of abstraction, there is a variety of knowledge organization systems that can be utilized in capturing provenance-aware RDF statements, including purpose-built controlled vocabularies and ontologies, and ontologies designed for general or other types of metadata. Storing provenance-aware RDF statements requires solutions that go beyond the capabilities of conventional triplestores, and either encapsulate metadata with the triples, or store more than three columns per statement to capture provenance (quadstores, graph databases). This paper enumerated these solutions, and reviewed how to run queries on provenance-aware RDF statements not only on a single dataset but also on multiple datasets (federated queries), including update operations.
The research interest in RDF data provenance indicates the importance of this field, for intelligent systems implementing Semantic Web standards need provenance manipulating capabilities to be viable, particularly in systems where RDF triples are derived from diverse sources, are generated and processed on the fly, or modified via update queries.
[The importance of provenance context in LOD data that forms KGs for data integration.]