Graph? Yes! Which one? Help! - Paper Reading

Abstract. Amazon Neptune is a graph database service that supports two graph (meta)models: W3C’s Resource Description Framework (RDF) [8] and Labeled Property Graphs (LPG) [15,13].

The choice between the two technology stacks is difficult and requires consideration of data modeling aspects, query language features, their adequacy for current and future use cases, as well as many other factors (including developer preferences).

In this paper, we advocate and explore the idea of a single, unified graph data model that embraces both RDF and LPGs, and naturally supports different graph query languages on top. We investigate obstacles towards unifying the two graph data models, and propose an initial unifying model, dubbed “one graph” (“1G” for short), as the basis for moving forward.

1 Introduction

RDF also supports the reuse of schemas in the form of vocabularies for ontology definition as well as logical reasoning. Not surprisingly, we often see information architects prefer the features of the RDF model because of a good fit with use cases for data alignment, master data management, and data exchange.

** Information integration using a data model such as an ontology **

LPGs, on the other hand, are more in line with familiar programming models, and offer good integration with a number of programming languages. Software developers often choose an LPG language because they find it more natural and more “compatible” with their programming paradigm.

** Aligned with the OO paradigm **

Developers coming from the SQL world often like that a vertex in an LPG is much like a row in a relational database.

** In an RDF graph, object properties and data properties are not differentiated; everything is a triple **

... there are often strong preferences for a particular query language, and there are also situations where one query language is simply better suited because of its particular features (e.g., expressiveness of graph traversals and path queries in Gremlin vs. SERVICE federation in SPARQL [12]).

** The data manipulation (query) language strongly influences the choice of model **

Therefore, in this paper we examine the idea of graph interoperability. That is, removing the obstacles that prevent us from using SPARQL [4] over LPGs, Gremlin [14] or openCypher over RDF, etc. The goal is not merely to be able to cross-use query languages, but to be able to do it in a manner where the user does not have to be cognizant (and careful) about how the interoperability is achieved. In other words, we are interested in a data model that combines both RDF and LPG into a model providing a unified semantics that includes (and generalizes) the specifics of the individual models.

** A "layer" acting as a unified model that could be manipulated with any graph language transparently (or opaquely, in the literal sense ...) **

In addition to flexibility in choosing the query language, the idea is that this unified data would also allow graph users to combine and interlink data sets maintained in both RDF and LPG formats. More generally, the idea of providing a unified data model abstracts away the need for customers to choose a data format ahead of time, therefore removing a major obstacle to graph database adoption.

** This would avoid converting from one model to the other before loading the data into the database **

2 The 1G Model

... graph database (Neptune) that uses a “quad-based” internal representation for both RDF and for LPG, and while this may seem to bias our thinking towards RDF-centric solutions, our broad goal is to not be preferential to either graph model.

In the 1G data model, we represent <graph> data ... using a set of so-called statements of the form
src -label→ value : sid

where src represents a source vertex, label denotes the edge or property label, value denotes either the target vertex or the property value (such as a string, number, etc.), and sid assigns a globally unique ID ....

** Vertex ID x Relation Label x (Vertex ID v Literal) x Edge ID **
** Quad pattern as in other models, e.g., KGTK uses <node1, label, node2, ID> **
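The statement form above can be sketched as a small data structure. This is only an illustrative reading of the excerpt, not the paper's or Neptune's implementation: the names `Ref`, `Statement`, and `is_edge`, and the sids `s1` to `s4`, are hypothetical, and the data is the Alice/Bob toy graph used later in the text.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    """Identifier of a vertex or, via its sid, of another statement (hypothetical helper)."""
    name: str


class Statement(NamedTuple):
    """One 1G statement: src -label-> value : sid."""
    src: Ref                               # source vertex (or a sid)
    label: str                             # edge or property label
    value: Union[Ref, str, int, float]     # target vertex / sid, or a literal
    sid: str                               # globally unique statement ID

    def is_edge(self) -> bool:
        # A statement whose value is an identifier acts as an edge;
        # one whose value is a literal acts as a property.
        return isinstance(self.value, Ref)


# Toy graph; the sids s1..s4 are made up for illustration.
toy = [
    Statement(Ref("Alice"), "knows", Ref("Bob"), "s1"),
    Statement(Ref("Alice"), "name", "Alice", "s2"),
    Statement(Ref("Bob"), "name", "Bob", "s3"),
    Statement(Ref("s1"), "since", 2020, "s4"),   # statement about statement s1
]
```

Wrapping identifiers in `Ref` is one way to keep the vertex `Alice` apart from the string literal "Alice"; other encodings would work equally well.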

For RDF, 1G statements map directly to SPO triples with unique ID;

** Vertex ID (Subject) x Relation Label (Predicate) x (Vertex ID v Literal) (Object) x Edge ID **

... for LPGs, they give us a unified mechanism to represent edges and properties alike. If the type of value is fixed (to either a vertex or a property value), we may refer to statements as edges and properties, respectively.

** Source Vertex ID x Relation Label x Target Vertex ID x Relation Edge ID **

** Source Vertex ID x Property Label x Literal (Property Value) x Property Edge ID **

** Relation Edge ID x Property Label x Literal (Property Value) x Property Edge ID **

The sids can be used in either src or value position, to refer to the statements that they represent (say, to make statements about these statements).

** In this respect it resembles RDF-star and can be used to support hyper-relational KGs **

Conceptually, we distinguish between ground statements (or ground edges and properties, if their value type is known) as statements that do not contain sids, and assertions, represented as statements with sids in src and/or value position.

** Additionally, it makes it possible to distinguish edges about facts (statements) from qualifiers (assertions) **
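The ground/assertion distinction can be sketched as a simple partition. The sketch assumes that sids and vertex identifiers never collide, which the excerpt does not state; `Ref`, `Statement`, `split_ground_and_assertions`, and the sids are hypothetical names.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    name: str  # vertex identifier or sid


class Statement(NamedTuple):
    src: Ref
    label: str
    value: Union[Ref, str, int]
    sid: str


def split_ground_and_assertions(statements):
    """Partition statements by whether a sid occurs in src or value position."""
    sids = {s.sid for s in statements}

    def mentions_sid(s):
        in_src = s.src.name in sids
        in_value = isinstance(s.value, Ref) and s.value.name in sids
        return in_src or in_value

    ground = [s for s in statements if not mentions_sid(s)]
    assertions = [s for s in statements if mentions_sid(s)]
    return ground, assertions


toy = [
    Statement(Ref("Alice"), "knows", Ref("Bob"), "s1"),
    Statement(Ref("Alice"), "name", "Alice", "s2"),
    Statement(Ref("Bob"), "name", "Bob", "s3"),
    Statement(Ref("s1"), "since", 2020, "s4"),
]
ground, assertions = split_ground_and_assertions(toy)
```

With the toy data, only the `since` statement lands in `assertions`, because its source is the sid of the `knows` edge.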

Interoperability between graphs can now be defined as “views” of the 1G model for RDF, RDF-star [5], and LPG. Those views can be understood as specific interpretations (or mappings) of the generic 1G model into a specific graph data model.

** The 1G model would be the conceptual model, the RDF, RDF-star, and LPG models would be logical models, and the quad would be the physical model **

We sketch this idea by providing possible interpretations for our toy graph, for RDF, RDF-star, and LPG, respectively:

# Plain RDF without reification (no support for edge properties)
:Alice :knows :Bob .
:Alice :name "Alice" .
:Bob :name "Bob" .

# RDF-star: edge properties = statements about "reified" triples
:Alice :knows :Bob .
<<:Alice :knows :Bob>> :since 2020 .
:Alice :name "Alice" .
:Bob :name "Bob" .

With LPG, assertions over edges are viewed as edge properties, so our LPG interpretation of the 1G model consists of the following vertices and edges:

v1 with id = Alice and properties {name => "Alice"}
v2 with id = Bob and properties {name => "Bob"}
e1 from v1 to v2 with label knows and properties {since => 2020}

** Qualifiers can be represented with RDF-star and LPG in the 1G model, but in plain RDF these qualifiers do not exist, i.e., they are not converted into any reification option **
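One way to read the RDF and RDF-star "views" is as functions over the same set of 1G statements. The sketch below is an assumption-laden illustration, not the mapping defined in the paper: it handles only sids in src position and a single level of nesting, emits Turtle-like strings with a bare `:` prefix, and all function names and sids are hypothetical.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    name: str


class Statement(NamedTuple):
    src: Ref
    label: str
    value: Union[Ref, str, int]
    sid: str


toy = [
    Statement(Ref("Alice"), "knows", Ref("Bob"), "s1"),
    Statement(Ref("Alice"), "name", "Alice", "s2"),
    Statement(Ref("Bob"), "name", "Bob", "s3"),
    Statement(Ref("s1"), "since", 2020, "s4"),
]


def term(v):
    """Render an identifier as :name, a string as a quoted literal, a number as-is."""
    if isinstance(v, Ref):
        return f":{v.name}"
    if isinstance(v, str):
        return f'"{v}"'
    return str(v)


def triple(s):
    return f"{term(s.src)} :{s.label} {term(s.value)}"


def rdf_view(statements):
    """Plain RDF: keep ground statements, drop statements about statements."""
    sids = {s.sid for s in statements}
    return [triple(s) + " ." for s in statements if s.src.name not in sids]


def rdf_star_view(statements):
    """RDF-star: ground triples plus quoted-triple annotations (one level only)."""
    by_sid = {s.sid: s for s in statements}
    lines = rdf_view(statements)
    for s in statements:
        if s.src.name in by_sid:
            quoted = triple(by_sid[s.src.name])
            lines.append(f"<<{quoted}>> :{s.label} {term(s.value)} .")
    return lines
```

Calling `rdf_view(toy)` yields the three plain triples, and `rdf_star_view(toy)` adds the annotation on the quoted `knows` triple, so the edge property is visible in one view and silently absent in the other.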

3 Interoperability challenges and some proposed solutions

3.1 Challenge #1: Edge properties, multiple edge instances, and reification

A related issue is that LPGs support more than one instance of an edge. The RDF-star effort is adding edge properties to RDF, but – as the draft specification currently stands – not in a way that is completely compatible with LPGs. In RDF-star as currently formulated, a triple < s, p, o > is understood to be unique, in the sense that there cannot exist an identical instance of that triple with its own identity. In LPGs, each edge is considered a unique object with its own identity, and it is perfectly possible to have two edges with identical endpoints and label.

This means that with respect to edge properties (and some other issues notwithstanding) we can take an RDF(-star) graph and map it onto an LPG, but there are cases where the converse is not true, at least not without falling back on more complex modeling approaches (such as introducing intermediate nodes or using named graphs to attach custom statement identifiers).

** RDF-star x LPG interoperability is only partial **

... what an RDF-star view (where the difference cannot be expressed) over such data would look like. One option is to collapse multiple edges (in our example, sid1 and sid2) into a single edge, effectively exposing a single edge with four edge properties when looking at 1G from an RDF-star perspective. This may still allow for some useful SPARQL-star queries (e.g., a query such as “give me all ground facts stated by :NYTimes”) but in the general case information would be lost. Conceptually, this could be seen as a “dimensional reduction” when mapping 1G into RDF-star, which is accounting for the fact that RDF-star is just not expressive enough to capture the distinction between these two edges.

** Semantics are lost when converting from LPG to RDF-star **
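The "collapse" option can be illustrated with a small sketch. The paper's own example data is not reproduced in these notes, so everything below is hypothetical: two parallel `knows` edges (`sid1`, `sid2`), each with a made-up `since` and `statedBy` property, and the helper names `collapse_for_rdf_star` and `facts_stated_by`.

```python
from collections import defaultdict

# (src, label, value, sid): two edge instances with identical endpoints and label.
edges = [
    ("Alice", "knows", "Bob", "sid1"),
    ("Alice", "knows", "Bob", "sid2"),
]
# (sid of the annotated edge, label, value, sid): hypothetical edge properties.
edge_props = [
    ("sid1", "since", 2019, "sid3"),
    ("sid1", "statedBy", "NYTimes", "sid4"),
    ("sid2", "since", 2020, "sid5"),
    ("sid2", "statedBy", "TheGuardian", "sid6"),
]


def collapse_for_rdf_star(edges, edge_props):
    """Merge all instances of the same (s, p, o) into one triple with pooled properties."""
    triple_of = {sid: (s, p, o) for s, p, o, sid in edges}
    collapsed = defaultdict(list)
    for target_sid, label, value, _sid in edge_props:
        collapsed[triple_of[target_sid]].append((label, value))
    return dict(collapsed)


def facts_stated_by(view, source):
    """Ground facts that carry a statedBy property with the given source."""
    return [t for t, props in view.items() if ("statedBy", source) in props]


view = collapse_for_rdf_star(edges, edge_props)
```

After collapsing, the single triple carries four properties and `facts_stated_by(view, "NYTimes")` still answers correctly, but the view can no longer tell which `since` value belongs to which source. That is the information loss described above.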

... as multi-level reification (which can be expressed with RDF-star but not with LPGs), questions whether the reification of a 1G statement via its sid necessarily implies the presence of the underlying statement as a ground statement or not (in our 1G model it does, and in fact the model is not capable of referring to non-asserted statements). While those may sound like technical details, we believe that the challenge of overcoming these “reification gaps” is at the very heart of graph interoperability.

** Other problems when converting from RDF-star to LPG: nested qualifiers, and triples that do not exist on their own, i.e., that are not asserted apart from the qualifier attached to them **

3.2 Challenge #2: Triples vs. graph abstraction

At the very basic level, an RDF graph is defined as a “set of triples”, whereas LPGs are defined as (optionally labeled) nodes with properties that can be connected via labeled edges. The LPG notion of a vertex label is very similar to that of rdf:type. That is, LPG labels classify a vertex.

While the mapping of triples to and from our abstract 1G model is straightforward (in the absence of reification, there is one 1G statement for every triple, with a unique sid assigned), the mapping from LPGs into the 1G model is more challenging. For instance, certain flavors of LPGs allow “stand-alone” nodes without labels and properties, i.e., a node itself that neither carries any information nor is connected to any other node. In fact, such nodes can be extracted via query languages (in Gremlin via g.V() or in openCypher via MATCH (n)), but there is no natural representation for these stand-alone nodes in our 1G model. In Gremlin and openCypher, vertices always have at least a default label, however.

** In RDF graphs a node needs neither an rdf:type nor an rdfs:label, but it must belong to some triple, because an isolated node cannot be expressed in the graph **

Similar questions arise when querying data originating from RDF graphs via LPG query languages. LPG languages typically follow the abstraction layer of vertices and edges. Gremlin, for instance, allows to enumerate all vertices using the query g.V() whereas edges can be extracted via g.E(). In the LPG model, the sets of vertices and edges are disjoint.

The key challenge in defining the semantics of LPG query languages becomes defining the concepts of vertices and edges over our 1G model – which would manifest itself in a formal mapping from 1G to LPGs. Concretely, the question becomes how we define the concepts of vertices, edges, vertex properties, and edge properties in a 1G-to-LPG mapping. The basic idea here would be to map identifiers from src and value position of 1G statements to vertices, and use the sids as edge or property identifiers (depending on whether a statement carries an identifier (such as Alice or Bob) or a literal (such as a string "Alice" or "Bob") in its value position, respectively).

** Mapping these elements is what will allow querying with Gremlin and Cypher **
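The basic 1G-to-LPG idea quoted above (identifiers become vertices, sids become edge or property identifiers) can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's formal mapping: properties are single-valued, statements that fit none of the four LPG concepts are skipped, and `to_lpg`, `Ref`, `Statement`, and the sids are hypothetical names.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    name: str


class Statement(NamedTuple):
    src: Ref
    label: str
    value: Union[Ref, str, int]
    sid: str


def to_lpg(statements):
    """Return (vertices, edges) dictionaries derived from 1G statements."""
    sids = {s.sid for s in statements}
    vertices, edges = {}, {}
    # Pass 1: statements whose src is a vertex give vertices, edges, vertex properties.
    for s in statements:
        if s.src.name in sids:
            continue
        vertices.setdefault(s.src.name, {})
        if isinstance(s.value, Ref):
            if s.value.name in sids:
                continue  # vertex pointing at a statement: no LPG counterpart here
            vertices.setdefault(s.value.name, {})
            edges[s.sid] = {"from": s.src.name, "to": s.value.name,
                            "label": s.label, "properties": {}}
        else:
            vertices[s.src.name][s.label] = s.value
    # Pass 2: literal-valued statements about an edge become edge properties.
    for s in statements:
        if s.src.name in edges and not isinstance(s.value, Ref):
            edges[s.src.name]["properties"][s.label] = s.value
    return vertices, edges


toy = [
    Statement(Ref("Alice"), "knows", Ref("Bob"), "s1"),
    Statement(Ref("Alice"), "name", "Alice", "s2"),
    Statement(Ref("Bob"), "name", "Bob", "s3"),
    Statement(Ref("s1"), "since", 2020, "s4"),
]
vertices, edges = to_lpg(toy)
```

Here `vertices` plays the role of what `g.V()` would enumerate and `edges` that of `g.E()`; the two sets stay disjoint because vertices are keyed by identifier and edges by sid.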

3.3 Challenge #3: Datatype alignment

1G requires a unified type system over RDF and LPGs. As a W3C standard, RDF builds upon the XML Schema definition (XSD) [2] and utilizes primitive XSD datatypes such as strings, numbers, and dates. Since RDF is defined along an Open World paradigm, its datatypes tend to be more extensible and flexible than LPG types. For instance, there is no validation in place that restricts users from adding ill-typed values such as "this is not an integer"^^xsd:integer, and users can also provide language tags for simple string values. Composite types such as lists, bags, and sequences are not available as “primary” literal types but need to be modeled explicitly using RDF containers.

** RDF datatypes are well defined and extensible **
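The ill-typed literal mentioned above can be made concrete with a short sketch. It assumes a literal is just a lexical form paired with a datatype IRI, and it checks only the xsd:integer lexical space (optional sign followed by decimal digits); `TypedLiteral` and `to_python_int` are hypothetical names, not part of any RDF library.

```python
import re
from typing import NamedTuple, Optional

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class TypedLiteral(NamedTuple):
    lexical: str    # the characters between the quotes
    datatype: str   # datatype IRI


def to_python_int(lit: TypedLiteral) -> Optional[int]:
    """Return the integer value, or None if the literal is not a well-typed xsd:integer."""
    if lit.datatype != XSD_INTEGER:
        return None
    # xsd:integer lexical space: optional sign followed by decimal digits.
    if re.fullmatch(r"[+-]?[0-9]+", lit.lexical) is None:
        return None
    return int(lit.lexical)


good = TypedLiteral("42", XSD_INTEGER)
bad = TypedLiteral("this is not an integer", XSD_INTEGER)
```

Both literals can be stored in an RDF graph. The difference only surfaces when a consumer, such as a unified 1G type system, tries to map the literal onto a native value: `good` converts to 42 while `bad` yields `None`.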

For LPGs, on the other hand, there exists (to the best of our knowledge) no formal definition of the type system. In contrast to RDF, LPGs support different sets of composite datatypes as built-in types. This means that, while RDF uses the graph structure to model composite types, in LPGs the attribute value itself is an instance of a composite type (a list, map, etc.). This reflects the general notion of semi-structured JSON documents as attribute values. Semantics of these datatypes, however, are opaque and are typically “delegated” to the underlying implementation language, making it potentially hard to unify graph representation. Generally, we have a menagerie of datatypes to try to reconcile, and needless “baggage” because of the reliance on implementation languages

** The LPG model has native composite types, but the formalization of its datatypes is weak **

3.4 Challenge #4: Graph partitioning

SPARQL defines the notion of named graphs, which are often used to support subgraph management use cases. Named graphs are usually thought of as an extension of the triple model to a quad model with the addition of a (sub)graph identifier. Some users have chosen to treat named graphs as containers (sometimes containers of a single triple) to make “statements about statements” (or sets of statements) in lieu of using the reification mechanism (this, in the absence of the proposed RDF-star scenario). This is outside the defined formal semantics of RDF, since named graphs do not have any semantic theory in the RDF model.

** Using named graphs for reification ended up changing their original purpose **

Rather than having an RDF quad like < s, p, o, g >, in 1G we introduce a “membership relation” (in this paper we use inGraph as a reserved label). The motivation behind this approach is to restore symmetry to the data model instead of privileging named graphs as somehow special. Named graphs are just an application of the statement identifier, ...

From the standpoint of SPARQL semantics, the membership triples are not visible, but instead could be considered an implementation artifact. However, with this proposal they would be part of the data in the data set. SPARQL default graph and named graph semantics are recovered by performing the appropriate operations over the logical model. Again, we do not imply that the physical schema needs to have this same data organization. Since in our proposed solution we treat named graph membership effectively as an “edge or meta property”, this approach can be extended to LPGs.

** This restores the original use of named graphs, at the cost of creating an additional triple with the reserved predicate inGraph **
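The membership relation can be sketched by turning each quad into a 1G statement plus an `inGraph` statement about its sid. This is a rough illustration under stated assumptions: terms are plain strings (identifiers and literals are not distinguished), `None` stands for the default graph, sids are generated sequentially, and the quads, graph names, and function names are all hypothetical.

```python
from itertools import count


def quads_to_1g(quads):
    """Turn (s, p, o, g) quads into 1G statements (src, label, value, sid)."""
    ids = count(1)
    statements = []
    for s, p, o, g in quads:
        sid = f"sid{next(ids)}"
        statements.append((s, p, o, sid))
        if g is not None:
            # Graph membership is an ordinary statement about the statement.
            statements.append((sid, "inGraph", g, f"sid{next(ids)}"))
    return statements


def named_graph(statements, g):
    """Recover the triples of named graph g from the membership statements."""
    members = {src for src, label, value, _sid in statements
               if label == "inGraph" and value == g}
    return [(s, p, o) for s, p, o, sid in statements if sid in members]


quads = [
    ("Alice", "knows", "Bob", "g1"),
    ("Bob", "knows", "Carol", "g2"),
    ("Alice", "name", "Alice", None),   # default graph
]
statements = quads_to_1g(quads)
```

`named_graph(statements, "g1")` returns only the Alice/Bob triple, which is the sense in which named graph semantics are recovered by operations over the logical model, while the `inGraph` statements themselves remain ordinary data.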

3.5 Challenge #5: Graph merging and external identifiers

This is one of the strongest benefits of RDF, and conversely one of the weakest aspects of LPGs: RDF has a specification for graph merging

Allowing both RDF and LPG data to be represented in a single model requires the co-existence of global identifiers (i.e., IRIs coming from RDF data) and local identifiers (i.e., node and edge identifiers in the form of strings, coming from LPG data). Such a distinction would make it possible to load data via both RDF and LPG data formats into the same logical graph; this would give us mere co-existence, without any (initial) overlap in vertex identifiers, labels, etc.

** Coexistence of identifier schemes with different scopes **

The typical use case, however, goes beyond just the mere co-existence of local and global identifiers: a common use case would be to unify elements (such as nodes and edges) originating from RDF and LPGs.

** Distinct identifiers may refer to the same real-world object **

3.6 Challenge #6: Lack of formal foundation

Unlike SPARQL or SQL, LPG query languages – by and large – lack strict formal semantics (in the form of, say, a query algebra). This makes it hard to assess semantic compatibility. Similarly, unlike RDF, formal semantics do not exist for LPGs (or only exist post hoc, as in [17,10]). For LPG query languages, semantics is typically defined informally either via documentation and examples, or via an implementation. Our Gremlin implementation, for instance, is largely based on our interpretation of the informal Tinkerpop specification – which comes with details that are not unambiguously defined – and its reference implementation.

** A formal definition of the LPG model and of its graph query languages is missing **
** The paper does not mention the ISO initiative to unify and formalize the GQL language **

3.7 Challenge #7: Update query semantics

In order to be able to subsume both the RDF(-star) and the LPG models, a unifying graph model needs to be as expressive as the “most expressive” model, in each of the considered dimensions. As we have illustrated in previous examples, certain extensions that are defined for the more expressive model may not have a natural representation in the less expressive model, thus introducing “dimensions” that are invisible when looking at the data from the less expressive model’s perspective. While read query semantics can be unambiguously defined by mapping the 1G model to a lower-dimensional level, the situation becomes more complex for queries that manipulate the data.

** The update process must take the differences in expressiveness between the models into account **

4 Conclusions and the way forward

A recent survey [1] of organizations working on (or considering to adopt) knowledge graphs found that interoperability and standards are the highest priority among survey respondents. Data integration was seen as the dominant use case. These findings could be interpreted to suggest a need for RDF/LPG compatibility and unification. While it may seem that making RDF and LPGs fully compatible is not possible (as per the official RDF specifications and the emerging RDF-star work), we believe there is a way forward. Minimally, we must address the challenges of edge identity (multiple similar edges), graph merging, and well-defined semantics for updates across languages. One way forward would be to define some kind of “compatibility subset” to cover enough ground so that most RDF and LPG applications would work with no or minimal modifications. Lack of interoperability slows the overall adoption of graph technologies, and thus should be a high-priority item to be addressed by the broader graph community.

** Amazon Neptune currently supports both the LPG and the RDF models but has not yet implemented the 1G model; the specification of a language for manipulating 1G, and of the mappings to the individual models, is still open **

About AMAZON NEPTUNE

https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html

https://docs.aws.amazon.com/neptune/latest/userguide/graph-database-query-languages.html
