@article{PG-Schemas2023,
author = {Angles, Renzo and Bonifati, Angela and Dumbrava, Stefania and Fletcher, George and Green, Alastair and Hidders, Jan and Li, Bei and Libkin, Leonid and Marsault, Victor and Martens, Wim and Murlak, Filip and Plantikow, Stefan and Savkovic, Ognjen and Schmidt, Michael and Sequeda, Juan and Staworko, Slawek and Tomaszuk, Dominik and Voigt, Hannes and Vrgoc, Domagoj and Wu, Mingxi and Zivkovic, Dusan},
title = {PG-Schema: Schemas for Property Graphs},
year = {2023},
issue_date = {June 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {1},
number = {2},
url = {https://doi.org/10.1145/3589778},
doi = {10.1145/3589778},
journal = {Proc. ACM Manag. Data},
month = {jun},
articleno = {198},
numpages = {25},
keywords = {property graphs, schemas, graph databases}
}
ABSTRACT
Property graphs have reached a high level of maturity, witnessed by multiple robust graph database systems as well as the ongoing ISO standardization effort aiming at creating a new standard Graph Query Language (GQL). Yet, despite documented demand, schema support is limited both in existing systems and in the first version of the GQL Standard. It is anticipated that the second version of the GQL Standard will include a rich DDL.
[Schema definition ... could it incorporate context elements, at least in the descriptive option?]
1 INTRODUCTION
The property graph data model ... to represent interconnected multi-labeled data enhanced with properties given by key/value pairs.
In the schema-first scenario, dominating in production settings of stable systems, schema is provided during the setup and plays a prescriptive role, limiting data modifications. In the flexible schema scenario, suitable for rapid application development and data integration, schema information comes together with data and plays a descriptive role, telling users and systems what to expect in the data.
[Schemaful vs. schemaless; schema-on-write vs. schema-on-read. The context schema is descriptive (not prescriptive) and is used at query time, not at data insertion time.]
To illustrate the key features required of schemas for property graphs, we now provide a concrete example of a schema in fraud detection, a common application of graph databases [26], and show how the schema can be used in interactive graph exploration, which in itself is a common functionality provided by graph databases [15, 16, 27, 28, 40].
[Graph exploration is a common, everyday feature of graph databases.]
... Being aware of the schema, the graph explorer leverages the node type definitions to construct a start page that proposes search for any of the entities available in the domain ... Based on the schema information, the graph explorer dynamically constructs a Customer search form. It contains separate search fields for the known properties of Customer nodes ...
[The user has to know the graph schema in order to explore the graph.]
As illustrated by this sample session, the graph explorer would not have been able to effectively guide Andrea through the exploration without concrete schema information. The suggestion of property specific search restrictions is made possible by content types. The schema-assisted query formulation leverages node and edge types.
2 DESIGN REQUIREMENTS
2.1 Property Graphs and Database Schemas
Flexibility. Graph databases are aligned with an iterative and incremental development methodology because they do not require a rigid schema-based data governance mechanism, but rather favor test-driven development, which embraces the additive nature of graphs.
Database schemas have a number of important functions that we split into two general categories [12].
Descriptive function. Schemas provide a key to understanding the semantics of the data stored in a database. More precisely, a schema allows one to construct a (mental) map between real-world information and the structured data used to represent it. This knowledge is essential for any user or application that wishes to access and potentially modify the information stored in a database.
Prescriptive function. A schema is a contract between the database and its users that provides guarantees for reading from the database and limits the possible data manipulations that write to the database. To ensure that the contract is respected, a mandatory schema can be enforced by the database management system.
2.2 Types and Constraints
Type is a property that is assigned to elements (data values, nodes, edges) of a property graph database. Types group together similar elements that represent the same kind of real-world object and/or that share common properties, e.g., the set of applicable operations and the types of their results.
Constraint is a closed formula over a vocabulary that permits quantification over elements of the same type. The purpose of constraints is to impose limitations and to express semantic information about real-world objects.
[Context definitions could be constraints in prescriptive mode.]
2.3 Requirements
Property Graph Types. The descriptive function of schemas can be particularly beneficial to the agility of property graph databases. Indeed, agility requires a good grasp of the correspondence between database objects and real-world entities, which is precisely the descriptive function of schemas.
R1 Node types. Schemas must allow defining types for nodes that specify their labels and properties.
R2 Edge types. Schemas must allow defining types for edges that specify their labels and properties as well as the types of incident nodes.
[The problem with the LPG model is that it does not allow properties to also be edges linking to other nodes of the graph.]
R3 Content types. Schemas must support a practical repertoire of data types in content types.
Property Graph Constraints.
R4 Key constraints. Schemas must allow specifying key constraints on sets of nodes or edges of a given type.
R5 Participation constraints. Schemas must allow specifying participation constraints.
R6 Type hierarchies. Schemas must allow specifying type hierarchies.
Flexibility.
R7 Evolving data. Schemas must allow defining node, edge, and content types with a finely-grained degree of flexibility in the face of evolving data.
R8 Compositionality. Schemas must provide a fine-grained mechanism for compositions of compatible types of nodes and edges.
Usability.
Furthermore, schemas must be easy to derive from graph instances and validation of graph instances with respect to schemas must be efficient. These basic requirements are fundamental for the practical success of any schema solution, as we saw in Section 1.
[It must be possible to extract the schema from the data, as well as to validate the data against a schema.]
R9 Schema generation. There should be an intuitive easy-to-derive constraint-free schema for each property graph that can serve as a descriptive schema in case one is not specified.
R10 Syntax and semantics. The schema language must have an intuitive declarative syntax and a well-defined semantics.
R11 Validation. Schemas must allow efficient validation and validation error reporting.
[R12 The query language should retrieve schema information and flag missing or incorrect elements, without the user having to write specific queries for that purpose.]
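As a rough illustration of R9, a constraint-free descriptive schema can be read off a graph instance by grouping nodes by label set and recording the property keys observed in each group. The Python sketch below is one possible simplification, not the generation procedure of the paper; the function name infer_node_types and the input layout (node id mapped to a labels/record pair) are hypothetical.

```python
from collections import defaultdict


def infer_node_types(nodes):
    """Derive one descriptive node type per distinct label set.

    `nodes` maps a node id to a (labels, record) pair; this layout is an
    assumption made for the sketch, not something fixed by the paper.
    """
    groups = defaultdict(list)
    for labels, record in nodes.values():
        groups[frozenset(labels)].append(record)

    schema = {}
    for labels, records in groups.items():
        key_sets = [set(r) for r in records]
        all_keys = set().union(*key_sets)
        shared_keys = set.intersection(*key_sets)  # keys present on every node
        schema[labels] = {
            key: {
                # Python type names observed for this key in the group
                "types": {type(r[key]).__name__ for r in records if key in r},
                # a key missing from at least one node is reported as optional
                "optional": key not in shared_keys,
            }
            for key in sorted(all_keys)
        }
    return schema


# Hypothetical sample data in the spirit of the fraud detection example.
nodes = {
    "n1": ({"Person", "Customer"}, {"name": "Andrea", "since": "2020-01-01"}),
    "n2": ({"Person", "Customer"}, {"name": "Bo"}),
    "n3": ({"Account"}, {"iban": "XX00"}),
}
schema = infer_node_types(nodes)
```

Because the result only describes what is present, it contains no constraints and can serve as a fallback descriptive schema when none is declared.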
3 DATA MODEL
Definition 3.1 (Property Graph). A property graph is defined as a tuple 𝐺 = (𝑁 , 𝐸, 𝜌, 𝜆, 𝜋) where:
• 𝑁 is a finite set of nodes;
• 𝐸 is a finite set of edges such that 𝑁 ∩ 𝐸 = ∅;
• 𝜌 : 𝐸 → (𝑁 × 𝑁 ) is a total function mapping edges to ordered pairs of nodes (the endpoints of the edge);
[it could be n-ary]
• 𝜆 : (𝑁 ∪ 𝐸) → 2^L is a total function mapping nodes and edges to finite sets of labels (including the empty set);
• 𝜋 : (𝑁 ∪ 𝐸) → R is a function mapping nodes and edges to records.
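Definition 3.1 maps fairly directly onto a small data structure. The following Python sketch is only an illustration: the class name PropertyGraph, the field names, and the is_well_formed check are hypothetical, and records are modeled as plain dictionaries, with a missing entry in pi treated as an empty record (an assumption of this sketch).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Set, Tuple


@dataclass
class PropertyGraph:
    """G = (N, E, rho, lambda, pi); all names here are illustrative."""
    nodes: Set[str] = field(default_factory=set)                    # N
    edges: Set[str] = field(default_factory=set)                    # E
    rho: Dict[str, Tuple[str, str]] = field(default_factory=dict)   # E -> N x N
    lam: Dict[str, FrozenSet[str]] = field(default_factory=dict)    # (N u E) -> 2^L
    pi: Dict[str, Dict[str, Any]] = field(default_factory=dict)     # (N u E) -> R

    def is_well_formed(self) -> bool:
        elements = self.nodes | self.edges
        return (
            self.nodes.isdisjoint(self.edges)          # N and E are disjoint
            and set(self.rho) == self.edges            # rho is total on E
            and all(s in self.nodes and t in self.nodes
                    for s, t in self.rho.values())     # endpoints are nodes
            and set(self.lam) == elements              # lambda is total
            and set(self.pi) <= elements               # records only for elements
        )


g = PropertyGraph(
    nodes={"n1", "n2"},
    edges={"e1"},
    rho={"e1": ("n1", "n2")},
    lam={
        "n1": frozenset({"Person", "Customer"}),
        "n2": frozenset({"Account"}),
        "e1": frozenset({"owns"}),
    },
    pi={"n1": {"name": "Andrea"}},
)

# An edge pointing at a node that is not in N breaks the definition.
dangling = PropertyGraph(
    nodes={"n1"},
    edges={"e1"},
    rho={"e1": ("n1", "n9")},
    lam={"n1": frozenset(), "e1": frozenset()},
)
```

Since 𝜌 returns an ordered pair, edges are binary and directed in this model; a node or edge with no labels is represented by the empty frozenset.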
4 PG-SCHEMAS
4.1 PG-Types by Example
We first discuss the basic ingredients of PG-Types (node types, edge types, and graph types) and then move on to more sophisticated aspects such as inheritance and abstract types.
Generally, there are two main options for creating types in schemas. One can create open types and closed types. Both kinds of types are able to specify content that they require to be present. The difference between the two is what they allow in addition to the explicitly mentioned content: closed types forbid any content that is not explicitly mentioned, whereas open types allow any such content. Closed types are what we have in SQL, but also in programming languages such as C++ and Java. Open types are the default in JSON Schema.
[Closed: only what is in the definition is allowed, and everything in the definition must be satisfied (minimum and maximum). Open: content beyond the definition is allowed as long as the definition is satisfied (minimum).]
Nodes in property graphs carry sets of labels. In PG-Types, we can associate multiple labels to a node type using the &-operator:
(customerType: Person & Customer
  {name STRING, OPTIONAL since DATE})
The node type customerType requires nodes to carry both labels Person and Customer, and no other labels.
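The difference between closed and open types, applied to customerType, can be sketched as a conformance check. This is a hedged Python illustration rather than the formal semantics of Section 4.2: NodeType, conforms, and the is_open flag are hypothetical names, STRING and DATE are mapped to Python's str and datetime.date as an assumption, and open_customer_type is an invented open variant used only for contrast.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, FrozenSet


@dataclass
class NodeType:
    labels: FrozenSet[str]
    required: Dict[str, type]
    optional: Dict[str, type] = field(default_factory=dict)
    is_open: bool = False  # hypothetical flag: True allows undeclared content


def conforms(labels, record, node_type):
    """Check one node (label set plus property record) against a node type."""
    declared = {**node_type.required, **node_type.optional}
    # Both open and closed types demand the declared labels and properties.
    if not node_type.labels <= set(labels):
        return False
    if any(key not in record for key in node_type.required):
        return False
    # Declared properties that are present must carry the declared content type.
    if any(key in declared and not isinstance(value, declared[key])
           for key, value in record.items()):
        return False
    if node_type.is_open:
        return True
    # Closed: no labels or properties beyond the declared ones.
    return set(labels) == node_type.labels and set(record) <= set(declared)


customer_type = NodeType(
    labels=frozenset({"Person", "Customer"}),
    required={"name": str},
    optional={"since": date},
)
open_customer_type = NodeType(
    labels=frozenset({"Person", "Customer"}),
    required={"name": str},
    optional={"since": date},
    is_open=True,
)
```

With these definitions, a node labeled Person, Customer, and VIP fails against customer_type but passes against open_customer_type, which mirrors the "and no other labels" reading above.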
The graph type fraudGraphType contains three node types and one edge type. The keyword STRICT specifies how a property graph should be typed against the schema. It means that, for a graph 𝐺 to be valid w.r.t. fraudGraphType, it should be possible to assign at least one type within fraudGraphType to every node and every edge of 𝐺. The alternative, LOOSE, allows for partial validation, addressing R7. Informally, it means that the validation process simply assigns types to as many nodes and edges in the graph as possible, but without the restriction that every node or edge should receive at least one type. We discuss this further in Section 4.2.
[STRICT: every node and edge must belong to at least one type of the graph schema (minimum). LOOSE: nodes and edges may have types that are not in the graph schema (lax).]
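The STRICT and LOOSE modes differ only in how untyped elements are treated, which a short Python sketch can make concrete. It is an informal illustration under stated assumptions: validate_typing is a hypothetical helper, each type is represented by a predicate over element ids, the sample labels are invented, and constraints are ignored.

```python
def validate_typing(elements, type_checks, strict=True):
    """Assign every matching type to each element and report untyped ones.

    `type_checks` maps a type name to a predicate over element ids; this
    representation is an assumption made for the sketch.
    """
    assigned = {
        element: [name for name, check in type_checks.items() if check(element)]
        for element in elements
    }
    untyped = [element for element, types in assigned.items() if not types]
    # STRICT fails as soon as one element receives no type; LOOSE never does.
    is_valid = not (strict and untyped)
    return is_valid, assigned, untyped


# Hypothetical label sets for three nodes.
labels = {
    "n1": {"Person", "Customer"},
    "n2": {"Account"},
    "n3": {"Robot"},
}
type_checks = {
    "customerType": lambda n: labels[n] == {"Person", "Customer"},
    "accountType": lambda n: labels[n] == {"Account"},
}

strict_result = validate_typing(labels, type_checks, strict=True)
loose_result = validate_typing(labels, type_checks, strict=False)
```

Here n3 matches no type, so the STRICT run reports the graph as invalid while the LOOSE run accepts it and simply leaves n3 untyped.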
4.2 Formal Definition and Semantics
4.3 Validation and Graph Type Generation
The validation for general graph types, defined with the syntax in Section 4.1, can be accomplished efficiently with an analogous procedure thanks to the mathematical simplicity of the schema compilation rules in Section 4.2. More importantly, such a validation procedure can be implemented in a reasonably expressive graph query language. In essence, such a language would need to support standard set operations and would need to allow identification of nodes and edges based on their labels, property names, and property value types. Consequently, the proposed graph schema formalism satisfies requirement R11.
4.4 Adding Constraints
To this end, we leverage existing work on keys for property graphs [4], called PG-Keys. Despite their name, PG-Keys go beyond the capability of expressing key constraints. Statements in PG-Keys are of the form
FOR 𝑝(𝑥) <qualifier> 𝑞(𝑥, 𝑦̄),
where <qualifier> specifies the kind of constraint that is being expressed and consists of combinations of EXCLUSIVE, MANDATORY, and SINGLETON. Both 𝑝(𝑥) and 𝑞(𝑥, 𝑦̄) are queries.
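To make the three qualifiers concrete, the Python sketch below checks them over precomputed query results. The reading of the qualifiers used here is an assumption based on the PG-Keys proposal and is not spelled out in these notes: MANDATORY means every 𝑥 in the scope has at least one value, SINGLETON means at most one, and EXCLUSIVE means no value is shared by two different 𝑥. The function name check_pg_key and the input layout are hypothetical.

```python
from collections import defaultdict


def check_pg_key(scope, pairs, qualifiers):
    """Return the violations of a PG-Keys-style statement.

    scope      -- the elements x returned by p(x)
    pairs      -- (x, y) results of q(x, y), with y a hashable tuple of values
    qualifiers -- subset of {"EXCLUSIVE", "MANDATORY", "SINGLETON"}
    """
    values_of = defaultdict(set)   # x -> values associated with x
    owners_of = defaultdict(set)   # y -> elements carrying that value
    for x, y in pairs:
        if x in scope:
            values_of[x].add(y)
            owners_of[y].add(x)

    violations = {}
    if "MANDATORY" in qualifiers:   # at least one value per x
        violations["MANDATORY"] = {x for x in scope if x not in values_of}
    if "SINGLETON" in qualifiers:   # at most one value per x
        violations["SINGLETON"] = {x for x, ys in values_of.items() if len(ys) > 1}
    if "EXCLUSIVE" in qualifiers:   # no value shared by two different x
        violations["EXCLUSIVE"] = {y for y, xs in owners_of.items() if len(xs) > 1}
    return {kind: found for kind, found in violations.items() if found}


# Hypothetical data: c1 and c2 share a value, c2 has two values, c3 has none.
scope = {"c1", "c2", "c3"}
pairs = [("c1", ("id-1",)), ("c2", ("id-1",)), ("c2", ("id-2",))]
result = check_pg_key(scope, pairs, {"EXCLUSIVE", "MANDATORY", "SINGLETON"})
```

Under this reading, a classical key corresponds to all three qualifiers holding at once, which is why PG-Keys can express more than key constraints by using only some of them.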
5 RELATIONSHIP TO OTHER PARADIGMS
- ... Conceptual data models such as the Entity-Relationship Model [17] (Chen ER), its extensions, i.e., the Extended Entity-Relationship Model [54] (Extended ER) and the Enhanced Entity-Relationship Models [22] (Enhanced ER), as well as ORM2 diagrams [30] and UML Class diagrams [24].
- ... Graph schemas stemming from the Semantic Web setting, by reviewing RDF Schema (RDFS) [14], the Web Ontology Language (OWL) [33], SHACL [18, 36], and ShEx [51].
- ... Tree-structured data formalisms. These are XML, together with its main schema languages (Document Type Definition (DTD) [62], XML Schema [50] and REgular LAnguage for XML Next Generation (RELAX NG) [34]), and JSON Schema [44, 60].
5.1 Existing Graph Schema Features
5.2 Support of the Features
Conceptual data models. ER-based data models tend to be agnostic with respect to attribute types, since these may depend on the back-end for which the data model is designed. Most support inheritance hierarchies and, in that way, can model union and intersection types. Entity types can be modelled as abstract types, by indicating that its entities must belong to at least one of its sub-types. Since the final goal is to design a relational schema, which is closed, none of them support open types. Most ER-based models allow attributes to be composed and/or multi-valued, and so can model complex nested values.
RDF formalisms. RDF-based formalisms inherit XML datatypes with some limitations (PDT). Both SHACL and ShEx are based on a kind of open semantics in which the closedness of a constraint needs to be specified with a keyword (sh:closed in SHACL, CLOSED in ShEx) (OCT). Element properties are expressible only over nodes, except in the recent RDF-star proposal, which extends RDF with properties over edges (EP). SHACL and ShEx are missing explicit support for key-like constraints (KC), but allow for cardinality constraints (CC), to which SHACL applies set semantics and ShEx bag semantics. The complexity of validation (TV) of RDF-based formalisms is a well-researched topic. While validation is not tractable in general for the most expressive cases, practically useful fragments are tractable.
[RDF-star not only allows adding properties to triples but also allows nesting triples.]
5.3 Possible Extensions of PG-Schema
Range constraints. Some schema languages allow for range constraints (RC). The syntax of PG-Schema can be thus extended, specifying restrictions on acceptable values for properties.
6 SUMMARY AND LOOKING AHEAD
A PG-SCHEMA GRAMMAR
The core productions of the PG-Schema grammar in EBNF are presented in Figure 3. The nonterminals label, key, and propertyType are instantiated as string representations of labels (L), keys (K), and base property types (B).
B A GRAPH-BASED DATA CATALOG
An enterprise data catalog is a metadata management tool that companies use to inventory data resources and organize the data within their systems.
C EXISTING GRAPH SCHEMA LANGUAGES
Conceptual Data Models. By conceptual data models we mean here data models that aim to be conceptual in nature, i.e., closer and more faithful to the Universe of Discourse of the stakeholders than traditional database models, such as the relational model, aim to be. Typical examples of conceptual data models are the Entity-Relationship Model [17] and its various extensions which are usually described as EER models and include the Extended Entity-Relationship Model [54] and the various Enhanced Entity-Relationship Models that can be found in database text books such as [22]. Other conceptual data models that were inspired by ER but have some important differences are ORM2 diagrams [30] and UML Class diagrams when used for the conceptual perspective [24].
Typically conceptual data models are not used as native data models of DBMSs or data stores, but are mapped to the data model that is supported by the DBMS that is used in the implementation. Therefore the semantics of diagrams in ER, EER and ORM are usually defined in terms of mappings to the relational model.
While not a graph database technology, in the narrow sense of the word, GraphQL [23, 32] is based on a graph data model and schema formalism (SDL), consisting of a directed, edge-labeled multigraph. Its nodes are JSON-style objects, which contain a collection of fields, each with their own type and with values from back-end data stores (obtained using resolver functions). Its edges are specified by defining object fields, whose types are other object fields. Each object field has a mandatory name and type, as well as zero or more named arguments and directives (optional or required). An argument allows users to send data that can affect the outcome of GraphQL operations. When an argument is optional, it is possible to define a default value. A directive is an additional configuration that can extend each field with annotations. By default, every field in GraphQL is nullable, but users can change this using a non-null type modifier. GraphQL also supports interface and union types.
[Veeeeery straaaaaange]
https://graphql.org/learn/thinking-in-graphs/
AgensGraph [2] is a multi-model database based on PostgreSQL. It supports the property graph data model, alongside the relational one and JSON documents, as well as uniform querying through both the SQL and openCypher languages.
Neo4j [53] is a graph database that leverages the property graph model and its native Cypher query language. While considered to be a schema-free system, Neo4j allows users to enforce the following constraints: unique node property, node and relationship property existence, as well as node key constraints. It also has a mechanism for automatically inferring the schema of a graph instance that, itself, is seen as a property graph that can be queried.