Foundations of Modern Query Languages for Graph Databases

Foundations of Modern Query Languages for Graph Databases - Leitura de Artigo II

Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Domagoj Vrgoč. 2017. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 50, 5, Article 68 (September 2018), 40 pages. DOI:https://doi.org/10.1145/3104031

3 GRAPH PATTERNS

One of the earliest such languages to be adopted by multiple vendors—for the purposes of querying RDF graphs was SPARQL (SPARQL Protocol and RDF Query Language), which was initially standardised by the W3C in 2008 (Prud’hommeaux and Seaborne 2008), with an updated version called SPARQL 1.1 published in 2013 (Harris and Seaborne 2013).

With respect to property graphs, perhaps the most well-known implementation thereof is the Neo4j engine, whose development team released a declarative query language called Cypher (The Neo4j Team 2016).

Another query language for property graphs is Gremlin (Apache TinkerPop 2017), which forms an important part of the Apache TinkerPop3 graph computing framework.

The simplest form of graph pattern is a basic graph pattern (BGP), which is a graph-structured query that should be matched against the graph database.

Additionally, basic graph patterns can be augmented with other (relational-like) features, such as projection, union, optional, and difference. These allow for refining what sorts of matches are allowed and, ultimately, what results are returned. We call basic graph patterns augmented with such features complex graph patterns.

3.1 Basic Graph Patterns

Basic graph patterns (bgps) follow the same structure as the type of graph database they are intended to query but instead of only allowing constants, basic graph patterns also permit variables. In other words, a bgp for querying an edge-labelled graph is just an edge-labelled graph where variables can now appear as nodes or edge labels; a bgp for querying property graphs is just a property graph where variables can appear in place of any constant.

A match for a bgp is a mapping from variables to constants such that when the mapping is applied to the bgp, the result is, roughly speaking, contained within the original graph database. The results for a bgp are then all mappings from variables in the query to constants in the database that comprise a match.

Evaluation. Evaluating a bgp Q against a graph database G corresponds to listing all possible matches of Q with respect to G.

In technical terms, a match h corresponds to a homomorphism from Q to G (see, e.g., Barceló (2013)), whereby multiple variables in Q can map to the same term in G ... it may be desirable to require that variables map to distinct terms, where these latter two matches would be dropped; in other words it may be desirable to restrict h to be an injective (i.e., one-to-one) mapping, in which case the matching process corresponds to the well-known notion of subgraph isomorphism

These preferences lead to different semantics for the evaluation of a bgp Q over a graph database G: (i) Homomorphism-based semantics; (ii) Isomorphism-based semantics

3.2 Complex Graph Patterns

In terms of traditional relational operations, basic graph patterns (bgps) cover the natural join, and selection based on equality (since constants can be embedded into a bgp). Complex graph patterns (cgps) extend bgps with further traditional relational operations—namely projection, union, difference, optional (aka left-outer-join), and filter (which covers selection).

Projection. We call the set of variables for which Q(G)potentially returns matches the output variables of the graph pattern Q (which is independent of G). For a bgp, this is always the set of all variables in a query. However, projection allows for selecting a subset of the output variables of a graph pattern as the new output variables: it allows for stating which variables are deemed relevant in the evaluation of a cgp.

Join. While the join of two bgps can be easily expressed as another bgp (under homomorphism-based semantics), more complex graph patterns or different semantics require the explicit use of this operator. This corresponds to the usual relational join (more specifically, a natural join) over the queries that are defined by two graph patterns Q1 and Q2. The output variables of this join corresponds to the union of the output variables of Q1 and Q2, and its evaluation contains all matches that can be obtained by joining a match in the evaluation of Q1 with a match in the evaluation of Q2.

Union and Difference. Let Q1 and Q2 be two graph patterns. The union of Q1 and Q2 is a complex graph pattern whose evaluation is defined as the union of the evaluations of Q1 and Q2;

Optional. This operator is based on the join of two graph patterns Q1 and Q2, but instead of dismissing those matches in the evaluation of Q1 that cannot be joined with a match in the evaluation of Q2, it keeps them in the result to maximise the amount of information retrieved. ... This feature is particularly useful when dealing with incomplete information, or in cases where the user may not know what information is available.

Filter. Users may wish to restrict the matches of a cgp over a graph database G based on some of the intermediate values returned using, for example, inequalities, or other types of expressions. ... Applying a filter over a graph pattern does not change its output variables. In general, the filter expression covers the usual conditions permitted by the selection relational operator, including inequalities; boolean connectives such as and, or, or not, and so on.

3.3 Graph Patterns in Practice

We now take a closer look at how graph patterns are applied in three practical query languages: SPARQL, Cypher, and Gremlin. We choose these languages because they are the most widely used query languages in practice but offer significant differences: SPARQL operates over RDF graphs; Cypher is designed to operate over property graphs as defined previously; meanwhile, Gremlin is more imperative in nature than the other two, and is geared more towards graph traversal than graph pattern matching.

SPARQL. SPARQL is a declarative language recommended by the W3C for querying RDF graphs Several triple patterns can be combined (conjunctively) into a basic graph pattern. On top of basic graph patterns, SPARQL also supports all of the complex graph pattern features discussed previously (and more besides). The evaluation of bgps in SPARQL is done following homomorphism-based bag semantics. ... Likewise, rather than operate over a single graph, SPARQL operates over collections of graphs, called “Named Graphs,” which allow for selecting customised partitions of the data over which queries should be executed.

Cypher. Cypher is a declarative language for querying property graphs that uses “patterns” as its main building blocks (The Neo4j Team 2016). Patterns are expressed syntactically following a “pictorial” intuition to encode nodes and edges with arrows between them. Unlike SPARQL, Cypher uses isomorphism-based no-repeated-edges bag semantics.

Example 3.11. The pattern in Figure 7(b) would be written in Cypher as

MATCH (x1:Person) -[:acts_in]-> (:Movie {title:"Unforgiven"}) <-[:acts_in]- (x2:Person)}

RETURN x1,x2

The MATCH clause specifies the bgp in question. Nodes are written inside “( ) ” brackets and edges inside “[ ] ” brackets. Filters for labels can be written after the node separated with a “:” symbol, such that (x1:Person) represents a node x1 that must match to a node labelled Person. Specific values for properties can be specified within “{ } ” brackets; for instance (:Movie {title:‘‘Unforgiven’’}) represents a node that must match to a node labelled Movie and that must have value Unforgiven for the property title. The RETURN clause can be used to project the output variables. Implicit projection is also allowed inside the pattern itself by simply omitting some of the variables; we have done this for the edges and the node with label Movie. If a variable x stores a node or edge id, then Cypher offers an “.” operator to refer to the value of some property of x. For instance, in our previous example, we can refer to the value of the property name for the variable x1 by using notation “x1.name” and thus use “RETURN x1.name,x2.name” to return the actors’ names (rather than their node ids).

Cypher supports union, difference, optional, and filter.

Gremlin. The last language we review is Gremlin: the query language of the Apache Tinker-Pop3 graph Framework (Apache TinkerPop 2017). Although Gremlin is also specified with the property graph model in mind, it differs quite significantly from the previous two declarative languages and has a more “functional” feel: while SPARQL and Cypher have obvious influences from SQL for example, Gremlin feels more like a programming language interface.7 Likewise, its focus is on navigational queries rather than matching patterns; however, amongst the “graph traversal” operations that it defines, we can find familiar graph pattern matching features. Similarly to SPARQL, Gremlin also uses the homomorphism-based bag semantics.

G.V().hasLabel('Person').has('name','Clint Eastwood') .out('acts_in').hasLabel('Movie')

The call G.V() will return the set of all nodes in the graph (V stands for “vertex”). We then apply two selections on the set of nodes, where the sequence of calls G.V().hasLabel('Person').has('name','Clint Eastwood') retrieves precisely those nodes with label Person and name Clint Eastwood. The command out('acts_in') retrieves all nodes that can be reached from these latter nodes with an edge labelled acts_in. Finally hasLabel('Movie') filters nodes not labelled with Movie.

Gremlin is most natural when expressing paths, because all such patterns can be simulated by a traversal on the graph.Nevertheless, Gremlin does include a way of specifying more general bgps (including branches and cycles): traversals are used to encode the structure, but nodes can be cross-referenced at different points using variables specified by means of the as command, while the pattern is then evaluated using the match command.

While Gremlin supports bgps, filters and projection, its main focus is on navigational queries, which will be discussed in Section 4. The current version has limited support for declarative-style operators for complex graph patterns. While a “union” command exists, and difference can be emulated by the “drop” command, the current version does not have explicit support for optional.

3.4 The Complexity of Evaluating Graph Patterns

To understand the computational complexity of working with a query language we consider the following evaluation problem: given a query Q in this language, a possible answer h and a graph database or property graph G, verify whether h is an answer to Q over G; that is, verify whether h ∈Q(G). The most basic fragment of graph query languages that needs to be considered is the fragment consisting of bgps and projection, which corresponds to conjunctive queries in relational databases (Abiteboul et al. 1995). The evaluation problem for this fragment is NP-complete for the homomorphism-based semantics and the three versions of the isomorphism-based semantics considered in Section 3.1; NP-hardness can be proven for the former by reduction from the graph homomorphism problem (Hell and Nesetril 2004), while for the latter, it can be established by reduction from the subgraph isomorphism problem (Ullmann 1976).

Furthermore, in practice, one is often interested in matching simple bgps that are not necessarily that difficult to evaluate. Both the graph theory and the database communities have dedicated vast amounts of work to identifying classes of patterns for which the matching problem can be efficiently solved, even in combined complexity.

Connected Papers: Uma abordagem alternativa para revisão da literatura

Durante um projeto de pesquisa podemos encontrar um artigo que nos identificamos em termos de problema de pesquisa e também de solução. Então surge a vontade de saber como essa área de pesquisa se desenvolveu até chegar a esse ponto ou quais desdobramentos ocorreram a partir dessa solução proposta para identificar o estado da arte nesse tema. Podemos seguir duas abordagens: realizar uma revisão sistemática usando palavras chaves que melhor caracterizam o tema em bibliotecas digitais de referência para encontrar artigos relacionados ou realizar snowballing ancorado nesse artigo que identificamos previamente, explorando os artigos citados (backward) ou os artigos que o citam (forward) Mas a ferramenta Connected Papers propõe uma abordagem alternativa para essa busca. O problema inicial é dado um artigo de interesse, precisamos encontrar outros artigos relacionados de "certa forma". Find different methods and approaches to the same subject Track down the state of the art rese...

Pesquisa de Doutorado da Veronica

Pesquisar este blog

Foundations of Modern Query Languages for Graph Databases - Leitura de Artigo II

Marcadores

Comentários

Postar um comentário

Postagens mais visitadas deste blog

Connected Papers: Uma abordagem alternativa para revisão da literatura

Knowledge Graph Embedding with Triple Context - Leitura de Abstract

Exploratory Search: From Finding to Understanding - Leitura de Artigo