Foundations of Modern Query Languages for Graph Databases

Foundations of Modern Query Languages for Graph Databases - Leitura de Artigo III

Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Domagoj Vrgoč. 2017. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 50, 5, Article 68 (September 2018), 40 pages. DOI:https://doi.org/10.1145/3104031

4 NAVIGATIONAL QUERIES

While graph patterns allow for querying graph databases in a bounded manner, it is often useful to provide more flexible querying mechanisms that allow to navigate the topology of the data. One example of such a query is to find all friends-of-a-friend of some person in a social network... . Here we are not only interested in immediate acquaintances of a person, but also the people she might know through other people; namely, her friends-of-a-friend, their friends, and so on.

Queries such as the one above are called path queries, since they require us to navigate through the graph using paths of potentially arbitrary length. Path queries have long been established as the core of navigational querying in graphs by the research community

It is therefore natural to add path queries to basic graph patterns as the core of graph querying. We call such queries navigational queries.

4.1 Path Queries

Paths are the most basic navigational object in a graph database. The most fundamental type of path query is that of path existence, which asks if there is some directed path between two nodes in a property graph, irrespective of edge labels; in some cases, one or all such paths can be additionally returned. This is a foundational notion related to the problems of reachability and transitive closure in directed graphs

However, in practice, one often needs path queries that impose additional constraints on the path that is to be computed, such as restrictions on edge labels.

** E também nos atributos (par chave-valor das arestas) para filtrar o contexto **

Definition. We can define a path query as having the general form P =x α−→y, where α specifies conditions on the paths we wish to retrieve and x andy denote the endpoints of the path. The end-points x andy can be variables, or specific nodes, or a mix of both ... When used as a path constraint, a regular epression specifies all paths whose edge labels, when concatenated, form a word in the language of the regular expression. Intuitively speaking, regular expressions allow for concatenating paths, for applying a union/disjunction of paths, and for applying a path zero or many times. Path queries specified using regular expressions are commonly known as Regular Path Queries (RPQs).

where “·” denotes concatenation, to match nodes x andy such that x is a person andy is a post that is liked by a (transitive) friend-of-a-friend of x. Finally, we can apply a union of paths to match the liked or disliked posts of transitive friends-of-a-friend of x... where the “|” symbol here denotes a union.

However, there are various navigational operations not supported by RPQs that seem quite natural. RPQs are sometimes thus extended to allow further expressions. One such extension is to allow an inverse operator a− (for a in Lab) to specify the traversal of edges in a backwards direction, ... recently they have been implemented in various systems; for example, extensions of RPQs form the conceptual core of “property paths” in the SPARQL 1.1 standard

Evaluation. To define how path queries are evaluated we need to formalise the notion of a path over graph databases. In a property graphG, a path π is a sequence n1e1n2e2n3 ...nk−1ek−1nk , where k ≥ 1 and with each ei being an edge in G between ni and ni+1. The label of the path π, denoted Lab(π), is the concatenation of its edge labels, namely Lab(π)= a1a2 ...ak−1, where ai is the label of ei. ... To define paths in edge-labelled graphs we need to be more careful, since we do not have edge identifiers in this model, and thus we cannot give the same definition as before. Instead, we define a path π in an edge-labelled graph G as a sequence: n1a1n2a2n3 ...nk−1ak−1nk,where (ni,ai,ni+1) is an edge in G for all i <k. In this case the label is simply Lab(π)= a1a2 ...ak−1. As in the case of property graphs, a single node n forms a zero-length path with the label ε.

The evaluation of a path query P =x α−→y over G, denoted P(G), then consists of all paths in G whose label satisfies α.

As in the case of graph patterns, different practical considerations—for example, the possibility of having paths involving cycles—give rise to different semantics for the evaluation for path queries or, more specifically, for which paths are included in P(G). (i) Arbitrary path semantics (ii) Shortest path semantics (iii) No-repeated-node semantics (simple paths) (iv) No-repeated-edge semantics

Output. As hinted at previously, a user may have different types of questions with respect to the paths contained in the evaluation P(G), such as: Does there exist any such path? Is a particular path π contained in P(G)? What are the pairs of nodes connected by a path in P(G)? What are (some of) the paths in P(G)? We can categorise such questions by what they return as results: (i) Boolean(true / false) (ii) Nodes (iii) Nodes (iv) Graphs

While the first two types of answers can be handled under, for example, a standard relational algebra, there is currently no consensus on how to represent paths as the output of a query. In particular, unlike solutions to graph patterns that have a fixed-arity output, paths do not have a fixed-arity, therefore we cannot directly define a mapping from variables to constants as in the case of a bgp match. Likewise, although returning graphs as queries is supported in SPARQL (Harris and Seaborne 2013) through CONSTRUCT, graph creation is only supported as a final step, where such graphs cannot be manipulated further by other operators.

Sets vs. bags. In the case of queries that return a boolean value or a graph as a result, there is no distinction between bag or set semantics. Likewise, in the case that full paths—that is, the complete sequence of nodes and edges in each path—are returned, no duplicates can occur and there is no such distinction. However, if nodes are returned, or nodes/edges are projected from a full path, then bag semantics are distinguished from set semantics. In particular, if we consider the case where we are returning end nodes of our path as output, when using set semantics, a pair (n,n′) will be returned exactly once when there is at least one path in P(G)connecting n with n′, and zero times otherwise; when using bag semantics, this now changes, and a pair (n,n′) is returned once for each full path in P(G) connecting n with n′.

4.2 Adding Paths to Basic Graph Patterns

Now that we understand how path queries can be used to match paths and how graph patterns can be used to match sub-graphs, we can combine them to produce a powerful query language that allows to find more flexible matches. In particular, this language allows to express that some edges in a graph pattern should be replaced by a path (satisfying certain conditions) instead of a single edge.

Combining path queries with basic graph patterns (bgps) gives rise to navigational graph patterns (ngps). In the case of edge-labelled graphs, ngps are defined similarly as bgps: namely, they are edge-labelled graphs where nodes can be constants or variables, and the edge labels can be constants, variables, RPQs,10 or the special symbol ∗denoting an arbitrary path.

Navigational graph patterns have received a lot of attention in the theoretical literature under the name conjunctive regular path queries (CRPQs) ... A natural extension of ngps is to consider complex navigational graph patterns (cngps) by taking the closure of ngps under the relational operations of selection, projection, join, union, difference, and optional, ...

For the navigational languages we have seen thus far, paths are the only form of recursion allowed. However, to express certain types of queries, we may require more expressive forms of recursion.

4.3 Navigational Queries in Practice

SPARQL. Since version 1.1 (Harris and Seaborne 2013), SPARQL permits the use of property paths, which are an extended form of regular expression that, beyond usual RPQs, also allow inverses and a limited form of negation (Kostylev et al. 2015). As a consequence, we can express any path query from Example 4.2 using SPARQL 1.1.

In one aspect, SPARQL goes beyond RPQs and allows for a (very) limited form of negation called negated property sets Kostylev et al. (2015). This is done by allowing subexpressions of the form !{e1,...,en}inside property paths, which will match to all pairs of nodes connected by some edge whose label is not in the set {e1,...,en}. Apart from ordinary labels, negated property sets can also include inverse-edge labels.

As aforementioned in the discussion on set vs. bag semantics in Section 4.1, in a draft of the SPARQL 1.1 standard, the original semantics of property paths was based on simple paths with a bag semantics. However, since it was shown that such a semantics quickly renders query evaluation impractical (Arenas et al. 2012; Losemann and Martens 2013), the semantics was changed. Now, to evaluate any query containing the transitive closure operator (∗or +), SPARQL uses a set semantics, looking for pairs of nodes connected by any path whose label belongs to the language of the regular expression specifying the query. Otherwise, if a property path can be rewritten as a bgp (with projection), SPARQL instead uses the bag semantics defined for bgps (see (Harris and Seaborne 2013, Section 9.3) for more details).

Similarly, SPARQL can also express navigational graph patterns (ngps).

Likewise, SPARQL can express complex navigational graph patterns (cngps).

Cypher. While not supporting full regular expressions, Cypher still allows transitive closure over a single edge label in a property graph. On the other hand, since it is designed to run over property graphs, Cypher also allows the star to be applied to an edge property/value pair; however, this is again limited to a single repeated label/value.

Recall that Cypher uses the no-repeated-edge semantics for cgps; by default, Cypher uses the same semantics for path queries, thus returning all pairs of nodes connected by a path that does not repeat any edges. In fact, Cypher uses a bag semantics, so each pair of nodes will be duplicated for every such path connecting them in the data.

However, Cypher also allows for returning a single shortest path connecting two nodes, or all shortest paths connecting them, allowing the user to declaratively change the semantics for evaluating paths within the query.

MATCH ( julie:Person {firstname:"Julie"} ),
p = shortestPath( (julie) -[:knows*]-> (x:Person) )
RETURN p

Apart from (a restricted form) of RPQs and (c)ngps, Cypher also offers several unique features that make it useful when working with property graphs. First, Cypher allows for specifying the length of the path.

Another interesting feature available in Cypher is the ability to return paths.

Gremlin. Gremlin supports navigation by the use of repeat, which enables arbitrary or fixed iteration of any graph traversal. As per SPARQL, Gremlin uses the arbitrary path semantics for navigational queries. However, unlike SPARQL, Gremlin returns bags and not sets of answers.For a fixed-length iteration, we can use repeat and specify the number of times the repetition should be performed.

G.V().hasLabel('Person').has('name','Clint Eastwood').repeat(out('acts_in').hasLabel('Movies').in('acts_in').hasLabel('Person')).times(2)

If we want arbitrary traversal, then we can simply omit the times command; however, this effectively means iterate an unbounded number of times, and consequently we may never get anything out of this traversal.

Finally, Gremlin also supports returning complete paths as results.

4.4 Complexity of Evaluating Navigational Queries

Path Queries. We concentrate on the complexity of evaluating RPQs, which has received considerable attention in the theoretical literature. This is relevant, since RPQs form the basis of many path query languages.

Arbitrary paths: Determining whether v can be reached from u by a path labelled in the regular expression L can be solved in linear time O(|G|·|L|)

Shortest paths: Applying reachability techniques that return shortest paths (e.g., breadth-first search) in combination with the previous automata-based algorithms, we obtain shortest paths witnessing the constraints stated by RPQs. In particular, computing the set of all pairs of nodes that are linked by a path labelled in L, and for each such pair a shortest path in G witnessing it, can be done in time O(|G|2 ·|L|)

No-repeated-node/edge paths: Under such semantics, the complexity jumps: evaluation becomes NP-complete even in data complexity (Mendelzon and Wood 1989). Tractable instances of the RPQ evaluation problem under these semantics can be found by either restricting regular expressions or the class of graph databases (Mendelzon and Wood 1989; Bagan et al. 2013), but it remains to be seen to what extent such restrictions are relevant in practice. The special case of Q =x ∗−→y can still be computed efficiently, since any shortest path needs to be simple, and thus finding an unconstrained simple path amounts to finding a shortest path.

In summary, finding nodes connected by arbitrary paths or finding a shortest path satisfying an RPQ can be done in polynomial time, whereas considering simple paths, the problem becomes intractable. An open question then is if there are any practical scenarios in which the (intractable) simple path witness is really justified in terms of computational cost over finding (tractable) witnesses based on shortest paths.

Navigational Graph Patterns. Recall that an ngp is a bgp where the edges can also be labelled by an RPQ, or the special symbol ∗denoting an arbitrary path.

5 FINAL REMARKS

The choice of the appropriate semantics for each of these forms of queries has proven to be a non-trivial task, and there have been several proposals coming both from practice and from theory. For matching basic graph patterns we classified the main proposals for the semantics into two categories:
(1) homomorphism-based: matching the pattern onto a graph with no restrictions.
(2) isomorphism-based: one of the following restrictions is imposed on a match:
—no-repeated-anything: no part of a graph is mapped to two different variables,
—no-repeated-node: no node in the graph is mapped to two different variables,
—no-repeated-edge: no edges in the graph is mapped to two different variables.

This growing diversity of graph database technologies moreover suggests that the time may come for further standardisation of graph query languages. While SPARQL has been formally standardised for RDF databases and has been well studied in the literature, many implementers have opted for custom graph database solutions and engines with custom languages, such as Neo4j with Cypher. Likewise, ad hoc standards like Gremlin have emerged in recent years and have been implemented by multiple vendors. However, unlike SPARQL, the semantics and complexity of languages like Cypher and Gremlin have not been studied. Looking to the future, one can thus expect standardisation efforts to rigorously define and characterise the properties of a general graph query language.

Pesquisa de Doutorado da Veronica

Pesquisar este blog

Foundations of Modern Query Languages for Graph Databases - Leitura de Artigo III

Marcadores

Comentários

Postar um comentário

Postagens mais visitadas deste blog

Connected Papers: Uma abordagem alternativa para revisão da literatura

Knowledge Graph Embedding with Triple Context - Leitura de Abstract

Exploratory Search: From Finding to Understanding - Leitura de Artigo