Pular para o conteúdo principal

Foundations of Modern Query Languages for Graph Databases - Leitura de Artigo III

Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Domagoj Vrgoč. 2017. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 50, 5, Article 68 (September 2018), 40 pages. DOI:https://doi.org/10.1145/3104031
 
4 NAVIGATIONAL QUERIES
 
While graph patterns allow for querying graph databases in a bounded manner, it is often useful to provide more flexible querying mechanisms that allow to navigate the topology of the data. One example of such a query is to find all friends-of-a-friend of some person in a social network... . Here we are not only interested in immediate acquaintances of a person, but also the people she might know through other people; namely, her friends-of-a-friend, their friends, and so on.
 
Queries such as the one above are called path queries, since they require us to navigate through the graph using paths of potentially arbitrary length. Path queries have long been established as the core of navigational querying in graphs by the research community
 
It is therefore  natural to add path queries to basic graph patterns as the core of graph querying. We call such queries navigational queries.
 
4.1 Path Queries
 
Paths are the most basic navigational object in a graph database. The most fundamental type of path query is that of path existence, which asks if there is some directed path between two nodes in a property graph, irrespective of edge labels; in some cases, one or all such paths can be additionally returned. This is a foundational notion related to the problems of reachability and transitive closure in directed graphs  
 
However, in practice, one often needs path queries that impose additional constraints on the path that is to be computed, such as restrictions on edge labels.  
 
** E também nos atributos (par chave-valor das arestas) para filtrar o contexto **
 
Definition. We can define a path query as having the general form P =x αy, where α specifies conditions on the paths we wish to retrieve and x andy denote the endpoints of the path. The end-points x andy can be variables, or specific nodes, or a mix of both ... When used as a path constraint, a regular epression specifies all paths whose edge labels, when concatenated, form a word in the language of the regular expression. Intuitively speaking, regular expressions allow for concatenating paths, for applying a union/disjunction of paths, and for applying a path zero or many times. Path queries specified using regular expressions are commonly known as Regular Path Queries (RPQs).
 
where “·” denotes concatenation, to match nodes x andy such that x is a person andy is a post that is liked by a (transitive) friend-of-a-friend of x. Finally, we can apply a union of paths to match the liked or disliked posts of transitive friends-of-a-friend of x... where the “|” symbol here denotes a union.
 
However, there are various navigational operations not supported by RPQs that seem quite natural. RPQs are sometimes thus extended to allow further expressions. One such extension is to allow an inverse operator a (for a in Lab) to specify the traversal of edges in a backwards direction, ... recently they have been implemented in various systems; for example, extensions of RPQs form the conceptual core of “property paths” in the SPARQL 1.1 standard 
 
Evaluation. To define how path queries are evaluated we need to formalise the notion of a path over graph databases. In a property graphG, a path π is a sequence n1e1n2e2n3 ...nk1ek1nk , where k 1 and with each ei being an edge in G between ni and ni+1. The label of the path π, denoted Lab(π), is the concatenation of its edge labels, namely Lab(π)= a1a2 ...ak1, where ai is the label of ei.   ... To define paths in edge-labelled graphs we need to be more careful, since we do not have edge identifiers in this model, and thus we cannot give the same definition as before. Instead, we define a path π in an edge-labelled graph G as a sequence: n1a1n2a2n3 ...nk1ak1nk,where (ni,ai,ni+1) is an edge in G for all i <k. In this case the label is simply Lab(π)= a1a2 ...ak1. As in the case of property graphs, a single node n forms a zero-length path with the label ε.
 
The evaluation of a path query P =x αy over G, denoted P(G), then consists of all paths in G whose label satisfies α.
 
As in the case of graph patterns, different practical considerations—for example, the possibility of having paths involving cycles—give rise to different semantics for the evaluation for path queries or, more specifically, for which paths are included in P(G). (i) Arbitrary path semantics (ii) Shortest path semantics (iii) No-repeated-node semantics (simple paths) (iv) No-repeated-edge semantics 
 
Output. As hinted at previously, a user may have different types of questions with respect to the paths contained in the evaluation P(G), such as: Does there exist any such path? Is a particular path π contained in P(G)? What are the pairs of nodes connected by a path in P(G)? What are (some of) the paths in P(G)? We can categorise such questions by what they return as results: (i) Boolean(true / false) (ii) Nodes (iii) Nodes (iv) Graphs
 
While the first two types of answers can be handled under, for example, a standard relational algebra, there is currently no consensus on how to represent paths as the output of a query. In particular, unlike solutions to graph patterns that have a fixed-arity output, paths do not have a fixed-arity, therefore we cannot directly define a mapping from variables to constants as in the case of a bgp match. Likewise, although returning graphs as queries is supported in SPARQL (Harris and Seaborne 2013) through CONSTRUCT, graph creation is only supported as a final step, where such graphs cannot be manipulated further by other operators.
 
Sets vs. bags. In the case of queries that return a boolean value or a graph as a result, there is no distinction between bag or set semantics. Likewise, in the case that full paths—that is, the complete sequence of nodes and edges in each path—are returned, no duplicates can occur and there is no such distinction. However, if nodes are returned, or nodes/edges are projected from a full path, then bag semantics are distinguished from set semantics. In particular, if we consider the case where we are returning end nodes of our path as output, when using set semantics, a pair (n,n) will be returned exactly once when there is at least one path in P(G)connecting n with n, and zero times otherwise; when using bag semantics, this now changes, and a pair (n,n) is returned once for each full path in P(G) connecting n with n.
 
4.2 Adding Paths to Basic Graph Patterns
 
Now that we understand how path queries can be used to match paths and how graph patterns can be used to match sub-graphs, we can combine them to produce a powerful query language that allows to find more flexible matches. In particular, this language allows to express that some edges in a graph pattern should be replaced by a path (satisfying certain conditions) instead of a single edge.
 
Combining path queries with basic graph patterns (bgps) gives rise to navigational graph patterns (ngps). In the case of edge-labelled graphs, ngps are defined similarly as bgps: namely, they are edge-labelled graphs where nodes can be constants or variables, and the edge labels can be constants, variables, RPQs,10 or the special symbol denoting an arbitrary path.  
 
Navigational graph patterns have received a lot of attention in the theoretical literature under the name conjunctive regular path queries (CRPQs) ...   A natural extension of ngps is to consider complex navigational graph patterns (cngps) by taking the closure of ngps under the relational operations of selection, projection, join, union, difference, and optional, ... 
 
For the navigational languages we have seen thus far, paths are the only form of recursion allowed. However, to express certain types of queries, we may require more expressive forms of recursion.
 
4.3 Navigational Queries in Practice
 
SPARQL. Since version 1.1 (Harris and Seaborne 2013), SPARQL permits the use of property paths, which are an extended form of regular expression that, beyond usual RPQs, also allow inverses and a limited form of negation (Kostylev et al. 2015). As a consequence, we can express any path query from Example 4.2 using SPARQL 1.1.
 
In one aspect, SPARQL goes beyond RPQs and allows for a (very) limited form of negation called negated property sets Kostylev et al. (2015). This is done by allowing subexpressions of the form !{e1,...,en}inside property paths, which will match to all pairs of nodes connected by some edge whose label is not in the set {e1,...,en}. Apart from ordinary labels, negated property sets can also include inverse-edge labels.
 
As aforementioned in the discussion on set vs. bag semantics in Section 4.1, in a draft of the SPARQL 1.1 standard, the original semantics of property paths was based on simple paths with a bag semantics. However, since it was shown that such a semantics quickly renders query evaluation impractical (Arenas et al. 2012; Losemann and Martens 2013), the semantics was changed. Now, to evaluate any query containing the transitive closure operator (or +), SPARQL uses a set semantics, looking for pairs of nodes connected by any path whose label belongs to the language of the regular expression specifying the query. Otherwise, if a property path can be rewritten as a bgp (with projection), SPARQL instead uses the bag semantics defined for bgps (see (Harris and Seaborne 2013, Section 9.3) for more details).
 
Similarly, SPARQL can also express navigational graph patterns (ngps).
 
Likewise, SPARQL can express complex navigational graph patterns (cngps).
 
Cypher. While not supporting full regular expressions, Cypher still allows transitive closure over a single edge label in a property graph. On the other hand, since it is designed to run over property graphs, Cypher also allows the star to be applied to an edge property/value pair; however, this is again limited to a single repeated label/value. 

Recall that Cypher uses the no-repeated-edge semantics for cgps; by default, Cypher uses the same semantics for path queries, thus returning all pairs of nodes connected by a path that does not repeat any edges. In fact, Cypher uses a bag semantics, so each pair of nodes will be duplicated for every such path connecting them in the data.
 
However, Cypher also allows for returning a single shortest path connecting two nodes, or all shortest paths connecting them, allowing the user to declaratively change the semantics for evaluating paths within the query.
 
MATCH ( julie:Person {firstname:"Julie"} ),
p = shortestPath( (julie) -[:knows*]-> (x:Person) )
RETURN p
 
Apart from (a restricted form) of RPQs and (c)ngps, Cypher also offers several unique features that make it useful when working with property graphs. First, Cypher allows for specifying the length of the path. 
 
Another interesting feature available in Cypher is the ability to return paths.
 
Gremlin. Gremlin supports navigation by the use of repeat, which enables arbitrary or fixed iteration of any graph traversal. As per SPARQL, Gremlin uses the arbitrary path semantics for navigational queries. However, unlike SPARQL, Gremlin returns bags and not sets of answers.For a fixed-length iteration, we can use repeat and specify the number of times the repetition should be performed. 
 
G.V().hasLabel('Person').has('name','Clint Eastwood').repeat(out('acts_in').hasLabel('Movies').in('acts_in').hasLabel('Person')).times(2)
 
If we want arbitrary traversal, then we can simply omit the times command; however, this effectively means iterate an unbounded number of times, and consequently we may never get anything out of this traversal.  
 
 
 
Finally, Gremlin also supports returning complete paths as results.
   
4.4 Complexity of Evaluating Navigational Queries
 
Path Queries. We concentrate on the complexity of evaluating RPQs, which has received considerable attention in the theoretical literature. This is relevant, since RPQs form the basis of many path query languages. 
 
Arbitrary paths: Determining whether v can be reached from u by a path labelled in the regular expression L can be solved in linear time O(|G|·|L|)
 
Shortest paths: Applying reachability techniques that return shortest paths (e.g., breadth-first search) in combination with the previous automata-based algorithms, we obtain shortest paths witnessing the constraints stated by RPQs. In particular, computing the set of all pairs of nodes that are linked by a path labelled in L, and for each such pair a shortest path in G witnessing it, can be done in time O(|G|2 ·|L|)
 
No-repeated-node/edge paths: Under such semantics, the complexity jumps: evaluation becomes NP-complete even in data complexity (Mendelzon and Wood 1989). Tractable instances of the RPQ evaluation problem under these semantics can be found by either restricting regular expressions or the class of graph databases (Mendelzon and Wood 1989; Bagan et al. 2013), but it remains to be seen to what extent such restrictions are relevant in practice. The special case of Q =x y can still be computed efficiently, since any shortest path needs to be simple, and thus finding an unconstrained simple path amounts to finding a shortest path.
 
In summary, finding nodes connected by arbitrary paths or finding a shortest path satisfying an RPQ can be done in polynomial time, whereas considering simple paths, the problem becomes intractable. An open question then is if there are any practical scenarios in which the (intractable) simple path witness is really justified in terms of computational cost over finding (tractable) witnesses based on shortest paths.
 
Navigational Graph Patterns. Recall that an ngp is a bgp where the edges can also be labelled by an RPQ, or the special symbol denoting an arbitrary path. 
 
5 FINAL REMARKS
 
 
The choice of the appropriate semantics for each of these forms of queries has proven to be a non-trivial task, and there have been several proposals coming both from practice and from theory. For matching basic graph patterns we classified the main proposals for the semantics into two categories:
(1) homomorphism-based: matching the pattern onto a graph with no restrictions.
(2) isomorphism-based: one of the following restrictions is imposed on a match:
no-repeated-anything: no part of a graph is mapped to two different variables,
no-repeated-node: no node in the graph is mapped to two different variables,
no-repeated-edge: no edges in the graph is mapped to two different variables.
 

This growing diversity of graph database technologies moreover suggests that the time may come for further standardisation of graph query languages. While SPARQL has been formally standardised for RDF databases and has been well studied in the literature, many implementers have opted for custom graph database solutions and engines with custom languages, such as Neo4j with Cypher. Likewise, ad hoc standards like Gremlin have emerged in recent years and have been implemented by multiple vendors. However, unlike SPARQL, the semantics and complexity of languages like Cypher and Gremlin have not been studied. Looking to the future, one can thus expect standardisation efforts to rigorously define and characterise the properties of a general graph query language. 
 
    
 
   
  
 

Comentários

Postagens mais visitadas deste blog

Connected Papers: Uma abordagem alternativa para revisão da literatura

Durante um projeto de pesquisa podemos encontrar um artigo que nos identificamos em termos de problema de pesquisa e também de solução. Então surge a vontade de saber como essa área de pesquisa se desenvolveu até chegar a esse ponto ou quais desdobramentos ocorreram a partir dessa solução proposta para identificar o estado da arte nesse tema. Podemos seguir duas abordagens:  realizar uma revisão sistemática usando palavras chaves que melhor caracterizam o tema em bibliotecas digitais de referência para encontrar artigos relacionados ou realizar snowballing ancorado nesse artigo que identificamos previamente, explorando os artigos citados (backward) ou os artigos que o citam (forward)  Mas a ferramenta Connected Papers propõe uma abordagem alternativa para essa busca. O problema inicial é dado um artigo de interesse, precisamos encontrar outros artigos relacionados de "certa forma". Find different methods and approaches to the same subject Track down the state of the art rese...

Knowledge Graph Embedding with Triple Context - Leitura de Abstract

  Jun Shi, Huan Gao, Guilin Qi, and Zhangquan Zhou. 2017. Knowledge Graph Embedding with Triple Context. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM '17). Association for Computing Machinery, New York, NY, USA, 2299–2302. https://doi.org/10.1145/3132847.3133119 ABSTRACT Knowledge graph embedding, which aims to represent entities and relations in vector spaces, has shown outstanding performance on a few knowledge graph completion tasks. Most existing methods are based on the assumption that a knowledge graph is a set of separate triples, ignoring rich graph features, i.e., structural information in the graph. In this paper, we take advantages of structures in knowledge graphs, especially local structures around a triple, which we refer to as triple context. We then propose a Triple-Context-based knowledge Embedding model (TCE). For each triple, two kinds of structure information are considered as its context in the graph; one is the out...

KnOD 2021

Beyond Facts: Online Discourse and Knowledge Graphs A preface to the proceedings of the 1st International Workshop on Knowledge Graphs for Online Discourse Analysis (KnOD 2021, co-located with TheWebConf’21) https://ceur-ws.org/Vol-2877/preface.pdf https://knod2021.wordpress.com/   ABSTRACT Expressing opinions and interacting with others on the Web has led to the production of an abundance of online discourse data, such as claims and viewpoints on controversial topics, their sources and contexts . This data constitutes a valuable source of insights for studies into misinformation spread, bias reinforcement, echo chambers or political agenda setting. While knowledge graphs promise to provide the key to a Web of structured information, they are mainly focused on facts without keeping track of the diversity, connection or temporal evolution of online discourse data. As opposed to facts, claims are inherently more complex. Their interpretation strongly depends on the context and a vari...