Ali, W., Saleem, M., Yao, B. et al. A survey of RDF stores & SPARQL engines for querying knowledge graphs. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00711-3
Abstract: This survey paper provides a comprehensive review of techniques and systems for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine).
1 Introduction
Hundreds of SPARQL query services, called endpoints, have emerged on the Web [43], with the most popular endpoints receiving millions of queries per day [197,148]. We refer to engines that support storing, indexing and processing SPARQL (1.1) queries over RDF as SPARQL engines. Since SPARQL supports joins, we consider any SPARQL engine to also be an RDF store.
Join Processing. At the core of evaluating queries lie efficient methods for processing joins. Aside from traditional pairwise joins, recent years have seen the emergence of novel techniques, such as multiway and worst-case optimal joins, as well as GPU-based join processing. Optimizing the order of evaluation of joins can also be important to ensure efficient processing.
2 Literature Review
Sakr et al. [196] present three schemes for storing RDF data in relational databases, surveying works that use the different schemes. Svoboda et al. [221] provide a brief survey on indexing schemes for RDF divided into three categories: local, distributed and global. Faye et al. [70] focus on both storage and indexing schemes for local RDF engines, divided into native and non-native storage schemes.
Luo et al. [141] also focus on RDF storage and indexing schemes under the relational-, entity-, and graph-based perspectives in local RDF engines.
3 Preliminaries
3.1 RDF
...
3.2 SPARQL
We define the core of SPARQL in terms of basic graph patterns that express the core pattern matched against an RDF graph; navigational graph patterns that match arbitrary length paths; complex graph patterns that introduce various language features, such as OPTIONAL, UNION, MINUS, etc. [16]; and query types that specify what result to return.
Basic Graph Patterns (BGPs) At the core of SPARQL lie triple patterns, which are RDF triples that allow variables from the set V (disjoint with IBL) in any position. A basic graph pattern (BGP) is a set of triple patterns. Since blank nodes in BGPs act as variables, we assume they have been replaced with variables. We use vars(B) to denote the set of variables in the BGP B. Given an RDF graph G, the evaluation of a BGP B, denoted B(G), returns a set of solution mappings. A solution mapping μ is a partial mapping from the set V of variables to the set of RDF terms IBL. We write dm(μ) to denote the set of variables for which μ is defined. Given a triple pattern t, we use μ(t) to refer to the image of t under μ, i.e., the result of replacing any variable v ∈ dm(μ) appearing in t with μ(v). μ(B) stands for the image of the BGP B under μ, i.e., μ(B) := {μ(t) | t ∈ B}. The evaluation of a BGP B on an RDF graph G is then given as B(G) := {μ | μ(B) ⊆ G and dm(μ) = vars(B)}. In the case of a singleton BGP {t}, we may write {t}(G) as t(G).
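To make the definition concrete, the evaluation B(G) can be sketched as a naive backtracking procedure over a graph stored as a set of triples (an illustrative sketch with our own representation, not any engine's actual implementation):

```python
# Minimal sketch: evaluating a BGP over an RDF graph represented as a set of
# triples. Variables are strings starting with "?".

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def eval_bgp(bgp, graph, mu=None):
    """Return all solution mappings mu such that mu(bgp) is a subset of graph."""
    mu = mu or {}
    if not bgp:
        return [mu]
    t, rest = bgp[0], bgp[1:]
    results = []
    for triple in graph:
        ext, ok = dict(mu), True
        for pos, term in zip(triple, t):
            if is_var(term):
                if term in ext and ext[term] != pos:
                    ok = False
                    break
                ext[term] = pos            # bind the variable
            elif term != pos:              # constant mismatch
                ok = False
                break
        if ok:
            results.extend(eval_bgp(rest, graph, ext))
    return results

G = {(":Ann", ":knows", ":Bob"), (":Bob", ":knows", ":Cat")}
print(eval_bgp([("?x", ":knows", "?y"), ("?y", ":knows", "?z")], G))
```

A real engine would use indexes rather than scanning the whole graph per pattern, but the semantics are the same.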
Navigational Graph Patterns (NGPs) A key feature of graph query languages is the ability to match paths of arbitrary length [16]. In SPARQL (1.1), this ability is captured by property paths [92], which are regular expressions E that paths should match, defined recursively as follows:
– if p is an IRI, then p is a path expression (property);
– if e is a path expression, then ^e (inverse), e* (zero-or-more, aka. Kleene star), e+ (one-or-more), and e? (zero-or-one) are path expressions;
– if e1, e2 are path expressions, then e1/e2 (concatenation) and e1|e2 (disjunction) are path expressions;
– if P is a set of IRIs, then !P and !^P are path expressions (negated property sets).
The evaluation of path expressions on an RDF graph G returns pairs of nodes in G connected by paths that match the expression, as defined in Table 2. These path expressions are akin to 2-way regular path queries (2RPQs) extended with negated property sets [128,16]. We call a triple pattern (s, e, o) that further allows a path expression as the predicate (i.e., e ∈ E ∪ V) a path pattern. A navigational graph pattern (NGP) is then a set of path patterns. Given a navigational graph pattern N, let paths(N) := p(N) ∩ E denote the set of path expressions used in N. Given an RDF graph G and a set of path expressions E ⊆ E, we denote by G_E := G ∪ (⋃_{e∈E} {(s, e, o) | (s, o) ∈ e(G)}) the result of materializing all paths matching E in G. The evaluation of the navigational graph pattern N on G is then N(G) := {μ | μ(N) ⊆ G_paths(N) and dm(μ) = vars(N)}.
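As an illustration of how a zero-or-more expression p* can be evaluated, the following sketch computes the reflexive-transitive closure of a single property via breadth-first search (the representation and function names are our own assumptions):

```python
# Sketch: evaluating p* (zero-or-more) over a graph of (s, p, o) triples,
# returning the node pairs connected by a (possibly empty) path of p-edges.

from collections import deque

def eval_star(graph, prop):
    nodes = {s for s, _, _ in graph} | {o for _, _, o in graph}
    adj = {}
    for s, p, o in graph:
        if p == prop:
            adj.setdefault(s, []).append(o)
    pairs = {(n, n) for n in nodes}          # zero-length paths
    for start in nodes:                       # BFS from each node
        seen, queue = {start}, deque([start])
        while queue:
            n = queue.popleft()
            for m in adj.get(n, []):
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
                    pairs.add((start, m))
    return pairs

G = {(":a", ":p", ":b"), (":b", ":p", ":c"), (":b", ":q", ":d")}
print((":a", ":c") in eval_star(G, ":p"))   # reachable via :a/:p/:b/:p/:c
```

Engines typically avoid materializing the full closure when one endpoint of the path pattern is constant, searching from that endpoint only.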
Complex Graph Patterns (CGPs) Complex graph patterns (CGPs) introduce additional language features that can combine and transform the results of one or more graph patterns. More specifically, evaluating BGPs and NGPs returns solution mappings that can be viewed as relations (i.e., tables), where variables are attributes (i.e., column names) and tuples (i.e., rows) contain the RDF terms bound by each solution mapping (see Figures 2–4). CGPs support combining and transforming the results of BGPs/NGPs with language features that include FILTER (selection: σ), SELECT (projection: π), UNION (union: ∪), EXISTS (semi-join: ⋉), MINUS (anti-join: ▷) and OPTIONAL (left-join: ⟕). These language features correspond to the relational algebra defined in Table 3. The default operator is the natural (inner) join (⋈).
Named graphs SPARQL allows for querying multiple RDF graphs through the notion of a SPARQL dataset, defined as D := {G, (n1, G1), . . . , (nk, Gk)}, where G, G1, . . . , Gk are RDF graphs; n1, . . . , nk are pairwise distinct IRIs; G is known as the default graph; and each pair (ni, Gi) (for 1 ≤ i ≤ k) is known as a named graph. Letting N′, N′′ denote sets of IRIs, n′, n′′ IRIs, and v a variable, SPARQL then provides a number of features for querying different graphs:
– FROM N′, FROM NAMED N′′: activates a dataset with a default graph composed of the merge of all graphs G′ such that (n′, G′) ∈ D and n′ ∈ N′, and the set of all named graphs (n′′, G′′) ∈ D such that n′′ ∈ N′′;
– GRAPH n′: evaluates a graph pattern on the graph G′ if the named graph (n′,G′) is active;
– GRAPH v: takes the union of the evaluation of a graph pattern over each G′ such that (n′,G′) is active, binding v to n′ for each solution generated from G′;
Without FROM or FROM NAMED, the active dataset is the indexed dataset D. Without GRAPH, graph patterns are evaluated on the active default graph. Quad stores disallow empty named graphs, such that D := {G, (n1, G1), . . . , (nk, Gk)} is viewed as D = (G × {⋆}) ∪ (⋃_{(ni,Gi)∈D} Gi × {ni}), i.e., a set of quads using ⋆ ∉ IBL as a special symbol for the default graph. In this case, a quad (s, p, o, n) denotes a triple (s, p, o) in the default graph if n = ⋆, or a triple in the named graph G′ such that (n, G′) ∈ D if n ∈ I. We can define CGPs involving quad patterns analogously.
4 Storage
4.1 Triple table
** One "relational table" for all triples or quads **
A triple table stores an RDF graph G as a single ternary relation. Figure 1 shows an RDF graph with its triple table on the right-hand side. One complication when storing triple tables in relational databases is that such systems assume a column to have a single type, which may not be true for RDF objects in particular; a workaround is to store a string encoding of the terms, though this may complicate their ordering.
Rather than storing full RDF terms in the triple table, stores may apply dictionary encoding, where RDF terms are mapped one-to-one with numeric object identifiers (OIDs), with OIDs being stored in the table and decoded using the dictionary as needed. Since OIDs consume less memory and are faster to process than strings, such an approach works better for queries that involve many intermediate results but generate few final results; on the other hand, such an approach suffers when queries are simple and return many results, or when selective filters are specified that require decoding the term before filtering. To find a better trade-off, some RDF engines (e.g., Jena 2 [241]) only use OIDs for strings with lengths above a threshold.
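A minimal sketch of dictionary encoding (the class and method names are illustrative, not any engine's API):

```python
# Sketch: RDF terms are mapped one-to-one to numeric OIDs; the triple table
# stores OIDs, and the dictionary decodes them back to terms as needed.

class Dictionary:
    def __init__(self):
        self.term_to_oid, self.oid_to_term = {}, []

    def encode(self, term):
        if term not in self.term_to_oid:          # assign the next free OID
            self.term_to_oid[term] = len(self.oid_to_term)
            self.oid_to_term.append(term)
        return self.term_to_oid[term]

    def decode(self, oid):
        return self.oid_to_term[oid]

d = Dictionary()
encoded = [tuple(d.encode(t) for t in triple)
           for triple in [(":Ann", ":knows", ":Bob"), (":Bob", ":knows", ":Ann")]]
print(encoded)                  # [(0, 1, 2), (2, 1, 0)]
print(d.decode(encoded[0][0]))  # ":Ann"
```

Note how the shared terms are stored once: fixed-width integer triples compress and compare far more cheaply than variable-length strings.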
The most obvious physical storage is to store triples contiguously (row-wise). This allows for quickly retrieving the full triples that match (e.g.) a given triple pattern. However, some RDF engines based on relational storage (e.g., Virtuoso [69]) rather use (or provide an option for) column-wise storage, where the values along a column are stored contiguously, often following a particular order. Such column-wise storage allows for better compression, and for quickly reading many values from a single column.
Triple tables can be straightforwardly extended to quad tables in order to support SPARQL datasets [69,91].
4.2 Vertical partitioning
** One "relational table" per predicate; if a data type can be defined for the object of each predicate, i.e., the schema is defined up front, this solves the problem of the single table for all triples **
The vertical partitioning approach [1] uses a binary relation for each property p ∈ p(G) whose tuples encode subject–object pairs for that property. ... Physical storage can again use OIDs, row-based or column-based storage, etc. When compared with triple tables, vertical partitioning generates relations with fewer rows, and more specific domains for columns ... However, triple patterns with variable predicates may require applying a union on all relations. Also, RDF graphs may have thousands of properties [233], which may lead to a schema with many relations. Vertical partitioning can be used to store quads by adding a Graph column to each table [69,91].
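The scheme can be sketched as follows (an illustrative sketch, not the storage layout of any particular system): one subject–object table per property, with a variable-predicate lookup requiring a union over all tables:

```python
# Sketch: vertical partitioning stores one (subject, object) table per
# property; a constant-predicate lookup touches one table, while a
# variable-predicate pattern requires a union over all tables.

from collections import defaultdict

def partition(graph):
    tables = defaultdict(list)
    for s, p, o in graph:
        tables[p].append((s, o))
    return tables

G = [(":Ann", ":age", 25), (":Bob", ":age", 30), (":Ann", ":knows", ":Bob")]
tables = partition(G)
print(tables[":age"])   # [(':Ann', 25), (':Bob', 30)]
print(len(tables))      # one table per property
```

The `:age` table can now be typed as integer, illustrating the "more specific domains for columns" mentioned above.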
4.3 Extended vertical partitioning
S2RDF [204] uses extended vertical partitioning based on semi-join reductions (we recall from Table 3 that a semi-join M1 ⋉ M2, aka. FILTER EXISTS, returns the tuples in M1 that are "joinable" with M2).
4.4 Property table
** One table with all the properties for each type of vertex or edge, i.e., the schema is defined up front **
Property tables aim to emulate the n-ary relations typical of relational databases. A property table usually contains one subject column, and n further columns to store objects for the corresponding properties of the given subject. The subject column then forms a primary key for the table. The tables to define can be based on classes, clustering [184], coloring [36], etc., to group subjects with common properties.
Property tables can store and retrieve multiple triples with a given subject as one tuple (e.g., to find people with age < 30 and interest = :SW) without needing joins. Property tables often store terms of the same type in the same column, enabling better compression. Complications arise for multi-valued (. . . -to-many) or optional (zero-to-. . . ) properties.
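A small sketch of the idea (values and column names are illustrative): with one row per subject, the star query from above needs no joins:

```python
# Sketch: a property table keyed on subject. The star query
# "age < 30 and interest = :SW" is answered row-by-row, without joins.

people = {                       # subject -> row of property columns
    ":Ann": {"age": 25, "interest": ":SW"},
    ":Bob": {"age": 41, "interest": ":SW"},
    ":Cat": {"age": 28, "interest": ":DB"},
}

matches = [s for s, row in people.items()
           if row["age"] < 30 and row["interest"] == ":SW"]
print(matches)   # [':Ann']
```

A multi-valued property (e.g., two interests) or a missing age would break this flat layout, which is precisely the complication noted above.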
4.5 Graph-based storage
While the previous three storage mechanisms rely on relational storage, graph-based storage is adapted specifically for the graph-based model of RDF. Key characteristics of such models that can be exploited for storage include the adjacency of nodes, the fixed arity of graphs, etc. Graphs have bounded arity (3 for triples, 4 for quads), which can be exploited for specialized storage.
5 Indexing
Indexing enables efficient lookup operations on RDF graphs (i.e., O(1) or O(log |G|) time to return the first result or an empty result). The most common such operation is to find triples that match a given triple pattern. However, indexes can also be used to match non-singleton BGPs (with more than one triple pattern), to match path expressions, etc.
5.1 Triple indexes
The goal of triple indexes is to efficiently find triples matching a triple pattern. Letting s, p, o denote RDF terms and ?s, ?p, ?o variables, there are 2^3 = 8 abstract patterns: (s, p, o), (s, p, ?o), (s, ?p, o), (s, ?p, ?o), (?s, p, o), (?s, p, ?o), (?s, ?p, o) and (?s, ?p, ?o).
When a storage scheme such as vertical partitioning is used [1], only the four patterns where the predicate is constant can be efficiently supported (by indexing the subject and object columns). If the RDF graph is stored as a (binary) adjacency matrix for each property [21,163], again only constant-predicate patterns can be efficiently supported. ...
** (?s, p, ?o), (s, p, ?o), (?s, p, o), (s, p, o) **
Otherwise, in triple tables, or similar forms of graph-based storage, all triple patterns can be efficiently supported with triple permutations.
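For intuition, a sorted permutation supports any pattern whose constants form a prefix of its collation order; the sketch below (our own, using binary search over sorted tuples) answers constant-predicate patterns from a POS permutation:

```python
# Sketch: a sorted permutation of the triple table answers a triple pattern
# whose constants form a prefix of the permutation's order, via binary search.

from bisect import bisect_left, bisect_right

def make_index(graph, order):
    perm = ["spo".index(c) for c in order]          # e.g. "pos" -> [1, 2, 0]
    return sorted(tuple(t[i] for i in perm) for t in graph)

def prefix_lookup(index, prefix):
    lo = bisect_left(index, prefix)
    # pad with a maximal sentinel so the range covers all continuations
    hi = bisect_right(index, prefix + (chr(0x10FFFF),) * (3 - len(prefix)))
    return index[lo:hi]

G = {(":Ann", ":knows", ":Bob"), (":Ann", ":age", "25"), (":Bob", ":age", "30")}
pos = make_index(G, "pos")                 # serves (?s, p, ?o) and (?s, p, o)
print(prefix_lookup(pos, (":age",)))       # all (p, o, s) tuples with p = :age
```

Three well-chosen permutations (e.g., SPO, POS, OSP) suffice to give every one of the eight abstract patterns a matching prefix; many engines maintain more to also control the order in which results are produced.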
5.2 Entity-based indexes
Entity-based indexes optimize graph patterns that "center on" a particular entity. BGPs can be reduced to joins over their triple patterns; for example, {(x, p, y), (y, q, z)}(G) = {(x, p, y)}(G) ⋈ {(y, q, z)}(G). Star joins are frequently found in BGPs, defined to be a join on a common subject, e.g., {(w, p, x), (w, q, y), (w, r, z)}. Star joins may sometimes also include S–O joins on the common variable, e.g., {(w, p, x), (w, q, y), (z, r, w)} [142]. Star joins retrieve data surrounding a particular entity (in this case w). Entity-based indexes permit efficient evaluation of such joins.
** A particular type of BGP; the benchmark in [Hernández et al. 2016] explored it and called it a snowflake, with depth d=1 and d=2 **
Property tables can enable efficient star joins so long as the relevant tables can be found efficiently and there are indexes on the relevant columns (e.g., for p, q and/or r).
Property tables are complicated by multi-valued properties, missing values, etc. A more flexible approach is to index signatures of entities, which are bit vectors encoding the property–value pairs of the entity. One such example is the vertex signature tree of gStore [263], which encodes all outgoing (p, o) pairs for a given entity s into a bit vector akin to a Bloom filter, and indexes these bit vectors hierarchically, allowing for fast, approximate containment checks that quickly find candidate entities for a subset of such pairs. GRaSS [142] further optimizes for star subgraphs that include both outgoing and incoming edges on entities, where a custom FDD-index allows for efficient retrieval of the subgraphs containing a triple that matches a triple pattern.
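The signature idea can be sketched as follows (the parameters and hash scheme are illustrative simplifications, not gStore's actual structure): each (p, o) pair sets a bit, and a bitwise containment test yields candidates with no false negatives:

```python
# Sketch of Bloom-filter-style entity signatures: each outgoing (p, o) pair
# sets one bit; a query signature contained in an entity signature marks the
# entity as a candidate (false positives possible, false negatives not).

def signature(pairs, bits=64):
    sig = 0
    for pair in pairs:
        sig |= 1 << (hash(pair) % bits)
    return sig

def may_contain(entity_sig, query_sig):
    return entity_sig & query_sig == query_sig   # all query bits present?

ann = signature([(":age", "25"), (":interest", ":SW")])
query = signature([(":interest", ":SW")])
print(may_contain(ann, query))   # True: :Ann is a candidate
```

Candidates still need to be verified against the actual data; the hierarchical tree in gStore serves to prune whole groups of entities with one such check.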
5.3 Property-based indexes
Returning to the star join {(w, p, x), (w, q, y), (w, r, z)}, another way to quickly return candidate bindings for the variable w is to index nodes according to their adjacent properties; then we can find nodes that have at least the adjacent properties p, q, r.
5.4 Path indexes
A path join involves successive S–O joins between triple patterns; e.g., {(w, p, x), (x, q, y), (y, r, z)}, where the start and end nodes (w, z) may be variables or constants. While path joins have fixed length, navigational graph patterns may further match arbitrary length paths. A number of indexing approaches have been proposed to speed up querying paths.
A path can be seen as a string of arbitrary length; e.g., a path {(w, p, x), (x, q, y), (y, r, z)} can be seen as a string wpxqyrz$, where $ indicates the end of the string; alternatively, if intermediate nodes are not of importance, the path could be represented as the string wpqrz$. The Yaanii system [47] builds an index of paths of the form wpxqyrz$ that are clustered according to their template of the form wpqrz$. Paths are then indexed in B+trees, which are partitioned by template. Fletcher et al. [72] also index paths in B+trees, but rather than partition paths, they apply a maximum length of at most k for the paths included. Text indexing techniques can also be applied for paths (viewed as strings).
Gubichev et al. [80] use a path index of directed graphs, called FERRARI [210], for each property in an RDF graph. First, a condensed graph is computed by merging nodes of strongly connected components into one "supernode"; adding an artificial root node (if one does not exist), the result is a directed acyclic graph (DAG) that preserves reachability. A spanning tree – a subgraph that includes all nodes and is a tree – of the DAG is computed and labeled with its postorder. All subtrees thus have contiguous identifiers, where the maximum identifies the root; e.g., in Figure 14, the subtree at :AI has the interval [1, 3], where 3 identifies the root. Then there exists a (directed) path from x to y if and only if y is in the subtree interval for x. Nodes in a DAG may, however, be reachable through paths not in the spanning tree.
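The interval labeling on the spanning tree can be sketched as follows (a simplified illustration of the postorder scheme, not FERRARI's actual implementation, which must also handle the non-tree edges noted above):

```python
# Sketch: label each tree node with [low, post], where post is its postorder
# number; y is reachable from x through the tree iff post(y) lies in x's
# interval. The tree below mimics a small class hierarchy.

def label(tree, root):
    intervals, counter = {}, [0]
    def visit(n):
        low = counter[0] + 1                 # smallest postorder in subtree
        for child in tree.get(n, []):
            visit(child)
        counter[0] += 1                      # n's own postorder number
        intervals[n] = (low, counter[0])
    visit(root)
    return intervals

tree = {":CS": [":AI", ":DB"], ":AI": [":ML", ":KR"]}
iv = label(tree, ":CS")

def reaches(x, y):
    lo, hi = iv[x]
    return lo <= iv[y][1] <= hi              # is post(y) inside x's interval?

print(iv[":AI"])                             # (1, 3): matches the [1, 3] above
print(reaches(":CS", ":ML"), reaches(":DB", ":ML"))
```

A single comparison thus replaces a graph traversal; FERRARI additionally compacts and approximates these intervals to trade precision for space.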
5.5 Join indexes
The results of joins can also be indexed. Groppe et al. [78] proposed to construct 6 × 2^4 = 96 indexes for the 6 types of non-symmetric joins between two triple patterns (S–S, S–P, S–O, P–P, P–O, O–O). Hash maps are used to cover the 2^4 = 16 combinations of the four remaining elements (not considering the join variable). Given the high space cost, only frequently encountered joins are sometimes indexed [55,152].
5.6 Structural indexes
Another family of indexes – known as structural indexes [141] – rely on a high-level summary of the RDF graph.
5.7 Quad indexes
Most quad indexes follow the triple index scheme [243,94,91,69], extending it to add another element. The number of abstract index patterns then grows to 2^4 = 16, with 4! = 24 potential permutations, ... A practical compromise is to maintain a selection of permutations that cover the most common patterns [69]; for example, a quad pattern that is uncommon in practice can still be supported reasonably well by evaluating a covering permutation (e.g., replacing the constant graph name g with a variable ?g) and filtering the results on ?g = g.
Each graph is indexed by hashing all seven abstract patterns on triples with at least one constant, generating seven pattern vectors for each graph. For example, a triple (s, p, o) in a graph named g will be hashed as (s, p, o), (s, p, ?), (s, ?, o), (?, p, o), (s, ?, ?), (?, p, ?), (?, ?, o), where ? is an arbitrary fixed token, and each result will be added to one of the seven pattern vectors for g for that abstract pattern. Basic graph patterns can be encoded likewise, where locality-sensitive hashing is then used to group and retrieve similar pattern vectors for a given basic graph pattern.
5.8 Miscellaneous Indexing
...
5.9 Discussion
While indexing triples or quads is conceptually the most straightforward approach, a number of systems have shown positive results with entity- and property-based indexes that optimize the evaluation of star joins, path indexes that optimize the evaluation of path joins, or structural indexes that allow for identifying query-relevant regions of the graph. Different indexing schemes often have different time–space trade-offs: more comprehensive indexes enable faster queries at the cost of space and more costly updates.
** Trade off igual ao Relational, como esperado **
6 Join Processing
RDF stores employ diverse query processing strategies, but all require translating the logical operators that represent the query into "physical operators" that implement algorithms for efficiently evaluating the operation. The most important such operators – as we now discuss – are natural joins.
6.1 Pairwise join algorithms
... The relational algebra – including joins – can then be used to combine or transform the results of one or more BGPs, giving rise to CGPs. The core of evaluating graph patterns is thus analogous to processing relational joins. The simplest and most well-known such algorithms perform pairwise joins; for example, a pairwise strategy for computing {t1, . . . , tn}(G) may evaluate ((t1(G) ⋈ t2(G)) ⋈ . . .) ⋈ tn(G).
Without loss of generality, we assume a join of two graph patterns P1(G) ⋈ P2(G), where the join variables are denoted by V = {v1, . . . , vn} = vars(P1) ∩ vars(P2). Well-known algorithms for performing pairwise joins include (index) nested-loop joins, where P1(G) ⋈ P2(G) is reduced to evaluating ⋃_{μ∈P1(G)} ({μ} ⋈ μ(P2)(G)); hash joins, where each solution μ ∈ P1(G) is indexed by hashing on the key (μ(v1), . . . , μ(vn)) and thereafter a key is computed likewise for each solution in P2(G) to probe the index with; and (sort-)merge joins, where P1(G) and P2(G) are (sorted if necessary and) read in the same order with respect to V, allowing the join to be computed by a merge. Index nested-loop joins tend to perform well when |P1(G)| ≪ |P2(G)| (assuming that μ(P2)(G) can use indexes), since they do not require reading all of P2(G). Otherwise hash or merge joins can perform well [168].
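For concreteness, the hash join just described can be sketched over solution mappings represented as dictionaries (an illustrative sketch, not an engine's operator):

```python
# Sketch of a hash join between two sets of solution mappings on their shared
# variables: build a hash table on one side, probe it with the other.

def hash_join(left, right):
    join_vars = sorted(set(left[0]) & set(right[0])) if left and right else []
    key = lambda mu: tuple(mu[v] for v in join_vars)
    build = {}
    for mu in left:                      # build phase on the left operand
        build.setdefault(key(mu), []).append(mu)
    out = []
    for mu2 in right:                    # probe phase with the right operand
        for mu1 in build.get(key(mu2), []):
            out.append({**mu1, **mu2})   # compatible: merge the mappings
    return out

P1 = [{"?x": ":Ann", "?y": ":Bob"}, {"?x": ":Bob", "?y": ":Cat"}]
P2 = [{"?y": ":Bob", "?z": "30"}]
print(hash_join(P1, P2))   # [{'?x': ':Ann', '?y': ':Bob', '?z': '30'}]
```

In practice the smaller operand is chosen as the build side, so the hash table fits in memory.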
6.2 Multiway joins
Multiway join algorithms exploit the commutativity and associativity of joins to evaluate two or more operands at once. For example, in order to compute {t1, . . . , tn}(G), a multiway join algorithm may evaluate (t1(G) ⋈ . . . ⋈ tk(G)) ⋈ (tk+1(G) ⋈ . . . ⋈ tn(G)) where k ≥ 2, or it may even simply evaluate everything at once as t1(G) ⋈ . . . ⋈ tn(G).
Some of the previous storage and indexing schemes we have seen lend themselves naturally to processing certain types of multiway joins in an efficient manner. Entity-based indexes allow for processing star joins efficiently, while path indexes allow for processing path joins efficiently (see Section 5). A BGP can be decomposed into sub-BGPs that can be evaluated per the corresponding multiway join, with pairwise joins being applied across the sub-BGPs; ...
6.3 Worst case optimal joins
A new family of join algorithms has arisen due to the AGM bound [22], which puts an upper bound on the number of solutions that can be returned from relational join queries. The result can be adapted straightforwardly to the case of BGPs. Let B = {t1, . . . , tn} denote a BGP with vars(B) = V. Now define a fractional edge cover as a mapping λ : B → [0, 1] that assigns a real value in the interval [0, 1] to each triple pattern of B such that for all v ∈ V, it holds that ∑_{t ∈ B_v} λ(t) ≥ 1, where B_v denotes the set of triple patterns in B that mention v. The AGM bound tells us that if B has the fractional edge cover λ, then for any RDF graph G it holds that |B(G)| ≤ ∏_{i=1}^{n} |ti(G)|^{λ(ti)}; this bound is "tight".
** I did not understand what the AGM bound would be as the complexity of the join algorithm **
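As a worked example (ours, not from the survey): for the triangle BGP B = {(?x, p, ?y), (?y, q, ?z), (?z, r, ?x)}, assigning λ(t) = 1/2 to each pattern is a valid fractional edge cover, since every variable appears in two patterns (1/2 + 1/2 ≥ 1). The bound is then the product of the square roots of the pattern cardinalities:

```python
# Worked example of the AGM bound for the triangle BGP, with each triple
# pattern matching N triples: the bound is N^(1/2) * N^(1/2) * N^(1/2)
# = N^(3/2), strictly below the N^2 that a pairwise plan can produce.

N = 10_000
agm_bound = (N ** 0.5) ** 3     # N^(3/2) solutions at most
pairwise_bound = N * N          # a first pairwise join alone may reach N^2
print(agm_bound, pairwise_bound)
```

A worst-case optimal algorithm is one whose running time is bounded by this quantity, i.e., by the largest output the query could possibly have, rather than by the size of intermediate results.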
Recently, join algorithms have been proposed that can enumerate the results for a BGP B over a graph G in time O(agm(B,G)), where agm(B,G) denotes the AGM bound of B over G. Since such an algorithm must at least spend O(agm(B,G)) time writing the results in the worst case, such algorithms are deemed worst-case optimal (wco) [169].
Though such algorithms were initially proposed in a relational setting [169,230], they have recently been adapted for processing joins over RDF graphs [115,99,163,19]. Note that traditional pairwise join algorithms are not wco. If we try to evaluate {t1, t2}(G) by a pairwise join, for example, in order to later join it with t3(G), the AGM bound becomes quadratic as λ(t1) = λ(t2) = 1, and thus we have the bound |t1(G)| · |t2(G)|, which exceeds the AGM bound for B.
Wco join algorithms – including Leapfrog Triejoin (LTJ) [230] – perform a multiway join that resolves a BGP B variable-by-variable rather than pattern-by-pattern. First an ordered sequence of variables is selected; say (?a, ?b, ?c). Then the set of partial solutions M{?a} = {μ | dm(μ) = {?a} and μ(B?a)(G) ≠ ∅} is computed for the first variable ?a, such that each image of B?a under μ ∈ M{?a} has some solutions for G; e.g., M{?a} = {{?a/:DB}, {?a/:IR}, {?a/:SW}, {?a/:Web}} in Figure 18, since replacing ?a in B?a with :DB, :IR, :SW or :Web yields a BGP with solutions over G. Next we compute M{?a,?b} = {μ ∪ μ′ | μ ∈ M{?a}, dm(μ′) = {?b} and μ′(μ(B?b))(G) ≠ ∅}, "eliminating" the next variable ?b. In the example of Figure 18, M{?a,?b} = {{?a/:DB, ?b/:CS}, . . . , {?a/:Web, ?b/:CS}}, where each solution μ ∈ M{?a} is extended with {?b/:CS}. Finally, M{?a,?b,?c} is computed analogously, eliminating the last variable and yielding the five results seen in Figure 18.
** This leapfrog (LF) algorithm was implemented in MillenniumDB **
To be wco-compliant, the algorithm must always be able to efficiently compute M{v}, i.e., solutions μ with dm(μ) = {v} such that μ(Bv)(G) ≠ ∅. To compute M{?a} in the running example, we need to efficiently intersect all nodes with an outgoing s:b edge and an incoming s:r edge. This is typically addressed by being able to read the results of a triple pattern, in sorted order, for any variable, which enables efficient intersection by allowing to seek ahead to the maximum current value over all triple patterns involving a given variable. Jena-LTJ [99], which implements an LTJ-style join algorithm for SPARQL, enables this by maintaining all six index permutations over triples, while Ring [19] requires only one permutation. Wco algorithms often outperform traditional join algorithms for complex BGPs [115,99].
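The seek-to-maximum intersection at the core of LTJ can be sketched as follows (an illustrative simplification over plain sorted lists of node identifiers, standing in for sorted index scans):

```python
# Sketch of leapfrog intersection: every iterator seeks ahead to the current
# maximum value; when all iterators agree, that value is in the intersection.

from bisect import bisect_left

def leapfrog_intersect(sorted_lists):
    positions = [0] * len(sorted_lists)
    out = []
    while all(p < len(l) for p, l in zip(positions, sorted_lists)):
        vals = [l[p] for p, l in zip(positions, sorted_lists)]
        hi = max(vals)
        if min(vals) == hi:                      # all iterators agree
            out.append(hi)
            positions = [p + 1 for p in positions]
        else:                                    # seek lagging iterators to hi
            positions = [bisect_left(l, hi, p)
                         for p, l in zip(positions, sorted_lists)]
    return out

print(leapfrog_intersect([[1, 3, 5, 7], [3, 4, 5], [2, 3, 5, 9]]))  # [3, 5]
```

The seeks skip over runs of values that cannot be in the result, which is what makes the intersection cheap even when the lists are large but their overlap is small.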
6.4 Translations to linear algebra
Per Section 4.6, dictionary-encoded RDF graphs are sometimes represented as a bit tensor, or as a bit matrix for each property (see Figure 10), etc. Viewed in this light, some query algebra can then be reduced to linear algebra [156]; for example, joins become matrix/tensor multiplication.
Translating joins into linear algebra enables hardware acceleration, particularly involving GPUs and HPC architectures, which can process tensors with high levels of parallelism.
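As a small illustration (with our own encoding assumptions): given one adjacency matrix per property over dictionary-encoded nodes, an S–O path join p/q reduces to a matrix product:

```python
# Sketch: boolean join as matrix multiplication. P[i, j] = 1 iff there is a
# p-edge from node i to node j; (P @ Q)[i, j] > 0 iff some node k has
# P[i, k] and Q[k, j], i.e., iff i and j are joined by the path p/q.

import numpy as np

nodes = [":Ann", ":Bob", ":Cat"]          # dictionary-encoded as 0, 1, 2
P = np.zeros((3, 3), dtype=np.int8)       # adjacency matrix for property p
Q = np.zeros((3, 3), dtype=np.int8)       # adjacency matrix for property q
P[0, 1] = 1                               # :Ann -p-> :Bob
Q[1, 2] = 1                               # :Bob -q-> :Cat

R = (P @ Q) > 0                           # node pairs joined by p/q
print([(nodes[i], nodes[j]) for i, j in zip(*np.nonzero(R))])
```

Real graphs make these matrices extremely sparse, so sparse formats (or tensor decompositions) are used rather than dense arrays; the algebraic view is the same.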
6.5 Join reordering
The order of join processing can have a dramatic effect on computational costs.
A good plan depends not only on the query, but also the graph. Selecting a good plan thus typically requires some assumptions or statistics over the graph. As in relational settings, the most important information relates to cardinalities: how many (distinct) solutions a given pattern returns; and/or selectivity: what percentage of solutions are kept when restricting variables with constants or filters. Statistics can be used not only to select an ordering for joins, but also to decide which join algorithm to apply.
While cardinality and selectivity estimates can be managed in a similar way to relational database optimizers, a number of approaches have proposed custom statistics for RDF. Stocker et al. [215] collect statistics relating to the number of triples, the number of unique subjects, and for each predicate, the number of triples and a histogram of associated objects. RDF-3X [168] uses a set of aggregated indexes, which store the cardinality of all triple patterns with one or two constants. RDF-3X [168] further stores the exact cardinality of frequently encountered joins, while characteristic sets [165] and extended characteristic sets [155] (discussed in Section 5.3) capture the cardinality of star joins.
Computing and maintaining such statistics incur costs in terms of space and updates. An alternative is to apply sampling while evaluating the query.
Another alternative is to use syntactic heuristics for reordering. Stocker et al. [215] propose heuristics such as assuming that triple patterns with fewer variables have lower cardinality, that subject constants are more selective than objects and predicates, etc. Tsialiamanis et al. [227] further propose to prioritize rarer joins (such as P–S and P–O joins), and to consider literals as more selective than IRIs. Taking into account such heuristics and statistics, the simplest strategy to try to find a good join ordering is to apply a greedy metaheuristic [215,155], starting with the triple pattern t1 estimated to have the lowest cardinality, and joining it with the triple pattern t2 with the next lowest cardinality; typically a constraint is added such that tn (n > 1) should have a variable in common with some triple pattern in {t1, . . . , tn−1} to avoid costly Cartesian products.
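The greedy strategy can be sketched as follows (the cardinality estimates are illustrative; a real engine would draw them from statistics or the syntactic heuristics above):

```python
# Sketch of greedy join ordering: start from the pattern with the lowest
# estimated cardinality, then repeatedly add the cheapest pattern that shares
# a variable with those already chosen, avoiding Cartesian products.

def greedy_order(patterns, card):
    remaining = list(patterns)
    order = [min(remaining, key=card)]
    remaining.remove(order[0])
    while remaining:
        joined_vars = {v for t in order for v in t if v.startswith("?")}
        connected = [t for t in remaining
                     if any(v in joined_vars for v in t if v.startswith("?"))]
        nxt = min(connected or remaining, key=card)  # fall back if disconnected
        order.append(nxt)
        remaining.remove(nxt)
    return order

est = {("?x", ":name", "?n"): 10_000,        # illustrative cardinalities
       ("?x", ":citizenOf", ":Chile"): 500,
       ("?x", ":type", ":Person"): 50_000}
print(greedy_order(list(est), est.get))
# the selective :citizenOf pattern comes first; the rest join on ?x
```

Greedy ordering is cheap but can be misled by correlated patterns, which is why the dynamic-programming and metaheuristic approaches below are also used.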
More generally, reordering joins is an optimization problem, where classical methods from the relational literature can be leveraged likewise for BGPs, including dynamic programming [209] (used, e.g., by [94,168,82]) and simulated annealing [106] (used, e.g., by [231]). Other metaheuristics that have been applied for join reordering in BGPs include genetic algorithms [102] and ant colony systems [101,114].
** Statistics and heuristics, as in the relational setting **
6.6 Caching
Another possible route for optimization – based on the observation that queries in practice may feature overlapping or similar patterns – is to reuse work done previously for other queries. Specifically, we can consider caching the results of queries. In order to increase cache hit rates, we can further try to reuse the results of subqueries, possibly generalizing them to increase usability. Ideally the cache should store solutions for subqueries that (a) have a high potential to reduce the cost of future queries; (b) can reduce costs for many future queries; (c) do not have a high space overhead; and (d) will remain valid for a long time.
** As in the relational setting **
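One ingredient of such a cache can be sketched as follows (illustrative only: full canonicalization of BGPs amounts to graph isomorphism, so this simple renaming handles only syntactic variants with the same pattern order):

```python
# Sketch: caching BGP results under a canonical key, so that queries differing
# only in variable names hit the same cache entry.

def canonical_key(bgp):
    names = {}
    def canon(term):
        if term.startswith("?"):                     # rename ?x, ?y, ... to
            names.setdefault(term, f"?v{len(names)}")  # ?v0, ?v1, ... in order
            return names[term]
        return term
    return tuple(sorted(tuple(canon(t) for t in tp) for tp in bgp))

cache = {}
b1 = [("?x", ":knows", "?y")]
b2 = [("?a", ":knows", "?b")]
cache[canonical_key(b1)] = ["...solutions..."]
print(canonical_key(b2) in cache)   # True: variable renaming is normalized
```

Invalidation on updates (criterion (d) above) is the hard part in practice: a cached entry is only safe while no triple matching its patterns has changed.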
6.7 Discussion
Techniques for processing BGPs are often based on techniques for processing relational joins. Beyond standard pairwise joins, multiway joins can help to emulate some of the benefits of property table storage by evaluating star joins more efficiently. Another recent and promising approach is to apply wco join algorithms whose runtime is bounded theoretically by the number of results that the BGP could generate.
....
7 Query Processing
7.1 Relational algebra (beyond joins)
Complex (navigational) graph patterns (C(N)GPs) introduce additional relational operators beyond joins.
Like in relational databases, algebraic rewriting rules can be applied over CGPs in SPARQL to derive equivalent but more efficient plans.
7.2 Property paths
Navigational graph patterns (NGPs) extend BGPs with property paths, which are extensions of (2)RPQs that allow for matching paths of arbitrary length in the graph.
7.3 Recursion
Property paths offer a limited form of recursion. While extended forms of property paths have been proposed to include (for example) path intersection and difference [71], more general extensions of SPARQL have also been proposed to support graph-based and relation-based recursion.
7.4 Analytics
SPARQL engines often focus on transactional (OLTP) workloads involving selective queries that are efficiently solved through lookups on indexes. Recently, however, a number of approaches have looked at addressing analytical (OLAP) workloads for computing slices, aggregations, etc. [45].
One approach is to rewrite SPARQL queries to languages executable in processing environments suitable for analytical workloads, ... Conversely, one can also translate from analytical languages to SPARQL queries, allowing for in-database analytics, where analytical workloads are translated into queries run by the SPARQL engine/database.
There has also been growing interest in combining graph analytics – such as centrality measures, shortest paths, graph clustering, etc. – with SPARQL. In this way, SPARQL can be used as a declarative language to construct sub-graphs over which analytics are applied, and can further express queries involving the results of analytics. Unlike OLAP-style analytics, graph analytics often require recursion. One approach is to extend SPARQL to include imperative functions for invoking common graph algorithms.
7.5 Graph query rewriting
We have seen approaches that rewrite SPARQL queries into languages such as SQL [69,252], PigLatin [202,193], etc. Other works rewrite SPARQL into the query languages of (other) graph databases. SPARQL–Gremlin [223] rewrites SPARQL to Gremlin, allowing SPARQL queries to be evaluated on graph database engines that support Gremlin, while Semantic Property Graph [190] describes how reified RDF graphs can be projected into the property graph model supported by many graph database engines.
** Reification only from the language aspect, not from the model **
7.6 Multi-query optimization
While the techniques discussed thus far optimize queries individually, multi-query optimization evaluates batches of queries efficiently by exploiting their commonalities. Le et al. [133] propose to first cluster a set of queries into groups with maximal common edge subgraphs; for example, the BGP {(w1, p, x1), (w1, q, y1), (w1, r, z1), (y1, s, z1)} and the BGP {(w2, p, x2), (w2, q, y2), (z2, r, w2)} may form a cluster. A query is then constructed for each cluster by extending its maximal common sub-BGP with optional patterns needed by a proper subset of the queries ...
7.7 Discussion
...
8 Partitioning
In distributed RDF stores and SPARQL engines, the data are partitioned over a cluster of machines in order to enable horizontal scale, where additional machines can be allocated to the cluster to handle larger volumes of data. However, horizontal scaling comes at the cost of network communication. Thus a key optimization is to choose a partitioning scheme that reduces communication costs by enforcing various forms of locality, principally allowing certain types of (intermediate) joins to be processed on each individual machine [8].
** Horizontal partitioning strategies for graphs **
8.1 Triple/Quad-based Partitioning
A first option is to partition based on individual triples or quads without considering the rest of the graph. For simplicity we will speak about triples as the discussion generalizes straightforwardly to quads. The simplest option is to use round robin or random partitioning, which effectively places triples on an arbitrary machine. This ensures even load balancing, but does not support any locality of processing, and does not allow for finding the particular machine storing triples that match a given pattern.
** Simple and inefficient **
An alternative is to partition according to a deterministic function over a given key; for example, a partition key of S considers only the subject, while a partition key of PO considers both the predicate and object. Later, given a triple pattern that covers the partition key (e.g., with a constant subject if the key is S), we can find the machine(s) storing all triples that match that pattern. ... Range-based partitioning assigns a range over the partition key to each machine, where the example of Figure 20 splits S into [a:1,a:3], [a:4,a:6], [b:1,b:3], [c:1,d:1]. This approach allows range-based queries to be pushed to one machine, but requires maintaining a mapping of ranges to machines, and can be complicated to keep balanced. An alternative is hash-based partitioning, where we compute the hash of the partition key modulo the number of machines; the second example of Figure 20 splits P by hash. This does not require storing any mapping, and techniques such as consistent hashing can be used to rebalance load when a machine enters or leaves; however, if partition keys are skewed (e.g., one predicate is very common), it may lead to an unbalanced partition.
** More complex but feasible, and prone to imbalance. Implies redistribution when a machine fails **
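A minimal sketch of hash-based partitioning, assuming a hypothetical cluster of three machines. The built-in `hash()` is salted per process, so a deterministic digest is used instead; in practice consistent hashing would replace the plain modulo to ease rebalancing.

```python
import hashlib

NUM_MACHINES = 3  # hypothetical cluster size

def stable_hash(value):
    """Deterministic string hash (Python's hash() is salted per process)."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def machine_for(triple, key="S", num_machines=NUM_MACHINES):
    """Route a triple by hashing its partition key modulo the machine count."""
    s, p, o = triple
    parts = {"S": s, "P": p, "O": o, "PO": p + "\x00" + o}[key]
    return stable_hash(parts) % num_machines
```

Given a triple pattern whose constants cover the partition key (e.g., a constant subject under key S), the same function routes the lookup to the single machine that can hold matching triples.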
A third option is to apply a hierarchical-based partition based on prefixes, where the third example of Figure 20 partitions O by their namespace. This may lead to increased locality of data with the same prefix [113], where different levels of prefix can be chosen to enable balancing, but choosing prefixes that offer balanced partitions is non-trivial.
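Hierarchical partitioning can be sketched as longest-prefix routing over a namespace table; the table below is entirely hypothetical and would in practice be tuned (and re-split at deeper prefix levels) to balance load.

```python
# Hypothetical prefix-to-machine table, ordered longest-prefix-first so the
# most specific namespace wins.
PREFIX_TABLE = [
    ("http://example.org/a/", 0),
    ("http://example.org/", 1),
    ("http://other.org/", 2),
]

def machine_for_iri(iri, default=0):
    """Route an IRI to the machine owning its longest matching prefix."""
    for prefix, machine in PREFIX_TABLE:
        if iri.startswith(prefix):
            return machine
    return default  # e.g., literals or unknown namespaces
```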
8.2 Graph-based Partitioning
Graph-based partitioning takes into consideration the entire graph when computing a partition. A common strategy is to apply a k-way partition of the RDF graph G [119]. ... Edges between partitions may be replicated in the partitions they connect. Another alternative is to k-way partition the line graph of the RDF graph: an undirected graph where each triple is a node, and triples sharing a subject or object have an edge between them. Finding an optimal k-way partition is intractable, where approximations are thus necessary for large-scale graphs, ...
** Cluster/group into k groups for k partitions. Must be recomputed when the graph is modified **
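The line-graph construction mentioned above can be sketched as follows: one node per triple, with an edge between any two triples that share a subject or object term. The actual k-way partitioning of this graph (e.g., by a METIS-style partitioner) is omitted here.

```python
from collections import defaultdict
from itertools import combinations

def line_graph(triples):
    """Edges of the line graph: pairs of triple indices sharing an s/o term."""
    by_term = defaultdict(list)
    for i, (s, _, o) in enumerate(triples):
        by_term[s].append(i)
        by_term[o].append(i)
    edges = set()
    for ids in by_term.values():
        for a, b in combinations(sorted(set(ids)), 2):
            edges.add((a, b))
    return edges

g = [("a", "p", "b"), ("b", "q", "c"), ("d", "r", "e")]
# triples 0 and 1 share the term b; triple 2 is isolated
```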
8.3 Query-based Partitioning
While the previous partitioning schemes only consider the data, other partitioning methods are (also) based on queries. Workload-based partitioning schemes identify common joins in query logs that can be used to partition or replicate parts of the graph in order to ensure that high-demand joins can be pushed to individual machines. Partitioning can then be a priori, for example, based on a query log; or dynamic (aka. adaptive), where the partitions change as queries are received.
** Partitioning based on access patterns **
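A simple way to mine a query log for high-demand joins, sketched below under the assumption that the log is a list of BGPs: count how often pairs of constant predicates co-occur in one BGP. Frequently co-occurring predicates are candidates for co-location (or replication) on a single machine.

```python
from collections import Counter
from itertools import combinations

def join_demand(log):
    """log: iterable of BGPs, each a list of (s, p, o) triple patterns."""
    demand = Counter()
    for bgp in log:
        preds = sorted({p for (_, p, _) in bgp if not p.startswith("?")})
        demand.update(combinations(preds, 2))  # one count per co-occurring pair
    return demand

# Hypothetical two-query log:
log = [
    [("?x", "name", "?n"), ("?x", "age", "?a")],
    [("?x", "name", "?n"), ("?x", "age", "?a"), ("?x", "city", "?c")],
]
# ('age', 'name') co-occurs in both queries
```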
8.4 Replication
Rather than partitioning data, data can also be replicated across partitions. This may vary from replicating the full graph on each machine, such that queries can be answered in full by any machine to increase query throughput (used, e.g., by DREAM [88]), to replicating partitions that are in high-demand (e.g., containing schema data, central nodes, etc.) so that more queries can be evaluated on individual machines and/or machines have equal workloads that avoid hot-spots (used, e.g., by Blazegraph [224] and Virtuoso [69]).
** Availability (partitioning) vs. load balancing (replication) **
8.5 Discussion
....
9 Systems and Benchmarks
....
10.1 Current trends
Native storage techniques for graphs move away from relational-style schemata for RDF, and rather focus on optimizing for the compression and navigation of RDF as a graph, with techniques such as index-free adjacency, tensor-based storage, and other graph-based representations. Indexing likewise has evolved to consider entity-based (i.e., node-based) schemes, path indexes, and structural indexes based on summarizing the graph structure of RDF data. While join processing over RDF is still largely inspired by techniques for relational databases, algorithms based on sideways information passing, multi-way joins, worst-case optimal joins, etc., have been shown to work particularly well on RDF graphs (e.g., given their fixed arity). In terms of query processing, features such as property paths and graph-based recursion go beyond what is considered in typical relational database management, with increased attention being paid to supporting graph analytics in the RDF/SPARQL setting. Regarding modern hardware, following broader trends, many works now leverage NoSQL systems and distributed processing frameworks in order to scale RDF stores across multiple machines and handle new types of workloads.
10.2 Research Challenges and Future Directions
Other challenges have only been occasionally or partially addressed by the literature, where we highlight:
Dynamics: Many of the surveyed works assume static data, and do not handle updates gracefully. Thus, more work is needed on efficiently querying dynamic RDF graphs with SPARQL, including storage that efficiently supports reads and writes, incremental indexing, caching, etc.
** KGs are not static; see, e.g., Wikidata **
Query optimizations (beyond joins): Most works focus on optimizing joins and basic graph patterns. We found relatively few works optimizing features of SPARQL 1.1, such as property paths, negation, etc., where more work is needed.
Query volume: Leading SPARQL endpoints process millions of queries per day [197,148]; techniques for sustaining such volumes, such as caching and multi-query optimization, merit further attention.
Evaluation: Various benchmarks are now available for comparing different RDF stores, but they tend to focus on system-level comparisons, thus conflating techniques. More fine-grained evaluation at the level of individual techniques in the RDF/SPARQL setting would be very useful to understand the different trade-offs that exist. Also many benchmarks were proposed for SPARQL 1.0, where there is a lack of benchmarks including features such as property paths.
Integration: ...More work is needed on supporting or integrating features for these tasks in SPARQL. Interesting questions relate to efficiently supporting RDFS/OWL/Datalog reasoning, graph algorithms, knowledge graph embeddings, graph neural networks, etc., for RDF graphs within SPARQL engines.