Graph Databases: Their Power and Limitations

Graph Databases: Their Power and Limitations - Article Reading

Jaroslav Pokorný
IFIP International Federation for Information Processing 2015 K. Saeed and W. Homenda (Eds.): CISIM 2015, LNCS 9339, pp. 58–69, 2015.
DOI: 10.1007/978-3-319-24369-6_5

Definitions

A graph database is any storage system that uses graph structures, with nodes and edges, to represent and store data. Both nodes and edges are identified by a unique identifier.

The property graph model is based on a data structure that graph theory calls a labelled, directed, attributed multigraph.
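As a minimal sketch of what "labelled, directed, attributed multigraph" means in practice, the Python below keeps nodes and edges in dictionaries keyed by their identifiers. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `edges_between`) are hypothetical and chosen only for illustration; they are not taken from the paper or from any product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    labels: set = field(default_factory=set)        # labelled
    properties: dict = field(default_factory=dict)  # attributed


@dataclass
class Edge:
    id: str
    source: str   # directed: source -> target
    target: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), dict(properties))

    def add_edge(self, edge_id, source, target, label, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already exist")
        # Edges are keyed by their own id, so several edges may connect
        # the same pair of nodes (multigraph).
        self.edges[edge_id] = Edge(edge_id, source, target, label, dict(properties))

    def edges_between(self, source, target):
        return [e for e in self.edges.values()
                if e.source == source and e.target == target]


g = PropertyGraph()
g.add_node("n1", ["Person"], name="Ana")
g.add_node("n2", ["Person"], name="Bruno")
g.add_edge("e1", "n1", "n2", "KNOWS", since=2015)
g.add_edge("e2", "n1", "n2", "WORKS_WITH")
```

Here `g.edges_between("n1", "n2")` returns both edges, while the reverse direction returns none, which is the directed multigraph behaviour the definition describes.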

A hypergraph is a generalization of the concept of a graph, in which the edges are substituted by hyperedges. If a regular edge connects two nodes of a graph, then a hyperedge connects an arbitrary set of nodes.
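A small sketch of the difference, assuming hyperedges are stored simply as named sets of node identifiers (the names `hyperedges`, `incident_hyperedges` and `is_ordinary_graph` are illustrative):

```python
# Each hyperedge is a set of nodes of arbitrary size.
hyperedges = {
    "h1": frozenset({"a", "b", "c"}),  # connects three nodes at once
    "h2": frozenset({"c", "d"}),       # size 2: looks like a regular edge
}


def incident_hyperedges(node, edges=hyperedges):
    """Names of all hyperedges that contain the given node."""
    return sorted(name for name, members in edges.items() if node in members)


def is_ordinary_graph(edges=hyperedges):
    """True when every hyperedge connects exactly two nodes."""
    return all(len(members) == 2 for members in edges.values())
```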

Graph Database Technology: Graph Storage, Graph Querying, Scalability, Transaction Processing

Graph databases provide native processing capabilities, including at least a property called index-free adjacency, meaning that every node is directly linked to its neighbouring nodes. A database engine that uses index-free adjacency is one in which each node maintains direct references to its adjacent nodes; each node therefore acts as an index of other nearby nodes, which is much cheaper than using global indexes.
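The contrast can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular engine stores data: `AdjNode` holds direct object references to its neighbours, while `neighbours_via_global_index` (a hypothetical helper) must first search a global, sorted edge index.

```python
import bisect


class AdjNode:
    """Node with index-free adjacency: it points straight at its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct references to other AdjNode objects

    def link(self, other):
        self.neighbours.append(other)


def neighbours_via_global_index(sorted_edges, name):
    """Look up neighbours in a global index: a sorted list of (source, target).

    Every lookup pays for a binary search over ALL edges before it can
    read the neighbours, unlike following references from the node itself.
    """
    i = bisect.bisect_left(sorted_edges, (name, ""))
    found = []
    while i < len(sorted_edges) and sorted_edges[i][0] == name:
        found.append(sorted_edges[i][1])
        i += 1
    return found


a, b, c = AdjNode("a"), AdjNode("b"), AdjNode("c")
a.link(b)
a.link(c)
b.link(c)
global_index = sorted([("a", "b"), ("a", "c"), ("b", "c")])
```

Following `a.neighbours` costs time proportional to the number of neighbours only, whereas the index lookup also depends on the total number of edges in the graph.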

Among more complex queries, subgraph and supergraph queries are very common. They belong to the rather traditional class of queries based on exact matching.

Other typical queries include breadth-first/depth-first search, path and shortest path finding, finding cliques or dense subgraphs, finding strongly connected components, etc. Algorithms used for such complex queries often need iterative computation.
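As one example of such an iterative query, here is a minimal breadth-first search that returns a shortest path in an unweighted directed graph. The graph is assumed to be a plain dictionary from node to list of successors; the function name `shortest_path` is illustrative.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for successor in adjacency.get(node, ()):
            if successor not in parents:
                parents[successor] = node
                queue.append(successor)
    return None


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
```

The loop processes the graph level by level, which is the iterative pattern the paragraph refers to.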

Big Graphs often require approximate matching. When structural relaxation is allowed, we speak of structural similarity queries.

>> Schema-less: different but equivalent queries for the same request, UNION

Sharding (or graph partitioning) is crucial to making graphs scale. Scaling graph data by distributing it across multiple machines is much more difficult than scaling the simpler data in other NoSQL databases, but it is possible.

The reason is the very nature of the way graph data is connected. When distributing a graph, we want to avoid having relationships that span machines as much as possible; this is called the minimum point-cut problem. But what looks like a good distribution one moment may no longer be optimal a few seconds later. Typically, graph partition problems fall under the category of NP-hard problems.
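To make "relationships that span machines" concrete, the sketch below counts how many edges cross partitions for a given assignment of nodes to machines. It only evaluates a partition; it does not search for a good one, which is the hard part. The names `edge_cut`, `good` and `bad` are illustrative.

```python
def edge_cut(edges, assignment):
    """Number of edges whose endpoints live on different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

# Each triangle on its own machine: only the bridge is cut.
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

# A careless split of the same nodes cuts most of the edges.
bad = {"a": 0, "d": 0, "e": 0, "b": 1, "c": 1, "f": 1}
```

Both assignments place three nodes on each machine, yet the first cuts one edge and the second cuts five, so traversals over the second layout would cross the network far more often.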

Scaling is usually connected with three things: (1) scaling for large datasets, (2) scaling for read performance, and (3) scaling for write performance.

Scaling for writes can be accomplished by scaling vertically, but at some point, for very heavy write loads, it requires the ability to distribute the data across multiple machines. This is the real challenge. For example, Titan is a highly scalable OLTP graph database system optimized for thousands of users concurrently accessing and updating one Big Graph.

Categories of Graph Databases, Triplestores.

GraphDB™ is an RDF triple store that can perform semantic inference at scale, allowing users to create new semantic facts from existing facts. GraphDB™ is built on OWL (Web Ontology Language). It uses ontologies that allow the repository to automatically reason about the data. AllegroGraph also supports reasoning and ontology modelling.

https://www.ontotext.com/products/graphdb/
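The idea of deriving new facts from existing ones can be illustrated with a toy forward-chaining loop over triples. This is a hand-written sketch with two RDFS-style rules (subclass transitivity and type propagation); it is not the GraphDB or AllegroGraph reasoner, and the `ex:` identifiers are made up for the example.

```python
def infer(triples):
    """Apply two RDFS-style rules until no new triple can be derived."""
    facts = set(triples)
    while True:
        derived = set()
        for s, p, o in facts:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in facts:
                # (s subClassOf o) and (o subClassOf o2) => (s subClassOf o2)
                if p2 == "rdfs:subClassOf" and s2 == o:
                    derived.add((s, "rdfs:subClassOf", o2))
                # (s2 type s) and (s subClassOf o) => (s2 type o)
                if p2 == "rdf:type" and o2 == s:
                    derived.add((s2, "rdf:type", o))
        if derived <= facts:  # fixpoint reached: nothing new
            return facts
        facts |= derived


explicit = {
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:rex", "rdf:type", "ex:Dog"),
}
inferred = infer(explicit)
```

After the loop, `inferred` also contains the implicit facts that `ex:rex` is a mammal and an animal, none of which were stated explicitly.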

A list of requirements that customers often have when considering a triple store is given in [10]:

inferring,
integration with text mining pipelines,
scalability,
extensibility,
enterprise resilience,
data integration and identity resolution,
semantics in the cloud,
semantic expertise.

Limitations

Declarative querying: Most commercial graph databases cannot be queried using a declarative language. This also implies a lack of query optimization capabilities.

Data partitioning: Most graph databases do not include the functionality to partition and distribute data in a computer network. It is difficult to partition a graph in a way that would not result in most queries having to access multiple partitions.

Model restrictions: The options for defining data schemas and constraints are limited in graph databases. Therefore, data inconsistencies can quickly reduce their usefulness.

>> SPIN and SHACL for triple stores
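When the database itself cannot enforce a schema, checks of this kind tend to end up in application code. The sketch below is a hand-rolled required-property check over labelled nodes; it is only an illustration of the kind of constraint meant here and is not SHACL or SPIN syntax. The `REQUIRED` table and the `violations` function are hypothetical.

```python
# Hypothetical constraint table: label -> properties every such node must have.
REQUIRED = {
    "Person": {"name"},
    "Paper": {"title", "year"},
}


def violations(nodes, required=REQUIRED):
    """Return (node_id, missing_property) pairs, sorted for stable output.

    `nodes` maps a node id to a (label, properties) pair.
    """
    problems = []
    for node_id, (label, props) in sorted(nodes.items()):
        missing = required.get(label, set()) - set(props)
        for key in sorted(missing):
            problems.append((node_id, key))
    return problems


nodes = {
    "n1": ("Person", {"name": "Ana"}),
    "n2": ("Paper", {"title": "Graph Databases"}),  # no "year" property
}
```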

Graph algorithms: More complex graph algorithms are needed in practice. The ideal graph database should understand analytic queries that go beyond k-hop queries for small k. The authors of [9] compared the performance of 12 open source graph databases using four fundamental graph algorithms (e.g. the single source shortest path problem and PageRank) on networks containing up to 256 million edges. Surprisingly, the most popular graph databases achieved the worst results in these tests. Current graph databases (like relational databases) tend to prioritize low latency query execution over high-throughput data analytics.
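PageRank is a good example of an analytic query that touches the whole graph repeatedly rather than a small neighbourhood. The following is a minimal power-iteration sketch, assuming every node appears as a key of `out_links` and using the commonly quoted damping factor 0.85; it is not the implementation benchmarked in [9].

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping node -> list of successor nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    nxt[t] += damping * rank[v] / n
        rank = nxt
    return rank


links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(links)
```

Each iteration scans every edge once, which is the high-throughput access pattern that engines tuned for low latency lookups handle poorly.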

Parallelisation: In the context of Big Graphs there is a need for parallelisation of graph data processing algorithms when the data is too big to handle on one server.

Heterogeneous and uncertain graph data: There is a need to find automated methods of handling the heterogeneity, incompleteness and inconsistency between different Big Graph data sets that need to be semantically integrated in order to be effectively queried or analysed.

Design of graph databases: As with traditional databases, some attempts to develop design models and tools have appeared recently. In [3], the authors propose a model-driven, system-independent methodology for the design of graph databases starting from an ER-model conceptual schema.

3. De Virgilio, R., Maccioni, A., Torlone, R.: Model-driven design of graph databases. In: Yu, E., Dobbie, G., Jarke, M., Purao, S. (eds.) ER 2014. LNCS, vol. 8824, pp. 172–185. Springer, Heidelberg (2014)

>> This one appeared in my systematic review

Need for a benchmark: The performance of querying graph data can depend significantly on graph properties. The benchmarks built, e.g., for RDF data are mostly focused on scaling and not on querying. Also, benchmarks covering a variety of graph analysis tasks would help towards evaluating and comparing the expressive power and the performance of different graph databases and frameworks.

Developing heuristics for some hard graph problems: For example, partitioning of large-scale dynamic graph data for efficient distributed processing belongs among these problems, given that the classical graph partitioning problem is NP-hard.

Graph pattern matching: New semantics and algorithms for graph pattern matching over distributed graphs are in development, given that the classical subgraph isomorphism problem is NP-complete.
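To show why exact pattern matching is expensive, here is a brute-force matcher that tries every injective assignment of pattern nodes to graph nodes. It checks the non-induced variant (every pattern edge must exist in the graph; extra graph edges are allowed) on directed edges. The function name `find_subgraph_match` is illustrative, and this is deliberately the naive approach, not one of the newer algorithms the paragraph mentions.

```python
from itertools import permutations


def find_subgraph_match(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Return a dict pattern node -> graph node, or None if no match exists."""
    graph_edges = set(graph_edges)
    # permutations() yields every ordered choice of distinct graph nodes,
    # so the number of candidates grows factorially with the pattern size.
    for candidate in permutations(graph_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all((mapping[u], mapping[v]) in graph_edges for u, v in pattern_edges):
            return mapping
    return None


triangle_nodes = ["x", "y", "z"]
triangle_edges = [("x", "y"), ("y", "z"), ("z", "x")]
data_nodes = [1, 2, 3, 4]
data_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
```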

Compressing graphs: Compressing graphs for matching without decompression is possible. Combining parallelism with compressing or partitioning is also very interesting.

Integration of graph data: In the context of Big Data, query formulation and evaluation techniques to assist users querying heterogeneous graph data are needed.

Visualization: Improvement of human-data interaction is fundamental, particularly a visualization of large-scale graph data, and of query and analysis results.

Graph stream processing: Developing algorithms for processing Big Graph data streams with the goal of computing properties of a graph without storing the entire graph.
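One simple instance of this idea is counting connected components of an undirected graph while reading its edges once. The sketch below keeps a union-find structure with one entry per node and never stores the edges themselves; the function name `count_components` is illustrative, and isolated nodes that never appear in an edge are not counted.

```python
def count_components(edge_stream):
    """Count connected components from a one-pass stream of (u, v) edges."""
    parent = {}  # one entry per node seen; edges are discarded after use

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = 0
    for u, v in edge_stream:
        for node in (u, v):
            if node not in parent:
                parent[node] = node
                components += 1
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
            components -= 1
    return components


# The stream can be any iterator, e.g. lines read from a socket or a file.
stream = iter([(1, 2), (3, 4), (2, 3), (5, 6)])
```

Memory use grows with the number of nodes rather than the number of edges, which is the trade-off that makes this kind of property computable on a stream.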


Comments

Veronica dos Santos, 30 April 2021 at 10:13
I read the article again to identify potential research problems: modelling, integrity constraints and business rules are still my focus.