Jaroslav Pokorný;
IFIP International Federation for Information Processing 2015 K. Saeed and W. Homenda (Eds.): CISIM 2015, LNCS 9339, pp. 58–69, 2015.
DOI: 10.1007/978-3-319-24369-6_5
Definitions
A graph database is any storage system that uses graph structures, with nodes and edges, to represent and store data. Both nodes and edges are identified by a unique identifier.
The property graph model is based on a data structure known in graph theory as a labelled, directed, attributed multigraph.
A hypergraph is a generalization of the concept of a graph, in which the edges are replaced by hyperedges. Whereas a regular edge connects two nodes of a graph, a hyperedge connects an arbitrary set of nodes.
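The difference between a regular edge and a hyperedge can be sketched in a few lines of Python. This is only an illustration: the class names `Edge` and `HyperEdge`, the helper `incident_hyperedges`, and the sample node names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A regular directed edge: exactly two endpoints."""
    edge_id: str
    source: str
    target: str


@dataclass(frozen=True)
class HyperEdge:
    """A hyperedge: connects an arbitrary set of nodes."""
    edge_id: str
    members: frozenset


def incident_hyperedges(node, hyperedges):
    """Return the ids of all hyperedges that contain the given node."""
    return [h.edge_id for h in hyperedges if node in h.members]


# Hypothetical sample data.
knows = Edge("e1", "alice", "bob")
meeting = HyperEdge("h1", frozenset({"alice", "bob", "carol"}))
```

Here `knows` can only ever relate two nodes, while `meeting` relates three at once.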
Graph Database Technology: Graph Storage, Graph Querying, Scalability, Transaction Processing
Graph databases provide native graph processing capabilities, at a minimum a property called index-free adjacency, meaning that every node is directly linked to its neighbouring nodes. A database engine that uses index-free adjacency is one in which each node maintains direct references to its adjacent nodes; each node therefore acts as an index of other nearby nodes, which is much cheaper than using global indexes.
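A minimal in-memory sketch of index-free adjacency follows, assuming nodes are plain Python objects. The `Node` class and the helpers `connect` and `neighbour_ids` are hypothetical names chosen for illustration, not the storage layout of any particular engine.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    # Direct references to adjacent Node objects; traversal needs no global index.
    neighbours: list["Node"] = field(default_factory=list, repr=False, compare=False)


def connect(a: Node, b: Node) -> None:
    """Add a directed edge a -> b by storing a reference to b inside a."""
    a.neighbours.append(b)


def neighbour_ids(node: Node) -> list[str]:
    """Follow the stored references; cost depends only on this node's degree."""
    return [n.node_id for n in node.neighbours]


# Hypothetical sample data.
alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
connect(alice, bob)
connect(alice, carol)
```

Reading the neighbours of `alice` touches only the references held by `alice`, regardless of how many other nodes exist.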
Among the more complex queries, subgraph and supergraph queries are very common. They belong to the more traditional class of queries based on exact matching.
Other typical queries include breadth-first/depth-first search, path and shortest path finding, finding cliques or dense subgraphs, finding strongly connected components, etc. Algorithms used for such complex queries often need iterative computation.
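As an illustration of the iterative nature of these queries, here is a sketch of shortest path finding by breadth-first search on an unweighted graph. It assumes the graph is given as a dictionary of adjacency lists; the function name `shortest_path` and the sample graph are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one path with the fewest edges from source to target, or None."""
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adjacency.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical sample graph (directed).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
```

Each loop iteration expands one node, so the work grows with the part of the graph that has to be explored before the target is reached.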
In Big Graphs, approximate matching is often needed. When structural relaxation is allowed, we talk about structural similarity queries.
>> Schema-less, different but equivalent queries for the same request, UNION
Sharding (or graph partitioning) is crucial to making graphs scale. Scaling graph data by distributing it across multiple machines is much more difficult than scaling the simpler data in other NoSQL databases, but it is possible.
The reason is the very nature of how graph data is connected. When distributing a graph, we want to avoid having relationships that span machines as much as possible; this is called the minimum point-cut problem. But what looks like a good distribution one moment may no longer be optimal a few seconds later. Typically, graph partition problems fall under the category of NP-hard problems.
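To make the goal concrete, the sketch below counts how many relationships span machines for a given placement of nodes. It only evaluates a placement and does not search for the best one, which is the hard part. The function `cut_edges`, the sample edges, and the two placements are hypothetical.

```python
def cut_edges(edges, assignment):
    """Count edges whose two endpoints are placed on different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Hypothetical sample graph: two triangles joined by the single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("f", "d")]

# Placement that keeps each triangle on one machine.
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Placement that ignores the structure of the graph.
bad = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}
```

With `good`, only one relationship crosses machines; with `bad`, five of the seven do, so most traversals would need network hops.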
Scaling is usually connected with three things: (1) scaling for large datasets, (2) scaling for read performance, and (3) scaling for write performance.
Scaling for writes can be accomplished by scaling vertically, but at some point, for very heavy write loads, it requires the ability to distribute the data across multiple machines. This is the real challenge. For example, Titan is a highly scalable OLTP graph database system optimized for thousands of users concurrently accessing and updating one Big Graph.
Categories of Graph Databases, Triplestores.
GraphDB™ is an RDF triple store that can perform semantic inference at scale, allowing users to derive new semantic facts from existing facts. GraphDB™ is built on OWL (Web Ontology Language). It uses ontologies that allow the repository to automatically reason about the data. AllegroGraph also supports reasoning and ontology modelling.
https://www.ontotext.com/products/graphdb/
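The idea of deriving new facts from existing ones can be sketched with a tiny forward-chaining loop over triples. This is not GraphDB's reasoner or its API; it is a simplified illustration that applies only two RDFS-style rules (transitivity of `rdfs:subClassOf` and type inheritance), and the `ex:` resources are hypothetical.

```python
def infer(triples):
    """Apply two RDFS-style rules repeatedly until no new triples appear."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in facts:
                # (s subClassOf o) and (o subClassOf o2) => (s subClassOf o2)
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdfs:subClassOf", o2))
                # (s2 type s) and (s subClassOf o) => (s2 type o)
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))
        if new <= facts:  # fixpoint reached
            return facts
        facts |= new


# Hypothetical sample data.
data = [
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:rex", "rdf:type", "ex:Dog"),
]
closure = infer(data)
```

From three stated triples the loop derives three more, including that `ex:rex` is an `ex:Animal`, which was never stated explicitly.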
A list of requirements often raised by customers considering a triple store is given in [10]:
- inferring,
- integration with text mining pipelines,
- scalability,
- extensibility,
- enterprise resilience,
- data integration and identity resolution,
- semantics in the cloud,
- semantic expertise.
LIMITATIONS
Declarative querying: Most commercial graph databases cannot be queried using a declarative language. This also implies a lack of query optimization abilities.
Data partitioning: Most graph databases do not include the functionality to partition and distribute data in a computer network. It is difficult to partition a graph in a way that would not result in most queries having to access multiple partitions.
Model restrictions: The possibilities for defining a data schema and constraints are limited in graph databases. Therefore, data inconsistencies can quickly reduce their usefulness.
>> SPIN and SHACL for TripleStores
Graph algorithms: More complex graph algorithms are needed in practice. The ideal graph database should understand analytic queries that go beyond k-hop queries for small k. The authors of [9] compared the performance of 12 open source graph databases using four fundamental graph algorithms (e.g. single source shortest path and PageRank) on networks containing up to 256 million edges. Surprisingly, the most popular graph databases achieved the worst results in these tests. Current graph databases (like relational databases) tend to prioritize low latency query execution over high-throughput data analytics.
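To show why such analytic queries differ from a k-hop lookup, here is a sketch of PageRank by power iteration, which touches every node and edge on every pass. It is not the implementation benchmarked in [9]. The function `pagerank`, its default parameters, and the sample graph are assumptions; the code also assumes every link target appears as a key in `out_links`.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of targets."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank evenly over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


# Hypothetical sample graph: a cycle a -> b -> c -> a, plus d linking to a.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(links)
```

In this sample, `a` ends up with the highest rank because it has two incoming links, and `d` with the lowest because nothing links to it.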
Parallelisation: In the context of Big Graphs there is a need for parallelisation of graph data processing algorithms when the data is too big to handle on one server.
Heterogeneous and uncertain graph data: There is a need to find automated methods of handling the heterogeneity, incompleteness and inconsistency between different Big Graph data sets that need to be semantically integrated in order to be effectively queried or analysed.
Design of graph databases: As with traditional databases, some attempts to develop design models and tools have appeared recently. In [3], the authors propose a model-driven, system-independent methodology for the design of graph databases starting from an ER-model conceptual schema.
3. De Virgilio, R., Maccioni, A., Torlone, R.: Model-driven design of graph databases. In: Yu, E., Dobbie, G., Jarke, M., Purao, S. (eds.) ER 2014. LNCS, vol. 8824, pp. 172–185. Springer, Heidelberg (2014)
>> This appeared in my systematic review
Need for a benchmark: Querying graph data can significantly depend on graph properties. The benchmarks built, e.g., for RDF data are mostly focused on scaling and not on querying. Also benchmarks covering a variety of graph analysis tasks would help towards evaluating and comparing the expressive power and the performance of different graph databases and frameworks.
Developing heuristics for some hard graph problems: For example, partitioning of large-scale dynamic graph data for efficient distributed processing belongs among these problems, given that the classical graph partitioning problem is NP-hard.
Graph pattern matching: New semantics and algorithms for graph pattern matching over distributed graphs are in development, given that the classical subgraph isomorphism problem is NP-complete.
Compressing graphs: Compressing graphs for matching without decompression is possible. Combining parallelism with compressing or partitioning is also very interesting.
Integration of graph data: In the context of Big Data, query formulation and evaluation techniques to assist users querying heterogeneous graph data are needed.
Visualization: Improving human-data interaction is fundamental, particularly the visualization of large-scale graph data and of query and analysis results.
Graph stream processing: Developing algorithms for processing Big Graph data streams with the goal of computing properties of a graph without storing the entire graph.
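One property that can be maintained over an edge stream without keeping the edges is the number of connected components. The sketch below uses a union-find structure, so memory grows with the number of nodes seen, not the number of edges. The class `StreamingComponents` and the sample stream are hypothetical, and the graph is treated as undirected.

```python
class StreamingComponents:
    """Track the number of connected components of an undirected edge stream."""

    def __init__(self):
        self.parent = {}      # one entry per node seen so far
        self.components = 0

    def _find(self, x):
        if x not in self.parent:
            # First time this node is seen: it starts as its own component.
            self.parent[x] = x
            self.components += 1
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, u, v):
        """Consume one edge; the edge itself is not stored."""
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.components -= 1


def edge_stream():
    # Hypothetical sample stream; in practice this would be an unbounded source.
    yield from [("a", "b"), ("c", "d"), ("b", "c"), ("e", "f"), ("a", "d")]


sc = StreamingComponents()
for u, v in edge_stream():
    sc.add_edge(u, v)
```

After the stream is consumed, `sc.components` holds the current component count, and it can be read at any point while edges keep arriving.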
I read the article again to identify potential research problems: modelling, integrity constraints, and business rules are still my focus.
And with SPIN/SHACL they can be addressed using an ontology that describes the data model present in the TripleStore. In addition, new facts can be inferred (deductive database).