Creating and Querying Personalized Versions of Wikidata on a Laptop
Hans Chalupsky, Pedro Szekely, Filip Ilievski, Daniel Garijo, Kartik Shenoy
ABSTRACT: Application developers today have three choices for exploiting the knowledge present in Wikidata: they can download the Wikidata dumps in JSON or RDF format, they can use the Wikidata API to get data about individual entities, or they can use the Wikidata SPARQL endpoint. None of these methods can support complex, yet common, query use cases, such as retrieval of large amounts of data or aggregations over large fractions of Wikidata. This paper introduces KGTK Kypher, a query language and processor that allows users to create personalized variants of Wikidata on a laptop. We present several use cases that illustrate the types of analyses that Kypher enables users to run on the full Wikidata KG on a laptop, combining data from external resources such as DBpedia. The Kypher queries for these use cases run much faster on a laptop than the equivalent SPARQL queries on a Wikidata clone running on a powerful server with 24h time-out limits.
1 Introduction
Wikidata is also highly expressive, using a reification model where each statement includes qualifiers (e.g., to indicate temporal validity) and references (which provide the source(s) from which the statement comes).
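As a concrete illustration of this reification model, here is a minimal sketch of how such a qualified statement is represented in KGTK's tabular edge format: the qualifier edge uses the id of the main edge as its node1. The statement, edge ids, and date literal below are illustrative examples, not taken from the paper.

    id      node1   label   node2
    e1      Q76     P39     Q11696
    e1-1    e1      P580    ^2009-01-20T00:00:00Z/11

The first edge states that Barack Obama (Q76) held position (P39) President of the United States (Q11696); the second edge attaches a start-time qualifier (P580) by pointing at the first edge's id, using KGTK's ^-prefixed date literal with precision 11 (day).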
Context about Wikidata.
Kypher, the query language and processor of the KGTK Knowledge Graph Toolkit, can be used to query any RDF KG (once it has been imported into the KGTK format).
2 Background
KGTK is a comprehensive framework for the creation and exploitation of large hyper-relational KGs.
KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop. Kypher (kgtk query) is one of 55 commands available in KGTK.
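To give a flavor of the interface, a minimal Kypher invocation might look like the following sketch (the file name is illustrative; the pattern syntax follows the Kypher documentation, where (:Q42) matches the node with id Q42):

    kgtk query -i claims.tsv.gz \
         --match '(:Q42)-[:P31]->(c)' \
         --return 'c'

This retrieves the classes that Douglas Adams (Q42) is an instance of (P31). On first use the input file is imported into Kypher's graph cache, so subsequent queries reuse the cached table and its indexes.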
3 Kypher query language and processor
Kypher stands for KGTK Cypher. Cypher [2] is a declarative graph query language originally developed at Neo4j. openCypher is a corresponding open-source development effort for Cypher which forms the basis of the new Graph Query Language (GQL).
We chose Cypher since its ASCII-art pattern language makes it easy even for novices to express complex queries over graph data. Kypher adopts many aspects of Cypher’s query language, but has some important differences. Most notably, KGTK and therefore Kypher do not use the property graph data model assumed by Cypher.
To implement Kypher queries, we translate them into SQL and execute them on SQLite, a lightweight file-based SQL database.
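For example, the Q42 instance-of query sketched in Section 2 might translate into SQL roughly as follows (the graph_1 table name and the id/node1/label/node2 columns follow KGTK's tabular data model; the SQL actually generated by Kypher may differ):

    SELECT graph_1.node2 AS c
    FROM graph_1
    WHERE graph_1.node1 = 'Q42'
      AND graph_1.label = 'P31';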
4 Use Cases
Notebooks at -> https://github.com/usc-isi-i2/kgtk-at-2021-wikidata-workshop/
https://github.com/usc-isi-i2/kgtk-notebooks
Datasets at -> https://zenodo.org/record/5139550
6 Discussion and Conclusions
The main objective of KGTK and Kypher is to democratize the exploitation of Wikidata so that anyone with modest computing resources can take advantage of the vast amounts of knowledge present in Wikidata. Our tools focus on use cases that use large portions of Wikidata to distill new knowledge.
Kypher is not meant to address use cases that require the most up-to-date data in Wikidata. KGTK uses the Wikidata JSON dumps published every few days, and the KGTK workflow to process the JSON dump takes one day.
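The conversion step uses KGTK's import-wikidata command; a rough sketch of such an invocation is shown below (the option names and file names are assumptions based on typical KGTK usage, not taken from the paper):

    kgtk import-wikidata \
         -i wikidata-all.json.bz2 \
         --node nodes.tsv.gz \
         --edge edges.tsv.gz \
         --qual qualifiers.tsv.gz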
The comparison with the Wikidata SPARQL endpoints is preliminary as we have not controlled for caching in the triple store and in the operating system, or performed systematic variations of the complexity of the queries.
The reasons listed below for KGTK's better performance in the tests are still speculative:
1. Compact data model: the KGTK data model allows us to translate 1.2B Wikidata statements very directly into 1.2B edges, while the RDF translation requires reification and generates O(10B) triples (see the sketch after this list). KGTK also does not require the use of namespaces, which makes data values more compact.
Namespaces could be fixed in the triple stores so that the full URI would not have to be stored.
2. Smaller database size: more compact data translates directly into smaller database sizes, for example, 142GB for the Kypher graph cache vs. 718GB for the local Wikidata endpoint. This gives generally better locality for table and index lookups and better caching of data pages.
A consequence of item 1.
3. Specialized tables: representing specialized data slices such as P279star in their own graph tables makes their reuse very efficient and their indexes more focused, compact, and cache-friendly.
Partitioning.
4. Read-only processing: Kypher does not need to support fine-grained updates of tables and indexes, which the public Wikidata endpoint must support; that additional machinery slows down performance.
Read-only vs. update: no locking is needed.
5. Use case selection: triple stores and databases are optimized to support a large number of use cases. Our set of use cases samples a small slice of that space, and performance might be very different for other types of queries.
This is not a benchmark.
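To make item 1 concrete: the qualified statement sketched in the Introduction occupies two KGTK edges, while Wikidata's RDF dump represents it with a statement node and several triples, roughly as in this sketch (prefixes omitted and the statement id abbreviated; illustrative only):

    wd:Q76  p:P39  wds:Q76-STATEMENT-ID .
    wds:Q76-STATEMENT-ID  ps:P39  wd:Q11696 ;
                          pq:P580 "2009-01-20T00:00:00Z"^^xsd:dateTime .

Each qualifier and reference adds further triples to the statement node, which helps explain the roughly 10x blow-up in triple count.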
I downloaded the code and am downloading the use-case data to my laptop. But I just found out that kgtk does not run on Windows, only on Linux and Mac, and I have no space for the data on the DI cloud VM. For now I will not be able to reproduce it...
The KGTK timings do not include the steps of converting the Wikidata dump into KGTK format (which take 24 hours), nor the times to generate the separate files. The SPARQL queries against the local and public databases run over the complete database, not one that was previously filtered.
ResponderExcluir"The p31 file records the class of every instance, and the p279star file records
all the super-classes of every class using a new property called P279star. These
properties are commonly used so they are provided as separate files for the
convenience of users"
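As a sketch of how these two files combine, a query like the following should retrieve every entity that is an instance of human (Q5) or of one of its subclasses (the file names and graph aliases are illustrative):

    kgtk query -i p31.tsv.gz --as p31 \
               -i p279star.tsv.gz --as star \
         --match 'p31: (i)-[:P31]->(c), star: (c)-[:P279star]->(:Q5)' \
         --return 'distinct i'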