
Creating and Querying Personalized Versions of Wikidata on a Laptop - Article Reading

Creating and Querying Personalized Versions of Wikidata on a Laptop

Hans Chalupsky, Pedro Szekely, Filip Ilievski, Daniel Garijo, Kartik Shenoy

ABSTRACT: Application developers today have three choices for exploiting the knowledge present in Wikidata: they can download the Wikidata dumps in JSON or RDF format, they can use the Wikidata API to get data about individual entities, or they can use the Wikidata SPARQL endpoint. None of these methods can support complex, yet common, query use cases, such as retrieval of large amounts of data or aggregations over large fractions of Wikidata. This paper introduces KGTK Kypher, a query language and processor that allows users to create personalized variants of Wikidata on a laptop. We present several use cases that illustrate the types of analyses that Kypher enables users to run on the full Wikidata KG on a laptop, combining data from external resources such as DBpedia. The Kypher queries for these use cases run much faster on a laptop than the equivalent SPARQL queries on a Wikidata clone running on a powerful server with 24h time-out limits.

1 Introduction

Wikidata is also highly expressive, using a reification model where each statement includes qualifiers (e.g., to indicate temporal validity) and references (which provide the source(s) from which the statement comes).

Context in Wikidata

Kypher, the query language and processor of the KGTK Knowledge Graph Toolkit: it can be used to query {any} RDF KG

2 Background

{KGTK} is a comprehensive framework for the creation and exploitation of large hyper-relational KGs.

KGTK can process Wikidata-sized KGs, with billions of edges, on a laptop. Kypher (kgtk query) is one of 55 commands available in KGTK.
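For a first impression of what a kgtk query invocation looks like (a minimal sketch, not taken from the paper; the file name claims.tsv and the pattern are illustrative), the command below counts the instances (P31) of each class in a KGTK edge file, i.e., a TSV file with the columns id, node1, label and node2:

    # count instances (P31) per class and list the 10 largest classes
    kgtk query -i claims.tsv \
         --match '(item)-[:P31]->(class)' \
         --return 'class, count(item) as instances' \
         --order-by 'count(item) desc' \
         --limit 10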

3 Kypher query language and processor

Kypher stands for KGTK Cypher. Cypher [2] is a declarative graph query language originally developed at Neo4j. OpenCypher is a corresponding open-source development effort for Cypher which forms the basis of the new Graph Query Language (GQL).

We chose Cypher since its ASCII-art pattern language makes it easy even for novices to express complex queries over graph data. Kypher adopts many aspects of Cypher’s query language, but has some important differences. Most notably, KGTK and therefore Kypher do not use the property graph data model assumed by Cypher.

To implement Kypher queries, we translate them into SQL and execute them on SQLite, a lightweight file-based SQL database.
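As a rough sketch of this translation (not taken from the paper; the file and table names are illustrative), a single-clause Kypher pattern becomes a plain SELECT over the SQLite table that Kypher builds from the input file on first use:

    # classes of Douglas Adams (Q42) via the instance-of property (P31)
    kgtk query -i claims.tsv \
         --match '(:Q42)-[:P31]->(class)' \
         --return 'class'
    # is rewritten into approximately:
    #   SELECT graph_1.node2 AS class
    #   FROM graph_1
    #   WHERE graph_1.node1 = 'Q42' AND graph_1.label = 'P31'

The imported table and its indexes are kept on disk in the "graph cache", so subsequent queries skip the import step.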

4 Use Cases

Notebook at -> https://github.com/usc-isi-i2/kgtk-at-2021-wikidata-workshop/

https://github.com/usc-isi-i2/kgtk-notebooks

Datasets at -> https://zenodo.org/record/5139550

6 Discussion and Conclusions

The main objective of KGTK and Kypher is to democratize the exploitation of Wikidata so that anyone with modest computing resources can take advantage of the vast amounts of knowledge present in Wikidata. Our tools focus on use cases that use large portions of Wikidata to distill new knowledge.

Kypher is not meant to address use cases that require the most up-to-date data in Wikidata. KGTK uses the Wikidata JSON dumps published every few days, and the KGTK workflow to process the JSON dump takes one day.

The comparison with the Wikidata SPARQL endpoints is preliminary as we have not controlled for caching in the triple store and in the operating system, or performed systematic variations of the complexity of the queries.       

The reasons listed below for KGTK performing better in the tests are still speculative:

1. Compact data model: the KGTK data model allows us to translate 1.2B Wikidata statements very directly into 1.2B edges, while the RDF translation requires reification and generates O(10B) triples. KGTK also does not require the use of namespaces which makes data values more compact.

Reification in RDF increases the number of triples (at least 2x)
Namespaces could be fixed in the triple stores so that the full URI would not have to be stored

2. Smaller database size: more compact data translates directly into smaller database sizes, for example, 142GB for the Kypher graph cache vs. 718GB for the local Wikidata endpoint. This gives generally better locality for table and index lookups and better caching of data pages.

A consequence of 1.

3. Specialized tables: representing specialized data slices such as P279star in their own graph tables makes their reuse very efficient and their indexes more focused, compact, and cache-friendly.

Partitioning (see the query sketch after this list)

4. Read-only processing: Kypher does not need to support fine-grained updates of tables and indexes, which need to be supported by the public Wikidata endpoint. This requires additional machinery that slows down performance.

Read-only vs. update: no locking required

5. Use case selection: triple stores and databases are optimized to support a large number of use cases. Our set of use cases samples a small slice of that space, and performance might be very different for other types of queries.

This is not a benchmark
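To illustrate point 3 with a concrete (hypothetical) query, the sketch below assumes the p31 and p279star slices from the paper's use-case datasets and joins them to retrieve all items whose class is Q5 (human) or one of its subclasses; each slice lives in its own graph table with its own compact index:

    # items whose class is Q5 or a subclass of it, using the precomputed slices
    kgtk query -i p31.tsv --as p31 \
               -i p279star.tsv --as p279star \
         --match 'p31: (x)-[:P31]->(c), p279star: (c)-[:P279star]->(:Q5)' \
         --return 'distinct x'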

Comments

  1. I downloaded the code and am downloading the use-case data on my laptop. But I have just learned that kgtk does not run on Windows, only on Linux and Mac, and I do not have enough space for the data on the DI cloud VM. For now I will not be able to reproduce it ....

  2. The KGTK timings do not include the steps that convert the Wikidata dump into KGTK format (which take 24 hours), nor the time to generate the separate files. The SPARQL queries on the local and public endpoints run over the complete database, not over a pre-filtered one.

    "The p31 file records the class of every instance, and the p279star file records
    all the super-classes of every class using a new property called P279star. These
    properties are commonly used so they are provided as separate files for the
    convenience of users"


