
Scaling Wikidata Query Service - WikidataCon 2021

Scaling Wikidata Query Service - unlimited access to all the world’s knowledge for everyone is hard

Speaker(s): Mike Pham, Guillaume Lederrey, Adam Shorland
Video: https://www.youtube.com/watch?v=oV4qelj9fxM

WDQS = Wikidata Query Service, the largest public SPARQL interface to Wikidata
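
For context, a minimal example of the kind of SPARQL query WDQS answers at https://query.wikidata.org/sparql (the query itself is illustrative, not from the talk):

    # Cities with more than 5 million inhabitants (illustrative)
    SELECT ?city ?cityLabel ?population WHERE {
      ?city wdt:P31 wd:Q515 ;         # instance of: city
            wdt:P1082 ?population .   # population
      FILTER(?population > 5000000)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10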

Blazegraph -> 11 public servers in 2 data centers

As of Oct 2021: 95 million Wikidata entities, 13.2 billion triples

"WDQS is hitting allocator limit on Blazegraph" - https://phabricator.wikimedia.org/T213210 

Blazegraph has no sharding support, so ever larger disks are needed. The database has approached, and on some servers exceeded, the allocator limit permitted by Blazegraph. Scalability is vertical only.

Blazegraph is no longer actively maintained. From the chat box, https://github.com/blazegraph/database and https://en.wikipedia.org/wiki/Blazegraph: it was developed by a company called SYSTAP, which was acquired (or its developers hired) by Amazon. It is written in Java. The source code is on GitHub but appears to be some years old; conceivably it is still developed within Amazon.

Last stable release: 2.1.5 (19 March 2019)

From DB-Engines: "Amazon has acquired Blazegraph's domain and (probably) product. It is said that Amazon Neptune is based on Blazegraph."

Catastrophic scenarios: being unable to load more data, or being unable to keep the query interface up to date with changes. Hence a Disaster Recovery and Mitigation Plan.

Options to reduce the occurrence and impact of these catastrophes: temporarily removing data, or splitting data within Blazegraph, while keeping it in Wikidata so there is no permanent loss. Wikidata's rapid growth is what is causing this problem.

Short-term strategies (mitigation): delete rarely accessed data from the query base and keep it in dumps. For example: vertical slicing = scholarly articles (40% of the entities in Wikidata are scholarly articles, but they figure in only 2% of queries); horizontal slicing = non-English descriptions and external IDs.
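
As a rough illustration of the vertical-slicing numbers, this is the kind of query one could use to count scholarly-article items (wdt:P31 / wd:Q13442814 is the usual modeling); note that a full count at this scale tends to hit the timeout:

    # Count items that are instances of "scholarly article" (Q13442814)
    SELECT (COUNT(?item) AS ?count) WHERE {
      ?item wdt:P31 wd:Q13442814 .
    }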

If Wikidata moves the bibliographic data to another base / another KG, this "pressure" decreases.

Federation is probably the long-term solution, that is, permanently splitting the data across different Wikibase instances.
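
In SPARQL terms, federation means SERVICE clauses. A sketch of what a cross-instance query could look like, assuming the bibliographic data moved to a separate Wikibase (the endpoint URL is hypothetical, and reuse of the same property IDs across instances is assumed):

    # Works by Douglas Adams (Q42), held in a hypothetical external instance
    SELECT ?work WHERE {
      SERVICE <https://bibliographic.example.org/sparql> {
        ?work wdt:P50 wd:Q42 .   # P50 = author
      }
    }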

WDQS query lag vs. updates: how to balance them?

They identified the need for a "scaling strategy", i.e. a plan / roadmap for what to scale and when, how, and for which purposes.

WDQS User Survey: profiling users via a questionnaire, including what they consider most important.

Timeout length would be the most impactful factor.

A good share of users come from academia, followed by corporations.

Discusses "freshness" of data. Many users do not need real-time updates, although when asked they say they want them. (Want vs. need, ideal vs. essential.)

An alternative is "async results": a design that allows more time per query (fewer timeouts), even if recent edits are not included in the answer set.
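
Purely as a sketch of the "async results" idea (no such WDQS API exists; the endpoint and field names below are invented for illustration):

    import time
    import requests

    QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

    # Hypothetical async pattern: submit the query, then poll for the result,
    # instead of holding one HTTP request open until a ~60 s timeout fires.
    job = requests.post("https://query.example.org/sparql/async",  # invented endpoint
                        data={"query": QUERY}).json()

    while True:
        status = requests.get("https://query.example.org/sparql/async/" + job["id"]).json()
        if status["state"] == "done":
            print(status["results"])
            break
        time.sleep(5)  # answers may lag behind recent edits, which many users can tolerate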

Users are "satisfied" with SPARQL.

From the chat box: the underlying data is stored in B-tree indexes of subject-object-predicate triples, plus size optimizations (details unclear).
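
A toy sketch of that idea: a triple store keeps several sorted (B-tree-like) permutations of each triple, so that any lookup pattern can be answered with a range scan. Blazegraph's real index layout is more involved; this only shows the principle:

    from bisect import bisect_left, insort

    class ToyTripleStore:
        """Sorted permutations of triples, standing in for B-tree indexes."""
        def __init__(self):
            self.spo, self.pos, self.osp = [], [], []

        def add(self, s, p, o):
            insort(self.spo, (s, p, o))
            insort(self.pos, (p, o, s))
            insort(self.osp, (o, s, p))

        def by_predicate(self, p):
            # Range scan on the POS permutation: all triples with predicate p
            i = bisect_left(self.pos, (p,))
            while i < len(self.pos) and self.pos[i][0] == p:
                p_, o, s = self.pos[i]
                yield (s, p_, o)
                i += 1

    store = ToyTripleStore()
    store.add("Q42", "P31", "Q5")    # Douglas Adams  instance-of  human
    store.add("Q64", "P31", "Q515")  # Berlin         instance-of  city
    print(list(store.by_predicate("P31")))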

Next steps:

  1. identify and evaluate new back ends (bye-bye Blazegraph)
  2. take pressure off the system by creating alternative search services (like a "REST API") that do not burden the systems as much (see the example after this list)
  3. make it easier to run other instances of Wikibase; make it easier to install and maintain
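
One lighter-weight access path already exists: entity data can be fetched directly as JSON via Special:EntityData, with no SPARQL involved (a sketch; the REST API mentioned in the talk would serve a similar purpose):

    import requests

    # Fetch one entity as JSON without touching the SPARQL endpoint
    data = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.json").json()
    print(data["entities"]["Q42"]["labels"]["en"]["value"])  # "Douglas Adams"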

Survey for user feedback at tinyurl.com/WDQSsurvey ... I suggested searching for properties by label and description (EntitySearch).
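
For reference, the existing wbsearchentities action API already covers part of that suggestion (matching properties by label); how well it handles description search is the open question:

    import requests

    # Search Wikidata properties by label via the MediaWiki action API
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities",
        "search": "population",
        "language": "en",
        "type": "property",
        "format": "json",
    }).json()
    for hit in resp["search"]:
        print(hit["id"], hit.get("label"), "-", hit.get("description", ""))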

Give feedback on new back end possibilities or criteria for them.  

- a list of criteria to evaluate graph backends for the purpose of scaling WDQS (T291207)

My comment on this item:

Consider graph databases that support RDF-star and SPARQL-star, such as RDF4J, AnzoGraph and GraphDB. These are proposed extensions to the RDF and SPARQL standards that provide a more convenient way to annotate RDF statements and to query such annotations (Wikidata qualifiers and references), bridging the gap between the RDF world and the property graph world.

See W3C Draft Community Group Report 01 July 2021
https://www.w3.org/community/rdf-dev/2021/07/02/new-public-draft-of-the-rdf-star-report/

https://rdf4j.org/documentation/programming/rdfstar/
https://graphdb.ontotext.com/enterprise/devhub/rdf-sparql-star.html
https://cambridgesemantics.com/anzo-platform/
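
To make the suggestion concrete, a sketch of a qualified statement in Turtle-star and a SPARQL-star query over it (the modeling is illustrative, not the actual Wikidata RDF mapping):

    # Turtle-star: annotate a statement (Q42 educated-at Q691283) with a qualifier
    << wd:Q42 wdt:P69 wd:Q691283 >> pq:P580 "1971"^^xsd:gYear .

    # SPARQL-star: ask when the embedded statement started (P580 = start time)
    SELECT ?start WHERE {
      << wd:Q42 wdt:P69 wd:Q691283 >> pq:P580 ?start .
    }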

My comment on this item:

Create a query set (extracted from the WDQS logs) and a subset of Wikidata's data to benchmark graph databases against, in the style of TPC benchmarks.
Ask graph database vendors to test their products and publish the results to the community.

See
http://tpc.org/
https://github.com/socialsensor/graphdb-benchmarks
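
A minimal sketch of such a harness, assuming a file with one logged query per line and any SPARQL endpoint to measure (the endpoint URL and file name are placeholders):

    import time
    import requests

    ENDPOINT = "http://localhost:9999/sparql"  # candidate backend under test

    def run(query):
        """Execute one SPARQL query and return wall-clock latency in seconds."""
        t0 = time.monotonic()
        r = requests.post(ENDPOINT, data={"query": query},
                          headers={"Accept": "application/sparql-results+json"},
                          timeout=60)
        r.raise_for_status()
        return time.monotonic() - t0

    # One query per line, extracted from the WDQS logs
    with open("wdqs_query_log.rq") as f:
        latencies = sorted(run(q) for q in f if q.strip())

    print("p50:", latencies[len(latencies) // 2])
    print("p99:", latencies[int(len(latencies) * 0.99)])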

==================================================================

The list of criteria used to choose BlazeGraph is in this spreadsheet, along with the graph DBs considered at the time (it has comments from 2015):

https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

Databases listed in the spreadsheet (in the original, those that received no score were in italics). Some are well known, others lack any reference:

OrientDB
Titan
Neo4j
ArangoDB
GraphX
WDQ    
InfoGrid    
BlazeGraph    
Accumulo    
Postgres    
Virtuoso    
4store    
Apache Jena    
Build ElasticGraph    
Build SQL + GraphServ 

List of requirements (weights ranged from -10 to 30, and some had weight 0 (?)):

Project health (diversity of contributors)    30
Ability to maintain uptime    20
Experimentation console (repl for the graph)    10
Community Health (mailing list, irc, bugs and stuff)    10
Stability of storage layer    9
Handles ten thousand indexes natively (property values, qualifiers, etc)    9
Ability to handle lots of writes    9
Horizontal scalability    8
Query planner    8
Expressiveness of native query language    8
Maturity of distributed version    7
Simple index lookup (population = 101)    7
Indexed range lookup (population > 100)    7
Geospatial indexes (within 100 miles of the center of Berlin)    7
Traversal order rewriting    7
Top-n queries    7
Dealing with queries overusing resources (sandboxing)    6
WMF experience (aka ops/dev comfort level)    6
Data inference/materialization (automatically, with rules we can define)    6
Intersecting index lookup (population > 101, country = Germany)    5
Easy to support dump all results    5
Upstream support (bug fixes, patch reviews, etc)    5
Can expose native query language (useful because SPARQL and Cypher are powerful and people might be used to them)    4
Modularity (plug in other index stores, plugin in data types)    4
Indexes for multiple traversals    4
Online schema changes    4
Efficiently supports checking qualifiers and references (index or post filtering)    4
Fully free software (no "Enterprise" version) (stuff we need and stuff we may want to add is free software)    3
Cross-DC / multi-cluster replication    3
Vertex-centric indexes    3
Multi operation ACID    3
Easy to support query continuation (better than O(offset) paging)    3
Easy to type query language    2
Implements some standard spec (TinkerPop, something else?)    2
Packaging (deb) and puppetization    2
Well commented source (in case we have to hack on it)    2
Complex geospatial queries (find all points in polygon)    2
Full text indexes (stemming, ranking, etc)    2
JSON object querying support    0
Storage layer designed for graphs    0
Memory efficiency    0
Apache License    0
AGPL    0
GPL    0
Gremlin support    0
SPARQL support    0

Rough edges    -2
Work remaining    -5
Amount of hacking we'll have to do on the graph database layer    -10
