
Scaling Wikidata Query Service - WikidataCon 2021

Scaling Wikidata Query Service - unlimited access to all the world’s knowledge for everyone is hard

Speaker(s): Mike Pham, Guillaume Lederrey, Adam Shorland
Video: https://www.youtube.com/watch?v=oV4qelj9fxM

WDQS = Wikidata Query Service, the largest public SPARQL interface to Wikidata
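
For context, a minimal example of the kind of SPARQL query WDQS answers at https://query.wikidata.org/sparql (the query itself is illustrative, not from the talk):

    # Cities with more than 5 million inhabitants (illustrative)
    SELECT ?city ?cityLabel ?population WHERE {
      ?city wdt:P31 wd:Q515 ;         # instance of: city
            wdt:P1082 ?population .   # population
      FILTER(?population > 5000000)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10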

Blazegraph -> 11 public servers in 2 data centers

As of Oct 2021: 95 million Wikidata entities, 13.2 billion triples

"WDQS is hitting allocator limit on Blazegraph" - https://phabricator.wikimedia.org/T213210 

Blazegraph has no sharding support, so ever larger disks are needed. The database has approached, and on some servers exceeded, the allocator limit permitted by Blazegraph. Scalability is vertical only.

Blazegraph is no longer actively maintained. From the chat box, https://github.com/blazegraph/database and https://en.wikipedia.org/wiki/Blazegraph: it was developed by a company called SYSTAP, which was acquired (or its developers hired) by Amazon. It is written in Java. The source code is on GitHub but appears to be some years old; conceivably it is still developed within Amazon.

Last stable release: 2.1.5 (19 March 2019)

From DB-Engines: "Amazon has acquired Blazegraph's domain and (probably) product. It is said that Amazon Neptune is based on Blazegraph."

Catastrophic scenarios: being unable to load more data, or being unable to keep the query interface up to date with changes. Hence a Disaster Recovery and Mitigation Plan.

Options to reduce the occurrence and impact of these catastrophes: temporarily removing data, or splitting data within Blazegraph, while keeping it in Wikidata so there is no permanent loss. Wikidata's rapid growth is what is causing this problem.

Short-term strategies (mitigation): delete rarely accessed data from the query base and keep it in dumps. For example: vertical slicing = scholarly articles (40% of the entities in Wikidata are scholarly articles, but they figure in only 2% of queries); horizontal slicing = non-English descriptions and external IDs.
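
As a rough illustration of the vertical-slicing numbers, this is the kind of query one could use to count scholarly-article items (wdt:P31 / wd:Q13442814 is the usual modeling); note that a full count at this scale tends to hit the timeout:

    # Count items that are instances of "scholarly article" (Q13442814)
    SELECT (COUNT(?item) AS ?count) WHERE {
      ?item wdt:P31 wd:Q13442814 .
    }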

If Wikidata moves the bibliographic data to another base / another KG, this "pressure" decreases.

Federation is probably the long-term solution, that is, permanently splitting the data across different Wikibase instances.
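
In SPARQL terms, federation means SERVICE clauses. A sketch of what a cross-instance query could look like, assuming the bibliographic data moved to a separate Wikibase (the endpoint URL is hypothetical, and reuse of the same property IDs across instances is assumed):

    # Works by Douglas Adams (Q42), held in a hypothetical external instance
    SELECT ?work WHERE {
      SERVICE <https://bibliographic.example.org/sparql> {
        ?work wdt:P50 wd:Q42 .   # P50 = author
      }
    }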

WDQS query lag vs. updates: how to balance them?

They identified the need for a "scaling strategy", i.e. a plan / roadmap for what to scale and when, how, and for which purposes.

WDQS User Survey: profiling users via a questionnaire, including what they consider most important.

Timeout length would be the most impactful factor.

A good share of users come from academia, followed by corporations.

Discusses "freshness" of data. Many users do not need real-time updates, although when asked they say they want them. (Want vs. need, ideal vs. essential.)

An alternative is "async results": a design that allows more time per query (fewer timeouts), even if recent edits are not included in the answer set.
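
Purely as a sketch of the "async results" idea (no such WDQS API exists; the endpoint and field names below are invented for illustration):

    import time
    import requests

    QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

    # Hypothetical async pattern: submit the query, then poll for the result,
    # instead of holding one HTTP request open until a ~60 s timeout fires.
    job = requests.post("https://query.example.org/sparql/async",  # invented endpoint
                        data={"query": QUERY}).json()

    while True:
        status = requests.get("https://query.example.org/sparql/async/" + job["id"]).json()
        if status["state"] == "done":
            print(status["results"])
            break
        time.sleep(5)  # answers may lag behind recent edits, which many users can tolerate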

Users are "satisfied" with SPARQL.

From the chat box: the underlying data is stored in B-tree indexes of subject-object-predicate triples, plus size optimizations (details unclear).
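
A toy sketch of that idea: a triple store keeps several sorted (B-tree-like) permutations of each triple, so that any lookup pattern can be answered with a range scan. Blazegraph's real index layout is more involved; this only shows the principle:

    from bisect import bisect_left, insort

    class ToyTripleStore:
        """Sorted permutations of triples, standing in for B-tree indexes."""
        def __init__(self):
            self.spo, self.pos, self.osp = [], [], []

        def add(self, s, p, o):
            insort(self.spo, (s, p, o))
            insort(self.pos, (p, o, s))
            insort(self.osp, (o, s, p))

        def by_predicate(self, p):
            # Range scan on the POS permutation: all triples with predicate p
            i = bisect_left(self.pos, (p,))
            while i < len(self.pos) and self.pos[i][0] == p:
                p_, o, s = self.pos[i]
                yield (s, p_, o)
                i += 1

    store = ToyTripleStore()
    store.add("Q42", "P31", "Q5")    # Douglas Adams  instance-of  human
    store.add("Q64", "P31", "Q515")  # Berlin         instance-of  city
    print(list(store.by_predicate("P31")))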

Next steps:

  1. identify and evaluate new back ends (bye-bye Blazegraph)
  2. take pressure off the system by creating alternative search services (like a "REST API") that do not burden the systems as much (see the example after this list)
  3. make it easier to run other instances of Wikibase; make it easier to install and maintain
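
One lighter-weight access path already exists: entity data can be fetched directly as JSON via Special:EntityData, with no SPARQL involved (a sketch; the REST API mentioned in the talk would serve a similar purpose):

    import requests

    # Fetch one entity as JSON without touching the SPARQL endpoint
    data = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.json").json()
    print(data["entities"]["Q42"]["labels"]["en"]["value"])  # "Douglas Adams"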

Survey for user feedback at tinyurl.com/WDQSsurvey ... I suggested searching for properties by label and description (EntitySearch).
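
For reference, the existing wbsearchentities action API already covers part of that suggestion (matching properties by label); how well it handles description search is the open question:

    import requests

    # Search Wikidata properties by label via the MediaWiki action API
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities",
        "search": "population",
        "language": "en",
        "type": "property",
        "format": "json",
    }).json()
    for hit in resp["search"]:
        print(hit["id"], hit.get("label"), "-", hit.get("description", ""))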

Give feedback on new back end possibilities or criteria for them.  

- a list of criteria to evaluate graph backends for the purpose of scaling WDQS (T291207)

My comment on this item:

Consider graph databases that support RDF-star and SPARQL-star, such as RDF4J, AnzoGraph and GraphDB. These are proposed extensions to the RDF and SPARQL standards that provide a more convenient way to annotate RDF statements and to query such annotations (Wikidata qualifiers and references), bridging the gap between the RDF world and the property graph world.

See W3C Draft Community Group Report 01 July 2021
https://www.w3.org/community/rdf-dev/2021/07/02/new-public-draft-of-the-rdf-star-report/

https://rdf4j.org/documentation/programming/rdfstar/
https://graphdb.ontotext.com/enterprise/devhub/rdf-sparql-star.html
https://cambridgesemantics.com/anzo-platform/
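
To make the suggestion concrete, a sketch of a qualified statement in Turtle-star and a SPARQL-star query over it (the modeling is illustrative, not the actual Wikidata RDF mapping):

    # Turtle-star: annotate a statement (Q42 educated-at Q691283) with a qualifier
    << wd:Q42 wdt:P69 wd:Q691283 >> pq:P580 "1971"^^xsd:gYear .

    # SPARQL-star: ask when the embedded statement started (P580 = start time)
    SELECT ?start WHERE {
      << wd:Q42 wdt:P69 wd:Q691283 >> pq:P580 ?start .
    }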

My comment on this item:

Create a query set (extracted from the WDQS logs) and a subset of Wikidata's data to benchmark graph databases against, in the style of TPC benchmarks.
Ask graph database vendors to test their products and publish the results to the community.

See
http://tpc.org/
https://github.com/socialsensor/graphdb-benchmarks
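
A minimal sketch of such a harness, assuming a file with one logged query per line and any SPARQL endpoint to measure (the endpoint URL and file name are placeholders):

    import time
    import requests

    ENDPOINT = "http://localhost:9999/sparql"  # candidate backend under test

    def run(query):
        """Execute one SPARQL query and return wall-clock latency in seconds."""
        t0 = time.monotonic()
        r = requests.post(ENDPOINT, data={"query": query},
                          headers={"Accept": "application/sparql-results+json"},
                          timeout=60)
        r.raise_for_status()
        return time.monotonic() - t0

    # One query per line, extracted from the WDQS logs
    with open("wdqs_query_log.rq") as f:
        latencies = sorted(run(q) for q in f if q.strip())

    print("p50:", latencies[len(latencies) // 2])
    print("p99:", latencies[int(len(latencies) * 0.99)])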

==================================================================

The list of criteria used to choose BlazeGraph is in this spreadsheet, along with the graph DBs considered at the time (it has comments from 2015):

https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0

Databases listed in the spreadsheet (in the original, those that received no score were in italics). Some are well known, others lack any reference:

OrientDB
Titan
Neo4j
ArangoDB
GraphX
WDQ    
InfoGrid    
BlazeGraph    
Accumulo    
Postgres    
Virtuoso    
4store    
Apache Jena    
Build ElasticGraph    
Build SQL + GraphServ 

List of requirements (weights ranged from -10 to 30, and some had weight 0 (?)):

Project health (diversity of contributors)    30
Ability to maintain uptime    20
Experimentation console (repl for the graph)    10
Community Health (mailing list, irc, bugs and stuff)    10
Stability of storage layer    9
Handles ten thousand indexes natively (property values, qualifiers, etc)    9
Ability to handle lots of writes    9
Horizontal scalability    8
Query planner    8
Expressiveness of native query language    8
Maturity of distributed version    7
Simple index lookup (population = 101)    7
Indexed range lookup (population > 100)    7
Geospatial indexes (within 100 miles of the center of Berlin)    7
Traversal order rewriting    7
Top-n queries    7
Dealing with queries overusing resources (sandboxing)    6
WMF experience (aka ops/dev comfort level)    6
Data inference/materialization (automatically, with rules we can define)    6
Intersecting index lookup (population > 101, country = Germany)    5
Easy to support dump all results    5
Upstream support (bug fixes, patch reviews, etc)    5
Can expose native query language (useful because SPARQL and Cypher are powerful and people might be used to them)    4
Modularity (plug in other index stores, plugin in data types)    4
Indexes for multiple traversals    4
Online schema changes    4
Efficiently supports checking qualifiers and references (index or post filtering)    4
Fully free software (no "Enterprise" version) (stuff we need and stuff we may want to add is free software)    3
Cross-DC / multi-cluster replication    3
Vertex-centric indexes    3
Multi operation ACID    3
Easy to support query continuation (better than O(offset) paging)    3
Easy to type query language    2
Implements some standard spec (TinkerPop, something else?)    2
Packaging (deb) and puppetization    2
Well commented source (in case we have to hack on it)    2
Complex geospatial queries (find all points in polygon)    2
Full text indexes (stemming, ranking, etc)    2
JSON object querying support    0
Storage layer designed for graphs    0
Memory efficiency    0
Apache License    0
AGPL    0
GPL    0
Gremlin support    0
SPARQL support    0

Rough edges    -2
Work remaining    -5
Amount of hacking we'll have to do on the graph database layer    -10
