
New OpenLink Virtuoso hosted Wikidata Knowledge Graph

Wikidata (WD) as of December 2022

From: Kingsley Idehen <kidehen@openlinksw.com>
Subject: Announce: New OpenLink Virtuoso hosted Wikidata Knowledge Graph Release
Date: 11 January 2023 17:51:49 GMT-3
To: wikidata@lists.wikimedia.org, "public-lod@w3.org" <public-lod@w3.org>
Resent-From: public-lod@w3.org

All,

We are pleased to announce immediate availability of a new Virtuoso-hosted Wikidata instance based on the most recent datasets. This instance comprises 17 billion+ RDF triples.

Host Machine Info:

Item      Value
CPU       2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores     24
Memory    378 GB
SSD       4x Crucial M4 SSD 500 GB

Cloud related costs for a self-hosted variant, assuming:

    dedicated machine for 1 year without upfront costs
    128 GiB memory
    16 cores or more
    512GB SSD for the database
    3T outgoing internet traffic (based on our DBpedia statistics)

SPARQL Query and Full Text Search service endpoints:

    https://wikidata.demo.openlinksw.com/sparql -- SPARQL Query Services Endpoint

    https://wikidata.demo.openlinksw.com/fct -- Faceted Search & Browsing

Additional Information

    Loading the Wikidata dataset 2022/12 into Virtuoso Open Source - Announcements - OpenLink Software Community (openlinksw.com)

=============================================================

I ran the following query for "disputed by" (P1310) qualifiers on this endpoint:

SELECT (COUNT(DISTINCT ?statement) AS ?count)
WHERE
{
  ?item ?predicate ?statement.
  ?item ?predicate ?value.
  ?statement pq:P1310 ?qualivalue
}

It returned 1926 (for the December 2022 dump). WDQS returned 1936 (as of today). The kgtk dataset has 1577 (as of June 2022).

The multiple-values query can be executed by setting the timeout to 120000 (milliseconds):

SELECT distinct ?item ?predicate ?value1 ?value2
WHERE
{
# ?item wdt:P31 wd:Q5.
  ?item ?predicate ?value1.
  ?item ?predicate ?value2.
  FILTER (?value1 < ?value2).
  FILTER (?predicate not in (schema:description, rdfs:label))
}
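
The timeout setting mentioned above can also be passed when calling the endpoint programmatically. The following is a minimal sketch, assuming the endpoint accepts Virtuoso's `timeout` (in milliseconds) and `format` request parameters; the helper name `build_sparql_url` is hypothetical and the code only builds the request URL, it does not send it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the announcement above.
ENDPOINT = "https://wikidata.demo.openlinksw.com/sparql"

# The multiple-values query from this post (commented-out line omitted).
MULTIPLE_VALUES_QUERY = """
SELECT DISTINCT ?item ?predicate ?value1 ?value2
WHERE
{
  ?item ?predicate ?value1.
  ?item ?predicate ?value2.
  FILTER (?value1 < ?value2).
  FILTER (?predicate NOT IN (schema:description, rdfs:label))
}
"""


def build_sparql_url(query, timeout_ms=120000, endpoint=ENDPOINT):
    """Build a GET URL for the endpoint (hypothetical helper).

    Assumption: `timeout` is interpreted in milliseconds and `format`
    selects the result serialization, as on Virtuoso's /sparql page.
    """
    params = {
        "query": query,
        "format": "application/sparql-results+json",
        "timeout": str(timeout_ms),
    }
    return f"{endpoint}?{urlencode(params)}"


url = build_sparql_url(MULTIPLE_VALUES_QUERY)
```

The resulting `url` can then be fetched with any HTTP client; keeping the timeout as an explicit argument makes it easy to retry the heavier variants of the query with a larger value.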

But when I tried adding a few more filters to remove the reification-related triples, it started timing out:

SELECT distinct ?item ?predicate ?value1 ?value2
WHERE
{
# ?item wdt:P31 wd:Q5.
  ?item ?predicate ?value1.
  ?item ?predicate ?value2.
  FILTER (?value1 < ?value2).
  FILTER (strstarts(str(?item), 'http://www.wikidata.org/entity/Q')).
  FILTER (str(?predicate) not in ('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'))
}

I was able to run the query below to check the counts for the top 10 most-used qualifiers over the complete dataset:

SELECT ?qualifier (COUNT(DISTINCT ?statement) AS ?c_quali)
WHERE
{
  ?statement ?qualifier ?qualivalue.
  FILTER (?qualifier in (pq:P407, pq:P577, pq:P304, pq:P478, pq:P291, pq:P2093, pq:P1476, pq:P813, pq:P1343, pq:P958))
}
GROUP BY ?qualifier

But I found some of the counts very different. Could the removal of scientific articles justify this difference?

Virtuoso                                                 kgtk
Qualifier                                       Count    Qualifier  Count
http://www.wikidata.org/prop/qualifier/P407     1410475  P407       1242876
http://www.wikidata.org/prop/qualifier/P577     1003312  P577       537468
http://www.wikidata.org/prop/qualifier/P304     841445   P304       441380
http://www.wikidata.org/prop/qualifier/P478     513899   P478       187030
http://www.wikidata.org/prop/qualifier/P2093    370076   P2093      98772
http://www.wikidata.org/prop/qualifier/P291     113265   P291       105546
http://www.wikidata.org/prop/qualifier/P958     108660   P958       45364
http://www.wikidata.org/prop/qualifier/P1476    98247    P1476      92759
http://www.wikidata.org/prop/qualifier/P813     79584    P813       68490
http://www.wikidata.org/prop/qualifier/P1343    36831    P1343      50777
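
To see which qualifiers diverge the most between the two datasets, the counts above can be turned into ratios. This is only a sketch for inspecting the numbers reported in this post; the variable names are illustrative, and the ratios by themselves do not show what causes the differences.

```python
# Counts copied from the comparison above: (Virtuoso, kgtk) per qualifier.
counts = {
    "P407": (1410475, 1242876),
    "P577": (1003312, 537468),
    "P304": (841445, 441380),
    "P478": (513899, 187030),
    "P2093": (370076, 98772),
    "P291": (113265, 105546),
    "P958": (108660, 45364),
    "P1476": (98247, 92759),
    "P813": (79584, 68490),
    "P1343": (36831, 50777),
}

# Ratio > 1 means the Virtuoso instance has more statements with that qualifier.
ratios = {q: virtuoso / kgtk for q, (virtuoso, kgtk) in counts.items()}

# Print qualifiers from the largest to the smallest Virtuoso/kgtk ratio.
for qualifier, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{qualifier:6s} {ratio:.2f}")
```

Sorting by ratio separates the qualifiers whose counts are close in both datasets from those that differ by a large factor, which is a starting point for checking the scientific-articles hypothesis.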



