
Posts

Showing posts from January, 2023

WD References and Provenance Context

1) Extract references from WD (not present in the kgtk dataset)

There are specific triples that represent references in WD:

    ?statement prov:wasDerivedFrom ?ref .

I downloaded the WD reference data on 2023-01-24 with wdq:

(base) root@vm096:/home/cloud-di# ls -laht /app/wdq/data/ref*
-rw-r--r-- 1 root root 7.1G Jan 25 01:12 /app/wdq/data/references.tsv

Then converted it to N-Triples and imported it into kgtk:

sed -i '/node1/d' /app/kgtk/data/WD5/wdq_references.tsv
sed 's/$/ ./' /app/kgtk/data/WD5/wdq_references.tsv > /app/kgtk/data/WD5/wdq_references.nt
nohup kgtk --debug import-ntriples --verbose --validate=True \
     -i /app/kgtk/data/WD5/wdq_references.nt \
     -o /app/kgtk/data/WD5/references.tsv.gz \
     --reject-file /app/kgtk/data/WD5/reject-references.tsv.gz &

2) Statistics

35,670,197 triples were retrieved, making up 13,614,241 references associated with 31,601,286 statements. These triples use 5,080 distinct properties in the references.
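The statistics above (triples, distinct references, distinct reference properties) can be derived by streaming the resulting KGTK edge file. A minimal sketch, assuming the standard three-column node1/label/node2 layout (the sample data below is hypothetical):

```python
import csv
import io

def reference_stats(tsv_stream):
    """Count triples, distinct references (node1), and distinct
    reference properties (label) in a KGTK edge file."""
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    props, refs, triples = set(), set(), 0
    for row in reader:
        triples += 1
        refs.add(row["node1"])
        props.add(row["label"])
    return triples, len(refs), len(props)

# Toy sample mimicking two references sharing a property
sample = io.StringIO(
    "node1\tlabel\tnode2\n"
    "ref1\tpr:P248\tQ1\n"
    "ref1\tpr:P813\t^2023-01-24\n"
    "ref2\tpr:P248\tQ2\n"
)
print(reference_stats(sample))  # (3, 2, 2)
```

On the real file this would be fed from `zcat references.tsv.gz`; the set of references can grow large, so a disk-backed `sort | uniq | wc -l` is the more scalable route.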

Downloading Wikidata data using the WDQS service and the wdq package

GitHub -> https://github.com/nichtich/wdq#readme

Steps (example Python program, pr_list.py):

import os
import datetime
import time

# Build the base query (and test it on WDQS)
query_base = """PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT (?statement as ?node1) (?pr_pred as ?label) (?ref as ?node2)
WHERE {
   ?ref ?pr_pred ?pr_obj .
   ?statement prov:wasDerivedFrom ?ref .
}"""

f1 = open("/home/cloud-di/pr_list.txt", 'r', encoding="utf8")
pr_list = f1.readlines()
f2 = open("/home/cloud-di/pr_list_v2.sh", mode="w", encoding="utf-8")
for pr_item in pr_list:
    # Replace the variable with a constant to avoid timeout problems
    pr_pred = pr_item.replace("http://www.wikidata.org/prop/reference/", "pr:").strip('"\n')
    query_exec = query_base.replace("?pr_pred", pr_pred)
#    print(query_exec)
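For illustration, substituting one property into the template yields a per-property query that WDQS can answer without timing out. A sketch using a hypothetical input line for pr:P248:

```python
query_base = """PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT (?statement as ?node1) (?pr_pred as ?label) (?ref as ?node2)
WHERE {
   ?ref ?pr_pred ?pr_obj .
   ?statement prov:wasDerivedFrom ?ref .
}"""

# One line of pr_list.txt as it would arrive: quoted full IRI plus newline
pr_item = '"http://www.wikidata.org/prop/reference/P248"\n'

# Shorten the IRI to a prefixed name and drop quotes/newline
pr_pred = pr_item.replace("http://www.wikidata.org/prop/reference/", "pr:").strip('"\n')
query_exec = query_base.replace("?pr_pred", pr_pred)

print(pr_pred)                          # pr:P248
print("?ref pr:P248 ?pr_obj" in query_exec)  # True
```

Note the substitution also rewrites `(?pr_pred as ?label)` in the SELECT clause to `(pr:P248 as ?label)`, which is still valid SPARQL (a constant expression bound to ?label).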

WD Property constraints - Required qualifiers

Survey of constraints in WD

Required qualifier constraint (Q21510856):

(i) Specifies that some qualifier is required for this property. For example, head of state (P35) statements should always have a start time (P580) qualifier. This constraint can only be checked on the main value of a statement; if any other constraint scope (P4680) is specified, an error is reported.

(ii) This constraint has one mandatory parameter: property (P2306). It contains the required qualifier and must contain exactly one property. To add multiple required qualifiers, add multiple constraints of this type. constraint status (P2316): qualifier to define a property constraint in combination with P2302. Use values "mandatory constraint" (Q21502408) or "suggestion constraint".

(iii) Source: https://www.wikidata.org/wiki/Help:Property_constraints_portal/Required_qualifiers

Examples

head of government (P6) - mandatory required qualifier constraint
    start time (P580)
    start period
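On the data side, checking this constraint amounts to verifying that each statement of the property carries the required qualifier properties. A minimal sketch over toy data (real checks would read the KGTK qualifier edges for each statement):

```python
def missing_required_qualifiers(statement_qualifiers, required):
    """Return the required qualifier properties absent from a statement.

    statement_qualifiers: set of qualifier property IDs on the statement
    required: list of property IDs demanded by required-qualifier constraints
    """
    return sorted(set(required) - set(statement_qualifiers))

# head of state (P35) statements should carry start time (P580)
print(missing_required_qualifiers({"P580", "P582"}, ["P580"]))  # []
print(missing_required_qualifiers({"P582"}, ["P580"]))          # ['P580']
```

Since each constraint of this type holds exactly one property (P2306), multiple required qualifiers arrive as multiple constraints, which is why `required` is a list collected across them.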

New OpenLink Virtuoso hosted Wikidata Knowledge Graph

WD from December 2022

From: Kingsley Idehen <kidehen@openlinksw.com>
Subject: Announce: New OpenLink Virtuoso hosted Wikidata Knowledge Graph Release
Date: 11 January 2023 17:51:49 GMT-3
To: wikidata@lists.wikimedia.org, "public-lod@w3.org" <public-lod@w3.org>
Resent-From: public-lod@w3.org

All,

We are pleased to announce the immediate availability of a new Virtuoso-hosted Wikidata instance based on the most recent datasets. This instance comprises 17 billion+ RDF triples.

Host Machine Info:

Item      Value
CPU       2x Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
Cores     24
Memory    378 GB
SSD       4x Crucial M4 SSD 500 GB

Cloud-related costs for a self-hosted variant, assuming:
    dedicated machine for 1 year without upfront costs
    128 GiB memory
    16 cores or more
    512 GB SSD for the database
    3T outgoing internet traffic (based on our DBpedia statistics)

SPARQL Query and Full Text Search service endpoints:
    https://wikidata.demo.openlinksw.com/spa

Advisor Meeting - 04/01/2023

TO-DO

1) Do instances of a class "Disputed X" (such as territory) have predicate 1310?
   Survey cases of "Disputed" in the class label (must not be an instance) with kypher.
   Survey "disputed by" statements with only one value.
   It would be a modeling problem if the semantics lives only in the label.
   Without qualifiers, involving more than one declared sex and color/race.

2) Survey an example of population X historical series with kypher, for temporal context.

3) Develop analyses of integrity constraints (intended semantics):
   a) Model-driven
      Constraints in WD
      Missing constraints >> would be a modeling problem
   b) Data-driven
      Infer the intensional from the extensional (abductive reasoning)
      e.g. if 80% of a property's statements are single-valued
      e.g. if 80% of a property's statements have a qualifier
   The report page of constraint violations in WD -> https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/Summary

4) Elaborate definitions of the thesis concepts: BEST POSSIBLE ANSWER or BETTER
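The data-driven heuristic in item 3b (e.g. "treat a property as single-valued if at least 80% of its subjects have exactly one value") can be sketched as follows; the edge tuples below are hypothetical toy data:

```python
from collections import defaultdict

def single_valued_ratio(edges, prop):
    """Fraction of subjects whose statements for `prop` carry exactly one value."""
    values = defaultdict(set)
    for subj, pred, obj in edges:
        if pred == prop:
            values[subj].add(obj)
    if not values:
        return 0.0
    single = sum(1 for vs in values.values() if len(vs) == 1)
    return single / len(values)

# Q1 and Q2 have one value each; Q3 has two conflicting values
edges = [("Q1", "P21", "male"), ("Q2", "P21", "female"),
         ("Q3", "P21", "male"), ("Q3", "P21", "female")]
print(single_valued_ratio(edges, "P21"))  # 2 of 3 subjects -> ~0.667
```

A ratio above the chosen threshold (80% here) would support abducing an intended single-value constraint even where none is declared in WD.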

Disputes and Ranking in WD - statistics

WD from June 2022

Number and % of "disputed by" statements:
559,038,971 CLAIMS
1,577 disputed by (0.00028%)

Command:
(base) root@vm096:/app/kgtk/temp# zcat /app/kgtk/data/my-tsv/disputedBy-claims-sorted.tsv.gz | wc -l
1,578

TOP 10 PROPERTIES (Disputed By):
(base) root@vm096:/home/cloud-di# kgtk sort -i /app/kgtk/data/my-tsv/disputedBy-claims-pred-count-label.tsv -c node2 --reverse-columns node2 --numeric-columns node2 / head

node1   label   node2   node1;distribution      node1;label
P17     count   561     35.5739   'country'@en
P3355   count   186     11.7945   'negative therapeutic predictor'@en
P3354   count   140     8.8776    'positive therapeutic predictor'@en
P131    count   106     6.7216    'located in the administrative territorial entity'@en
P31     count   78      4.9461    'instance of'@en
P460    count   43      2.7267    'said to be the same as'@en
P3359   count   29      1.8389    'negative prognostic predict
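The distribution column above follows directly from the per-property counts over the 1,577 "disputed by" statements. A minimal sketch reproducing the first rows of the table:

```python
def distribution(counts, total):
    """Given {property: count} and the grand total, return
    (property, count, % of total) rows sorted by count, descending."""
    rows = [(prop, c, round(100 * c / total, 4)) for prop, c in counts.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)

# First three properties from the table, total = 1,577 disputed-by statements
counts = {"P17": 561, "P3355": 186, "P3354": 140}
for row in distribution(counts, total=1577):
    print(row)
# ('P17', 561, 35.5739)
# ('P3355', 186, 11.7945)
# ('P3354', 140, 8.8776)
```

The printed percentages match the node1;distribution column, confirming the total used was 1,577 (the file's 1,578 lines minus the header).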

Controversies and Inconsistencies in WD - statistics

WD from June 2022

Number and % of controversial statements:
559,038,971 CLAIMS
132,552,453 potentially controversial (23.71%)

(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/wikidata/claims.tsv.gz | wc -l
559,038,972
(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/my-tsv/filtered-claims-sorted-uniq.tsv.gz | wc -l
132,552,454

Number and % of controversial properties:
9,653 PROPERTIES (All)
2,143 potentially controversial (22.20%)

Commands:
(base) root@vm096:/app/kgtk/temp# cat /app/kgtk/data/my-tsv/all-claims-pred-counted.tsv | wc -l
9654
(base) root@vm096:/app/kgtk/temp# cat /app/kgtk/data/my-tsv/filtered-pred-count-sorted.tsv | wc -l
2144

TOP 10 PROPERTIES (All):
(base) root@vm096:/app/kgtk/temp# kgtk sort -i /app/kgtk/data/my-tsv/all-claims-pred-counted.tsv -c node2 --reverse-columns node2 --numeric-columns node2 / head

node1   label   node2
P31     count   59717980
P1215   count   33122376
P528    count   28738709
P17     count   14996553

Controversial filter producing a file larger than the original (CLAIMS)

kgtk --debug query -i $GRAPH_CLAIMS --as c --index none --multi 2 \
  --match 'c: (item)-[p1]->(value1 {wikidatatype: dt}), (item)-[p2]->(value2)' \
  --where 'value1 < value2 and p1.label = p2.label and dt != "external-id" and dt != "wikibase-property"' \
  --return 'distinct p1, item, p1.label, value1, p2, item, p2.label, value2' \
  -o /app/kgtk/data/my-tsv/filtered-claims.tsv.gz >> /app/kgtk/temp/kgtk_full.log 2>&1

Input:
(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/wikidata/claims.tsv.gz | wc -l
559,038,972

Output:
(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/my-tsv/filtered-claims.tsv.gz | wc -l
2,137,141,825

The filtered set is about 4x larger than the original because the query returns every pair of values. For an original set {e1,e2,e3,e4} it emits the pairs
e1,e2 / e1,e3 / e1,e4 / e2,e3 / e2,e4 / e3,e4
so the filtered output contains {e1,e2,e1,e3,e1,e4,e2,e3,e2,e4,e3,e4} when we should have only {e1,e2,e3,e4}.

Run a sort unique to remove the duplicates:
zcat /app/kgtk/data/my-tsv/filtered-claims.tsv.gz
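The blow-up is quadratic in the number of values per (item, property): n values yield n·(n-1)/2 unordered pairs, and each output row mentions two claims. A toy sketch of the self-join effect (not the actual kypher execution):

```python
from itertools import combinations

def filtered_rows(values):
    """Mimic the pairwise self-join: one row per unordered pair of
    distinct values, as produced by the value1 < value2 condition."""
    return list(combinations(sorted(values), 2))

values = ["e1", "e2", "e3", "e4"]
rows = filtered_rows(values)
print(len(rows))  # 6 pairs -> 12 claim mentions, versus 4 original claims

# Deduplicating (the sort-unique step) recovers the original claims
distinct = sorted({v for pair in rows for v in pair})
print(distinct)   # ['e1', 'e2', 'e3', 'e4']
```

With many multi-valued properties the average blow-up lands near the observed ~3.8x (2,137,141,825 / 559,038,972), which is why the sort-unique pass is needed before counting controversial claims.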