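# Find pairs of claims on the same item that share a property (p1.label = p2.label)
# but have different values, skipping external-id and wikibase-property datatypes.
# value1 < value2 keeps each unordered pair only once; --multi 2 splits each
# 8-column result row into two 4-column output rows, one per claim.
kgtk --debug query -i $GRAPH_CLAIMS --as c --index none --multi 2 \
--match 'c: (item)-[p1]->(value1 {wikidatatype: dt}), (item)-[p2]->(value2)' \
--where 'value1 < value2 and p1.label = p2.label and dt != "external-id" and dt != "wikibase-property"' \
--return 'distinct p1, item, p1.label, value1, p2, item, p2.label, value2' \
-o /app/kgtk/data/my-tsv/filtered-claims.tsv.gz >> /app/kgtk/temp/kgtk_full.log 2>&1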
Input
(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/wikidata/claims.tsv.gz | wc -l
559,038,972
Output
(base) root@vm096:/home/cloud-di# zcat /app/kgtk/data/my-tsv/filtered-claims.tsv.gz | wc -l
2,137,141,825
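The filtered set is roughly 4x larger than the original (about 3.8x) because every claim is written once for each matched pair it takes part in. For example:
Original {e1,e2,e3,e4,e5}
Matched pairs:
e1, e2
e1, e3
e1, e4
e2, e3
e2, e4
e3, e4
Filtered {e1,e2,e1,e3,e1,e4,e2,e3,e2,e4,e3,e4}
When we should have {e1,e2,e3,e4}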
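The toy example above can be checked on the command line. This is only a minimal sketch: the names `pairs`, `rows`, `total` and `unique` are hypothetical, and the data is the six-pair example, not the real claims file.

```shell
# Hypothetical toy data: the six matched pairs from the example above.
pairs='e1 e2
e1 e3
e1 e4
e2 e3
e2 e4
e3 e4'

# Write each pair as two rows, one per claim, mimicking the filtered file.
rows=$(printf '%s\n' "$pairs" | tr ' ' '\n')

# Count all rows, then count the distinct claims left after sort -u.
total=$(printf '%s\n' "$rows" | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$rows" | sort -u | wc -l | tr -d ' ')

echo "rows: $total, unique claims: $unique"
```

Twelve rows collapse to four distinct claims, which is the same effect expected from deduplicating the real output.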
Run sort -u to remove the duplicates
zcat /app/kgtk/data/my-tsv/filtered-claims.tsv.gz | sort -u > /app/kgtk/data/my-tsv/filtered-claims-sorted-uniq.tsv
New output
(base) root@vm096:/home/cloud-di# more /app/kgtk/data/my-tsv/filtered-claims-sorted-uniq.tsv | wc -l
132,552,454
The percentage of potentially controversial claims is 23.71% (132,552,454 of 559,038,972).
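As a sanity check, the percentage can be recomputed from the two line counts reported above. This is a small sketch; the variable names are hypothetical and the counts are copied from the `wc -l` outputs with the thousands separators removed.

```shell
# Line counts reported above (thousands separators removed).
total_claims=559038972
controversial=132552454

# Share of deduplicated, potentially controversial claims in the original set.
pct=$(awk -v c="$controversial" -v t="$total_claims" 'BEGIN { printf "%.2f", 100 * c / t }')

echo "potentially controversial: ${pct}%"
```

Both counts come from `wc -l`, so each may include a header line; at this scale that does not change the rounded result.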