
KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis - Article Reading Notes

Difficulties of the RDF model

The RDF/SPARQL-centric toolset for operating on KGs at scale is heterogeneous, difficult to integrate, and covers only a subset of the operations commonly needed in data science applications.

The recent developments towards supporting triple annotations with RDF* [9] provide support for qualifiers, however this format is still in its infancy and we expect it to inherit the challenges of RDF. 

RDF* is a generalization of RDF that allows using triples in the subject of triples. In KGTK, the same effect is achieved by using the identifier of an edge as the node1 of an edge. KGTK is more flexible in that identifiers of edges can also be used in the node2 position. Furthermore, in KGTK it is possible to define two edges with identical node1/label/node2 values but different identifiers, making it possible to associate different sets of secondary edges with the same subject/predicate/object triple. This is useful in cases where the same subject/predicate/object triple has different provenance information.
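A minimal sketch of this idea, using an illustrative (not official) KGTK edge file: two edges share the same node1/label/node2 triple but carry distinct identifiers, and qualifier edges attach provenance to each identifier separately. The Q/P identifiers and the `source` qualifier values are hypothetical.

```python
import csv
import io

# Hypothetical KGTK edge file: e1 and e2 assert the same triple,
# but each edge id receives its own provenance qualifier (e3, e4),
# with the edge ids e1/e2 used in the node1 position.
kgtk_tsv = """id	node1	label	node2
e1	Q42	P69	Q691283
e2	Q42	P69	Q691283
e3	e1	source	wikipedia
e4	e2	source	library_catalog
"""

rows = list(csv.DictReader(io.StringIO(kgtk_tsv), delimiter="\t"))
# Qualifier edges are those whose node1 is itself an edge identifier.
qualifiers = [r for r in rows if r["node1"] in {"e1", "e2"}]
for q in qualifiers:
    print(q["node1"], q["label"], q["node2"])
```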

Technologies supporting KG manipulation and analysis

  1. Graph databases such as RDF triple stores and Neo4J
  2. Tools for operating on RDF such as graphy and RDFlib
  3. Entity linking tools such as WAT or BLINK
  4. Entity resolution tools such as MinHash-LSH or MFIBlocks
  5. Libraries to compute graph embeddings such as PyTorch-BigGraph
  6. Libraries for graph analytics, such as graph-tool and NetworkX.

Scikit-learn and SpaCy: two popular toolkits for machine learning and natural language processing

File format

KGTK uses a tab-separated column-based text format to describe any attributed, labeled or unlabeled hypergraph. The first line of a KGTK file declares the headers to be used in the document. The reserved words node1, label and node2 are used to describe the subject, property and object being described, while creator and source are optional qualifiers for each statement that provide additional provenance information about the creator of a statement and the original source.
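An illustrative file in this format, parsed with the standard library; the entity identifiers and the creator/source values below are made up for the example.

```python
import csv
import io

# Illustrative KGTK file: node1/label/node2 are the reserved core
# columns; creator and source are optional per-statement qualifiers.
data = """node1	label	node2	creator	source
Q42	P31	Q5	alice	wikidata_dump
Q42	P69	Q691283	bob	wikipedia
"""

rows = list(csv.DictReader(io.StringIO(data), delimiter="\t"))
for edge in rows:
    print(edge["node1"], edge["label"], edge["node2"], "| source:", edge["source"])
```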

<<This resembles Quads, in that triples are identified and can themselves appear as the subject or object of other triples>>

In KGTK the fourth element is an identifier for an edge (every edge has a unique identifier). The KGTK data model is significantly more flexible as it is possible to associate edges with multiple graphs by using multiple edges on edges.

Export to N-Triples and LPG

Export modules to transform KGTK format into diverse standard and commonly used formats, such as RDF (N-Triples), property graphs in Neo4J format, and GML to invoke tools such as graph-tool or Gephi.
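A sketch of the N-Triples direction only (not KGTK's actual exporter): mapping node1/label/node2 rows to N-Triples lines under an assumed `http://example.org/` namespace.

```python
# Hypothetical edges; the namespace is an assumption for illustration,
# not the prefix KGTK's RDF exporter actually uses.
edges = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P69", "Q691283"),
]

def to_ntriples(node1, label, node2, ns="http://example.org/"):
    # One N-Triples statement per KGTK edge, terminated by " .".
    return f"<{ns}{node1}> <{ns}{label}> <{ns}{node2}> ."

lines = [to_ntriples(*e) for e in edges]
print("\n".join(lines))
```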

Edge filtering based on nodes and/or properties (subset)

The filter operation selects edges from a KGTK file by specifying constraints (“patterns”) on the values for node1, label and node2. The pattern language, inspired by graphy.js, has the following form: “subject-pattern ; predicate-pattern ; object-pattern”. The common query of retrieving edges for all humans from Wikidata corresponds to the filter “ ; P31 ; Q5”.
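The filter semantics can be sketched as follows (this is an illustration of the pattern language, not KGTK's implementation): an empty component acts as a wildcard, so " ; P31 ; Q5" keeps every edge whose property is P31 and whose object is Q5.

```python
# Hypothetical edge list for the example.
edges = [
    ("Q42", "P31", "Q5"),
    ("Q42", "P69", "Q691283"),
    ("Q64", "P31", "Q515"),
]

def kgtk_filter(edges, pattern):
    # Split "subject ; predicate ; object"; an empty part matches anything.
    parts = [p.strip() for p in pattern.split(";")]
    subj, pred, obj = (set(p.split(",")) if p else None for p in parts)
    return [
        (s, p, o) for s, p, o in edges
        if (subj is None or s in subj)
        and (pred is None or p in pred)
        and (obj is None or o in obj)
    ]

humans = kgtk_filter(edges, " ; P31 ; Q5")  # all instance-of-human edges
print(humans)
```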

Graph join based on nodes and/or properties (subset)

The join operation joins two KGTK files. Inner join, left outer join, right outer join, and full outer join are all supported. When a join takes place, the columns from the two files are merged into the set of columns for the output file. By default, KGTK joins on the node1 column, although it can be configured to join by edge id. KGTK also allows the label and node2 columns to be added to the join.
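A minimal sketch of the default case, an inner join on the node1 column over two edge lists already loaded as dicts; the column-prefixing scheme is an assumption for the example, not KGTK's actual merge rule.

```python
# Hypothetical left and right edge files.
left = [
    {"node1": "Q42", "label": "P31", "node2": "Q5"},
    {"node1": "Q64", "label": "P31", "node2": "Q515"},
]
right = [
    {"node1": "Q42", "label": "P69", "node2": "Q691283"},
]

def inner_join_on_node1(left, right):
    # Index the right side by node1, then merge matching rows.
    index = {}
    for r in right:
        index.setdefault(r["node1"], []).append(r)
    out = []
    for l in left:
        for r in index.get(l["node1"], []):
            merged = dict(l)
            # Prefix right-hand columns to avoid name clashes.
            merged.update({f"right.{k}": v for k, v in r.items()})
            out.append(merged)
    return out

joined = inner_join_on_node1(left, right)
print(joined)
```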

Path discovery between nodes

Reachable nodes: given a set of nodes N and a set of properties P, this operation computes the set of reachable nodes R that contains the nodes that can be reached from a node n ∈ N via paths containing any of the properties in P. This operation can be seen as a (joint) closure computation over one or multiple properties for a predefined set of nodes.
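The closure above is essentially a BFS restricted to edges whose property is in P; a sketch over a hypothetical subclass-of chain (P279):

```python
from collections import deque

# Hypothetical edges: two P279 (subclass-of) hops and one P31 edge
# that should be ignored when P = {"P279"}.
edges = [
    ("Q1", "P279", "Q2"),
    ("Q2", "P279", "Q3"),
    ("Q2", "P31", "Q4"),
]

def reachable(edges, seeds, props):
    # Adjacency restricted to the allowed properties.
    adj = {}
    for s, p, o in edges:
        if p in props:
            adj.setdefault(s, []).append(o)
    seen, queue = set(), deque(seeds)
    while queue:
        n = queue.popleft()
        for m in adj.get(n, []):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

result = reachable(edges, {"Q1"}, {"P279"})
print(result)
```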

Connected components

The connected components operation finds all connected components (communities) in a graph.
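Component discovery can be sketched with union-find over an undirected view of the edges (an illustration, not KGTK's implementation, which builds on graph libraries):

```python
from collections import defaultdict

# Hypothetical edges forming two components: {a, b, c} and {x, y}.
edges = [("a", "knows", "b"), ("b", "knows", "c"), ("x", "knows", "y")]

parent = {}

def find(n):
    parent.setdefault(n, n)
    while parent[n] != n:
        parent[n] = parent[parent[n]]  # path halving
        n = parent[n]
    return n

def union(a, b):
    parent[find(a)] = find(b)

# Treat each edge as undirected for component purposes.
for s, _label, o in edges:
    union(s, o)

components = defaultdict(set)
for n in list(parent):
    components[find(n)].add(n)
print([sorted(c) for c in components.values()])
```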

Embeddings

Text-based graph embeddings using state-of-the-art language models: RoBERTa [13], BERT [5], and DistilBERT [17]. 

The text embeddings operation computes embeddings for all nodes in a graph by computing a sentence embedding over a lexicalization of the neighborhood of each node. The lexicalized sentence is created based on a template whose simplified version is:

{label-properties}, {description-properties} 
is a {isa-properties},
has {has-properties}, 
and {properties:values}. 

The labels (properties) to be used for label-properties, description-properties, isa-properties, has-properties, and property-value pairs are specified as input arguments to the operation. Computing similarity between such entity embeddings is a standard component of modern decision-making systems such as entity linking, question answering, or table understanding.
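A sketch of filling the simplified template; the node dictionary and property values below are illustrative, and the resulting sentence would then be fed to a language model (e.g. BERT) to obtain the node embedding.

```python
# Hypothetical lexicalization input for one node.
node = {
    "label": "Douglas Adams",
    "description": "English writer",
    "isa": ["human"],
    "has": ["occupation writer"],
    "properties": [("educated at", "St John's College")],
}

def lexicalize(n):
    # Fill the template:
    # {label-properties}, {description-properties} is a {isa-properties},
    # has {has-properties}, and {properties:values}.
    props = ", ".join(f"{p} {v}" for p, v in n["properties"])
    return (
        f"{n['label']}, {n['description']} "
        f"is a {', '.join(n['isa'])}, "
        f"has {', '.join(n['has'])}, "
        f"and {props}."
    )

sentence = lexicalize(node)
print(sentence)
```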

Centrality, PageRank

The graph statistics operation computes various graph statistics and centrality metrics. It computes a graph summary, containing its number of nodes, edges, and most common relations. In addition, it can compute graph degrees, HITS centrality and PageRank values. Aggregated statistics (minimum, maximum, average, and top nodes) for these connectivity/centrality metrics are included in the summary, whereas the individual values for each node are represented as edges in the resulting graph. The graph is assumed to be directed, unless indicated differently.
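As a reminder of what the PageRank metric computes, a power-iteration sketch over a small directed edge list (KGTK itself delegates such computations to graph libraries; this only illustrates the metric, with a made-up graph):

```python
# Hypothetical directed edges; every node has out-degree >= 1,
# so no dangling-node correction is needed in this sketch.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
nodes = sorted({n for e in edges for n in e})

out_deg = {n: 0 for n in nodes}
for s, _t in edges:
    out_deg[s] += 1

damping = 0.85
rank = {n: 1 / len(nodes) for n in nodes}
for _ in range(50):  # power iteration until (approximate) convergence
    new = {n: (1 - damping) / len(nodes) for n in nodes}
    for s, t in edges:
        new[t] += damping * rank[s] / out_deg[s]
    rank = new

print({n: round(r, 3) for n, r in rank.items()})
```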

