Difficulties of the RDF model
The RDF/SPARQL-centric toolset for operating on KGs at scale is heterogeneous, difficult to integrate, and covers only a subset of the operations commonly needed in data science applications.
The recent developments towards supporting triple annotations with RDF* [9] provide support for qualifiers; however, this format is still in its infancy and we expect it to inherit the challenges of RDF.
RDF* is a generalization of RDF that allows using triples in the subject position of triples. In KGTK, the same effect is achieved by using the identifier of an edge as the node1 of another edge. KGTK is more flexible in that identifiers of edges can also be used in the node2 position. Furthermore, in KGTK it is possible to define two edges with identical node1/label/node2 values but different identifiers, making it possible to associate different sets of secondary edges with the same subject/predicate/object triple. This is useful in cases where the same subject/predicate/object triple has different provenance information.
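As a sketch of this data model (plain Python, not the KGTK API): every edge carries an identifier, and qualifier edges use that identifier in their node1 position, so two assertions of the same triple can carry different provenance. The ids and values below are hypothetical.

```python
# Each row is (id, node1, label, node2). Two edges share the same
# node1/label/node2 but have distinct identifiers, so each can get
# its own provenance qualifiers.
edges = [
    ("e1", "Q42", "P69", "Q691283"),   # first assertion of the triple
    ("e2", "Q42", "P69", "Q691283"),   # same triple, second assertion
    # Qualifier edges: node1 is the *identifier* of another edge.
    ("e3", "e1", "source", "wikipedia"),
    ("e4", "e2", "source", "freebase"),
]

def qualifiers(edge_id, edges):
    """Return the secondary (label, node2) pairs attached to an edge id."""
    return [(label, n2) for (_, n1, label, n2) in edges if n1 == edge_id]

print(qualifiers("e1", edges))  # provenance of the first assertion
print(qualifiers("e2", edges))  # provenance of the second assertion
```

The same mechanism would let an edge identifier appear in the node2 position, which RDF* does not allow.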
Technologies supporting KG manipulation and analysis
- Graph databases such as RDF triple stores and Neo4J
- Tools for operating on RDF such as graphy and RDFlib
- entity linking tools such as WAT or BLINK
- entity resolution tools such as MinHash-LSH or MFIBlocks
- libraries to compute graph embeddings such as PyTorch-BigGraph
- libraries for graph analytics, such as graph-tool and NetworkX.
Scikit-learn and SpaCy: two popular toolkits for machine learning and natural language processing
File format
KGTK uses a tab-separated column-based text format to describe any attributed, labeled or unlabeled hypergraph. The first line of a KGTK file declares the headers to be used in the document. The reserved words node1, label and node2 are used to describe the subject, property and object being described, while creator and source are optional qualifiers for each statement that provide additional provenance information about the creator of a statement and the original source.
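A minimal illustration of reading this format with Python's csv module; the miniature file content below is a hypothetical example, not real Wikidata data.

```python
import csv
import io

# A tiny hypothetical KGTK file: the first line declares the columns;
# node1/label/node2 are reserved, and the extra "creator" column acts
# as a per-edge qualifier.
kgtk_text = (
    "node1\tlabel\tnode2\tcreator\n"
    "Q42\tP31\tQ5\tbot-1\n"
    "Q42\tP106\tQ36180\tbot-1\n"
)

rows = list(csv.DictReader(io.StringIO(kgtk_text), delimiter="\t"))
for r in rows:
    print(r["node1"], r["label"], r["node2"], f"(creator: {r['creator']})")
```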
<<I found this similar to RDF quads, in that triples are identified and can act as the subject or object of other triples>>
In KGTK, the fourth element is an identifier for an edge (every edge has a unique identifier). The KGTK data model is significantly more flexible, as it is possible to associate edges with multiple graphs by using multiple edges on edges.
Export to N-Triples and LPG
Export modules transform the KGTK format into diverse standard and commonly used formats, such as RDF (N-Triples), property graphs in Neo4J format, and GML to invoke tools such as graph-tool or Gephi.
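A sketch of the kind of mapping an N-Triples export performs; the URI prefix is an assumed placeholder for illustration, not KGTK's actual namespace handling.

```python
# Map one KGTK edge (node1, label, node2) to an N-Triples line.
# The base URI is a hypothetical placeholder.
def to_ntriple(node1, label, node2, base="http://example.org/"):
    return f"<{base}{node1}> <{base}{label}> <{base}{node2}> ."

print(to_ntriple("Q42", "P31", "Q5"))
```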
Edge filtering based on nodes and/or properties (subset)
The filter operation selects edges from a KGTK file by specifying constraints (“patterns”) on the values for node1, label, and node2. The pattern language, inspired by graphy.js, has the following form: “subject-pattern ; predicate-pattern ; object-pattern”. For example, the common query of retrieving all humans from Wikidata corresponds to the filter “ ; P31 ; Q5”.
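The pattern semantics can be sketched in a few lines of Python (a reimplementation for illustration, not the kgtk CLI): an empty component matches any value.

```python
def parse_pattern(pattern):
    """Split 'subject ; predicate ; object'; an empty part matches anything."""
    return [part.strip() or None for part in pattern.split(";")]

def matches(edge, pattern):
    return all(want is None or value == want
               for value, want in zip(edge, parse_pattern(pattern)))

# Hypothetical edges as (node1, label, node2) tuples.
edges = [("Q42", "P31", "Q5"), ("Q42", "P69", "Q691283"), ("Q5", "P279", "Q215627")]
humans = [e for e in edges if matches(e, " ; P31 ; Q5")]
print(humans)  # only the instance-of-human edge survives
```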
Graph join based on nodes and/or properties (subset)
The join operation will join two KGTK files. Inner join, left outer join, right outer join, and full outer join are all supported. When a join takes place, the columns from the two files are merged into the set of columns for the output file. By default, KGTK will join based on the node1 column, although it can be configured to join by edge id. KGTK also allows the label and node2 columns to be added to the join.
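A minimal sketch of the default behavior, an inner join keyed on node1, with the remaining columns of the right file merged into the output (illustrative Python, not the KGTK implementation; the column prefix is an assumption).

```python
# Hypothetical rows from two KGTK files, as dictionaries keyed by column name.
left = [{"node1": "Q42", "label": "P31", "node2": "Q5"}]
right = [{"node1": "Q42", "label": "label", "node2": "'Douglas Adams'@en"}]

def inner_join(a, b, key="node1"):
    """Inner join on `key`; right-hand columns are prefixed to avoid clashes."""
    index = {}
    for row in b:
        index.setdefault(row[key], []).append(row)
    return [{**ra, **{f"right.{k}": v for k, v in rb.items() if k != key}}
            for ra in a for rb in index.get(ra[key], [])]

joined = inner_join(left, right)
print(joined)
```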
Path discovery between nodes
Reachable nodes: given a set of nodes N and a set of properties P, this operation computes the set of reachable nodes R that contains the nodes that can be reached from a node n ∈ N via paths containing any of the properties in P. This operation can be seen as a (joint) closure computation over one or multiple properties for a predefined set of nodes.
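This closure can be sketched as a breadth-first search restricted to edges whose label is in P (illustrative code with hypothetical edges, not the KGTK implementation).

```python
from collections import deque

def reachable(edges, seeds, props):
    """Nodes reachable from `seeds` via edges whose label is in `props`."""
    adj = {}
    for n1, label, n2 in edges:
        if label in props:
            adj.setdefault(n1, []).append(n2)
    seen, queue = set(), deque(seeds)
    while queue:
        n = queue.popleft()
        for m in adj.get(n, []):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

# Toy subclass chain: the closure over P279 climbs the class hierarchy.
edges = [("Q5", "P279", "Q215627"), ("Q215627", "P279", "Q35120"),
         ("Q5", "P31", "Q16521")]
print(reachable(edges, {"Q5"}, {"P279"}))
```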
Connected components
The connected components operation finds all connected components (communities) in a graph.
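A compact way to compute this is union-find over the edge list, ignoring edge direction and labels (a sketch with toy data, not the KGTK implementation).

```python
def components(edges):
    """Group nodes of an edge list into undirected connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for n1, _, n2 in edges:
        parent[find(n1)] = find(n2)  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

edges = [("a", "p", "b"), ("b", "p", "c"), ("x", "p", "y")]
comps = components(edges)
print(comps)  # two communities: {a, b, c} and {x, y}
```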
Embeddings
KGTK computes text-based graph embeddings using state-of-the-art language models: RoBERTa [13], BERT [5], and DistilBERT [17].
The text embeddings operation computes embeddings for all nodes in a graph by computing a sentence embedding over a lexicalization of the neighborhood of each node. The lexicalized sentence is created based on a template whose simplified version is:
is a {isa-properties},
has {has-properties},
and {properties:values}.
The labels (properties) to be used for label-properties, description-properties, isa-properties, has-properties, and property-values pairs are specified as input arguments to the operation. Computing similarity between such entity embeddings is a standard component of modern decision making systems such as entity linking, question answering, or table understanding.
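The lexicalization step can be sketched as follows. The property selections here (P31 as an isa-property, P106 as a has-property) and all values are illustrative assumptions, and the actual sentence embedding with a language model is omitted.

```python
def lexicalize(node, edges, labels,
               isa_props=frozenset({"P31"}), has_props=frozenset({"P106"})):
    """Render a node's neighborhood as a sentence following the template
    '{label} is a {isa-properties}, has {has-properties}.'"""
    isa = [labels[n2] for n1, l, n2 in edges if n1 == node and l in isa_props]
    has = [labels[n2] for n1, l, n2 in edges if n1 == node and l in has_props]
    return f"{labels[node]} is a {', '.join(isa)}, has {', '.join(has)}."

labels = {"Q42": "Douglas Adams", "Q5": "human", "Q36180": "writer"}
edges = [("Q42", "P31", "Q5"), ("Q42", "P106", "Q36180")]
print(lexicalize("Q42", edges, labels))
```

The resulting sentence would then be fed to one of the language models above to obtain the node's embedding.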
Centrality, PageRank
The graph statistics operation computes various graph statistics and centrality metrics. It computes a graph summary, containing its number of nodes, edges, and most common relations. In addition, it can compute graph degrees, HITS centrality and PageRank values. Aggregated statistics (minimum, maximum, average, and top nodes) for these connectivity/centrality metrics are included in the summary, whereas the individual values for each node are represented as edges in the resulting graph. The graph is assumed to be directed, unless indicated differently.
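PageRank itself can be illustrated with a short power iteration over a toy directed graph (a sketch of the metric on hypothetical data, not KGTK's implementation).

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over (node1, label, node2) edges."""
    nodes = {n for a, _, b in edges for n in (a, b)}
    out = {n: [] for n in nodes}
    for a, _, b in edges:
        out[a].append(b)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for a in nodes:
            if out[a]:
                share = damping * rank[a] / len(out[a])
                for b in out[a]:
                    new[b] += share
            else:  # dangling node: spread its rank evenly
                for b in nodes:
                    new[b] += damping * rank[a] / len(nodes)
        rank = new
    return rank

ranks = pagerank([("a", "p", "b"), ("b", "p", "c"),
                  ("c", "p", "a"), ("d", "p", "a")])
print(max(ranks, key=ranks.get))  # "a" receives the most rank
```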
I still have not been able to generate text embeddings from the Lattes data using KGTK.