Difficulties of the RDF model
The RDF/SPARQL-centric toolset for operating on KGs at scale is heterogeneous, difficult to integrate, and covers only a subset of the operations commonly needed in data science applications.
The recent developments towards supporting triple annotations with RDF* [9] provide support for qualifiers; however, this format is still in its infancy and we expect it to inherit the challenges of RDF.
RDF* is a generalization of RDF that allows using triples in the subject position of triples. In KGTK, the same effect is achieved by using the identifier of an edge as the node1 of another edge. KGTK is more flexible in that identifiers of edges can also be used in the node2 position. Furthermore, in KGTK it is possible to define two edges with identical node1/label/node2 values but different identifiers, making it possible to associate different sets of secondary edges with the same subject/predicate/object triple. This is useful in cases where the same subject/predicate/object triple has different provenance information.
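As a sketch of this data model (plain Python, not the KGTK API): every edge carries an identifier, and qualifier edges use that identifier in their node1 position, so two assertions of the same triple can carry different provenance. The ids and values below are hypothetical.

```python
# Each row is (id, node1, label, node2). Two edges share the same
# node1/label/node2 but have distinct identifiers, so each can get
# its own provenance qualifiers.
edges = [
    ("e1", "Q42", "P69", "Q691283"),   # first assertion of the triple
    ("e2", "Q42", "P69", "Q691283"),   # same triple, second assertion
    # Qualifier edges: node1 is the *identifier* of another edge.
    ("e3", "e1", "source", "wikipedia"),
    ("e4", "e2", "source", "freebase"),
]

def qualifiers(edge_id, edges):
    """Return the secondary (label, node2) pairs attached to an edge id."""
    return [(label, n2) for (_, n1, label, n2) in edges if n1 == edge_id]

print(qualifiers("e1", edges))  # provenance of the first assertion
print(qualifiers("e2", edges))  # provenance of the second assertion
```

The same mechanism would let an edge identifier appear in the node2 position, which RDF* does not allow.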
Technologies supporting KG manipulation and analysis
- Graph databases such as RDF triple stores and Neo4J
- Tools for operating on RDF such as graphy and RDFlib
- entity linking tools such as WAT or BLINK
- entity resolution tools such as MinHash-LSH or MFIBlocks
- libraries to compute graph embeddings such as PyTorch-BigGraph
- libraries for graph analytics, such as graph-tool and NetworkX.
Scikit-learn and SpaCy: two popular toolkits for machine learning and natural language processing
File format
KGTK uses a tab-separated column-based text format to describe any attributed, labeled or unlabeled hypergraph. The first line of a KGTK file declares the headers to be used in the document. The reserved words node1, label and node2 are used to describe the subject, property and object being described, while creator and source are optional qualifiers for each statement that provide additional provenance information about the creator of a statement and the original source.
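A minimal illustration of reading this format with Python's csv module; the miniature file content below is a hypothetical example, not real Wikidata data.

```python
import csv
import io

# A tiny hypothetical KGTK file: the first line declares the columns;
# node1/label/node2 are reserved, and the extra "creator" column acts
# as a per-edge qualifier.
kgtk_text = (
    "node1\tlabel\tnode2\tcreator\n"
    "Q42\tP31\tQ5\tbot-1\n"
    "Q42\tP106\tQ36180\tbot-1\n"
)

rows = list(csv.DictReader(io.StringIO(kgtk_text), delimiter="\t"))
for r in rows:
    print(r["node1"], r["label"], r["node2"], f"(creator: {r['creator']})")
```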
<<I found this similar to RDF quads, in that triples are identified and can act as the subject or object of other triples>>
In KGTK, the fourth element is an identifier for an edge (every edge has a unique identifier). The KGTK data model is significantly more flexible, as it is possible to associate edges with multiple graphs by using multiple edges on edges.
Export to N-Triples and LPG
Export modules transform the KGTK format into diverse standard and commonly used formats, such as RDF (N-Triples), property graphs in Neo4J format, and GML to invoke tools such as graph-tool or Gephi.
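A sketch of the kind of mapping an N-Triples export performs; the URI prefix is an assumed placeholder for illustration, not KGTK's actual namespace handling.

```python
# Map one KGTK edge (node1, label, node2) to an N-Triples line.
# The base URI is a hypothetical placeholder.
def to_ntriple(node1, label, node2, base="http://example.org/"):
    return f"<{base}{node1}> <{base}{label}> <{base}{node2}> ."

print(to_ntriple("Q42", "P31", "Q5"))
```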
Edge filtering based on nodes and/or properties (subset)
The filter operation selects edges from a KGTK file by specifying constraints (“patterns”) on the values for node1, label, and node2. The pattern language, inspired by graphy.js, has the following form: “subject-pattern ; predicate-pattern ; object-pattern”. For example, the common query of retrieving all humans from Wikidata corresponds to the filter “ ; P31 ; Q5”.
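The pattern semantics can be sketched in a few lines of Python (a reimplementation for illustration, not the kgtk CLI): an empty component matches any value.

```python
def parse_pattern(pattern):
    """Split 'subject ; predicate ; object'; an empty part matches anything."""
    return [part.strip() or None for part in pattern.split(";")]

def matches(edge, pattern):
    return all(want is None or value == want
               for value, want in zip(edge, parse_pattern(pattern)))

# Hypothetical edges as (node1, label, node2) tuples.
edges = [("Q42", "P31", "Q5"), ("Q42", "P69", "Q691283"), ("Q5", "P279", "Q215627")]
humans = [e for e in edges if matches(e, " ; P31 ; Q5")]
print(humans)  # only the instance-of-human edge survives
```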
Graph join based on nodes and/or properties (subset)
The join operation will join two KGTK files. Inner join, left outer join, right outer join, and full outer join are all supported. When a join takes place, the columns from the two files are merged into the set of columns for the output file. By default, KGTK will join based on the node1 column, although it can be configured to join by edge id. KGTK also allows the label and node2 columns to be added to the join.
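A minimal sketch of the default behavior, an inner join keyed on node1, with the remaining columns of the right file merged into the output (illustrative Python, not the KGTK implementation; the column prefix is an assumption).

```python
# Hypothetical rows from two KGTK files, as dictionaries keyed by column name.
left = [{"node1": "Q42", "label": "P31", "node2": "Q5"}]
right = [{"node1": "Q42", "label": "label", "node2": "'Douglas Adams'@en"}]

def inner_join(a, b, key="node1"):
    """Inner join on `key`; right-hand columns are prefixed to avoid clashes."""
    index = {}
    for row in b:
        index.setdefault(row[key], []).append(row)
    return [{**ra, **{f"right.{k}": v for k, v in rb.items() if k != key}}
            for ra in a for rb in index.get(ra[key], [])]

joined = inner_join(left, right)
print(joined)
```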
Path discovery between nodes
Reachable nodes: given a set of nodes N and a set of properties P, this operation computes the set of reachable nodes R that contains the nodes that can be reached from a node n ∈ N via paths containing any of the properties in P. This operation can be seen as a (joint) closure computation over one or multiple properties for a predefined set of nodes.
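This closure can be sketched as a breadth-first search restricted to edges whose label is in P (illustrative code with hypothetical edges, not the KGTK implementation).

```python
from collections import deque

def reachable(edges, seeds, props):
    """Nodes reachable from `seeds` via edges whose label is in `props`."""
    adj = {}
    for n1, label, n2 in edges:
        if label in props:
            adj.setdefault(n1, []).append(n2)
    seen, queue = set(), deque(seeds)
    while queue:
        n = queue.popleft()
        for m in adj.get(n, []):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

# Toy subclass chain: the closure over P279 climbs the class hierarchy.
edges = [("Q5", "P279", "Q215627"), ("Q215627", "P279", "Q35120"),
         ("Q5", "P31", "Q16521")]
print(reachable(edges, {"Q5"}, {"P279"}))
```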
Connected components
The connected components operation finds all connected components (communities) in a graph.
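A compact way to compute this is union-find over the edge list, ignoring edge direction and labels (a sketch with toy data, not the KGTK implementation).

```python
def components(edges):
    """Group nodes of an edge list into undirected connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for n1, _, n2 in edges:
        parent[find(n1)] = find(n2)  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

edges = [("a", "p", "b"), ("b", "p", "c"), ("x", "p", "y")]
comps = components(edges)
print(comps)  # two communities: {a, b, c} and {x, y}
```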
Embeddings
KGTK computes text-based graph embeddings using state-of-the-art language models: RoBERTa [13], BERT [5], and DistilBERT [17].
The text embeddings operation computes embeddings for all nodes in a graph by computing a sentence embedding over a lexicalization of the neighborhood of each node. The lexicalized sentence is created based on a template whose simplified version is:
is a {isa-properties},
has {has-properties},
and {properties:values}.
The labels (properties) to be used for label-properties, description-properties, isa-properties, has-properties, and property-values pairs are specified as input arguments to the operation. Computing similarity between such entity embeddings is a standard component of modern decision making systems such as entity linking, question answering, or table understanding.
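The lexicalization step can be sketched as follows. The property selections here (P31 as an isa-property, P106 as a has-property) and all values are illustrative assumptions, and the actual sentence embedding with a language model is omitted.

```python
def lexicalize(node, edges, labels,
               isa_props=frozenset({"P31"}), has_props=frozenset({"P106"})):
    """Render a node's neighborhood as a sentence following the template
    '{label} is a {isa-properties}, has {has-properties}.'"""
    isa = [labels[n2] for n1, l, n2 in edges if n1 == node and l in isa_props]
    has = [labels[n2] for n1, l, n2 in edges if n1 == node and l in has_props]
    return f"{labels[node]} is a {', '.join(isa)}, has {', '.join(has)}."

labels = {"Q42": "Douglas Adams", "Q5": "human", "Q36180": "writer"}
edges = [("Q42", "P31", "Q5"), ("Q42", "P106", "Q36180")]
print(lexicalize("Q42", edges, labels))
```

The resulting sentence would then be fed to one of the language models above to obtain the node's embedding.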
Centrality, PageRank
The graph statistics operation computes various graph statistics and centrality metrics. It computes a graph summary, containing its number of nodes, edges, and most common relations. In addition, it can compute graph degrees, HITS centrality and PageRank values. Aggregated statistics (minimum, maximum, average, and top nodes) for these connectivity/centrality metrics are included in the summary, whereas the individual values for each node are represented as edges in the resulting graph. The graph is assumed to be directed, unless indicated differently.
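PageRank itself can be illustrated with a short power iteration over a toy directed graph (a sketch of the metric on hypothetical data, not KGTK's implementation).

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over (node1, label, node2) edges."""
    nodes = {n for a, _, b in edges for n in (a, b)}
    out = {n: [] for n in nodes}
    for a, _, b in edges:
        out[a].append(b)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for a in nodes:
            if out[a]:
                share = damping * rank[a] / len(out[a])
                for b in out[a]:
                    new[b] += share
            else:  # dangling node: spread its rank evenly
                for b in nodes:
                    new[b] += damping * rank[a] / len(nodes)
        rank = new
    return rank

ranks = pagerank([("a", "p", "b"), ("b", "p", "c"),
                  ("c", "p", "a"), ("d", "p", "a")])
print(max(ranks, key=ranks.get))  # "a" receives the most rank
```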
I still have not been able to generate text embeddings from the Lattes data using KGTK.