Filip Ilievski, Kartik Shenoy, Nicholas Klein, Hans Chalupsky, Pedro Szekely
Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA
Abstract—Robust estimation of concept similarity is crucial for a range of AI applications, like
deduplication, recommendation, and entity linking. Rich and diverse knowledge in large
knowledge graphs like Wikidata can be exploited for this purpose.
Introduction
... we need methods to automatically infer whether two arbitrary concepts are identical, dissimilar, or nearly identical [1].
** Concepts ... but it is also necessary to compute the similarity between entities of a KG **
The task of word similarity has been widely studied.... Early work generally relies on taxonomy-based methods that leverage the distance between two words in a taxonomy hierarchy.
** For example, by computing the distance of each word to their closest common ancestor **
More recently, pre-trained word embeddings have been shown to natively capture word similarity at scale. Word embeddings may benefit from retrofitting to lexical resources like WordNet. It is unclear
how to best estimate similarity of concepts described in KGs.
** Word embeddings use corpus-based techniques and may better reflect the relation between words based on co-occurrence in the same context (and not similarity of sense). WordNet here is seen as a taxonomy, a dictionary that is only terminological, not a KG **
** See Jonatas's material on retrofitting **
Besides language models and taxonomy-based metrics, we can leverage graph embeddings, like TransE and ComplEx, which organize nodes in a geometric space according to their structural links to other nodes. Random walk methods, such as node2vec variants, leverage the generalizability of language modeling, applying it to graph nodes instead of words. Furthermore, the embeddings created by language models (LMs) or KGs can be retrofitted based on background knowledge, coming from the target graph or additional resources.
** How to combine the different methods? **
Background
Similarity is a central theoretical construct in psychology, facilitating the transfer of learning from an original training context to new situations. Tversky posits that the literal similarity between two objects A and B is proportional to the intersection of their features and inversely proportional to the features in which they differ (A − B and B − A).
** Definition of similarity in this research **
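Tversky's contrast model can be sketched in formula form (here A and B are the feature sets of the two objects, f measures feature salience, and θ, α, β are non-negative weights, following Tversky's 1977 formulation):

```latex
\mathrm{sim}(a, b) \;=\; \theta\, f(A \cap B) \;-\; \alpha\, f(A - B) \;-\; \beta\, f(B - A)
```

When α ≠ β the measure is asymmetric, which motivates treating sim(c1, c2) and sim(c2, c1) as distinct quantities below.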
In this paper, we consider the task of literal similarity between two concepts. Given two concept nodes, c1 and c2 in a KG G, a system is asked to provide a pairwise similarity score sim(c1, c2). We consider similarity to be asymmetric, i.e., sim(c1, c2) ≠ sim(c2, c1). Following common practice in the concept and word similarity tasks, we assume that the similarity of two concepts can be measured on a continuous numeric scale.
Natural Language Processing research has studied the extent to which two concepts are similar or related. Here, similarity resembles the notion of literal similarity in psycholinguistics, while relatedness is a broader notion indicating that two concepts tend to appear in the same topical context.
Framework for estimating similarity
The proposed framework is visually depicted in Figure 1.
** It has two phases: one offline and one online, as in the search proposal **
We use graph embedding and text embedding models, as well as ontology-based metrics, as initial similarity estimators. We also concatenate the embeddings in order to combine their scores.
We use retrofitting to further tune the individual embedding models, through distant supervision over millions of weighted pairs extracted automatically from large-scale knowledge graphs. For a given concept pair, the similarity scores generated by the retrofitted embedding models can be combined with the scores by the ontology-based models.
Similarity models - Offline
Graph embedding models - We experiment with four KG embedding models, which can be divided into link-prediction models (TransE [12] and ComplEx [13]) and random-walk models (DeepWalk [40] and S-DeepWalk [15]). For all models, we compute the cosine similarity between their embeddings for c1 and c2.
** They have not yet used the models that treat qualifiers as elements distinct from the triples/quads **
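As a minimal sketch, the similarity score over any of these embedding models reduces to a cosine between two vectors (the vectors below are made up, standing in for the learned embeddings of c1 and c2):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative vectors standing in for TransE/ComplEx/DeepWalk embeddings
# of two concept nodes c1 and c2 (values are invented for the example).
c1 = np.array([0.2, 0.7, 0.1])
c2 = np.array([0.25, 0.65, 0.05])
score = cosine_similarity(c1, c2)
```

The same function applies unchanged to the language-model embeddings described next, which is what makes concatenating and comparing the different estimators straightforward.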
Language models - We use Transformer LMs to represent the textual information associated with a node in the graph. Similarity between two nodes is then measured through the cosine similarity between two LM embeddings. We experiment with four kinds of textual information: 1) labels, which consider only the English label; 2) labels+desc, which considers a concatenation between a node label and its description; 3) lexicalization, where we automatically generate a node description based on the properties: P31 (instance of), P279 (subclass of), P106 (occupation), P39 (position held), P1382 (partially coincident with), P373 (Commons Category), P452 (industry); and 4) abstract, which is based on the first sentences from entity abstracts in the DBpedia KG, mapped to Wikidata through their sitelinks.
** Transformers for text ... it is not GPT-3 yet **
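The lexicalization variant can be sketched roughly as follows (this is a hypothetical helper, not KGTK's actual implementation; the function name, data layout, and sentence template are assumptions):

```python
# Properties used by the paper's lexicalization variant.
LEX_PROPERTIES = ["P31", "P279", "P106", "P39", "P1382", "P373", "P452"]

def lexicalize(label: str, claims: dict) -> str:
    """Build a short textual description of a node for an LM to embed.

    `claims` maps Wikidata property IDs to lists of human-readable
    object labels (a simplified stand-in for the real claim format).
    """
    parts = []
    for prop in LEX_PROPERTIES:
        values = claims.get(prop)
        if values:
            parts.append(", ".join(values))
    return f"{label} is a {'; '.join(parts)}." if parts else label

desc = lexicalize("Marie Curie", {"P31": ["human"], "P106": ["physicist", "chemist"]})
```

The resulting sentence is then embedded with the same Transformer LM as the other textual variants.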
Self-supervision knowledge - Offline
We tune the original embeddings via self-supervision using two KGs: Wikidata and ProBase.
We derive three datasets from Wikidata’s subclass-of (P279) ontology...
We define three weighting methods for the generated pairs from these two datasets: (1) a constant weight of 1; (2) class similarity between the two nodes (using the class metric described in the last section); and (3) cosine similarity between the concatenated labels and descriptions of the two nodes. ... We focus our experiments on cosine similarity as the weighting function, because we observed empirically that it consistently performs better than or comparably to the other two weighting functions.
Retrofitting - Offline
We use the retrofitting technique ..., which iteratively updates node embeddings in order to bring them closer in accordance to their connections in an external dataset.
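The retrofitting objective of Faruqui et al. (2015), which this technique builds on, can be written as follows (q̂_i are the original vectors, q_i the retrofitted ones, E the edge set of the external dataset, and α_i, β_ij the anchoring and edge weights; notation follows that paper, not this one):

```latex
\Psi(Q) \;=\; \sum_{i=1}^{n} \Big[\, \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2
\;+\; \sum_{(i,j) \in E} \beta_{ij} \,\lVert q_i - q_j \rVert^2 \,\Big]
```

Minimizing Ψ by coordinate descent yields the iterative update q_i ← (Σ_{j:(i,j)∈E} β_ij q_j + α_i q̂_i) / (Σ_{j:(i,j)∈E} β_ij + α_i), which pulls each node toward its neighbors while anchoring it to its original embedding.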
Experimental setup
We experiment with three benchmarks: 1) WD-WordSim353 ...; 2) WD-RG65, a benchmark based on the DBpedia disambiguation ...; and 3) WD-MC30, also a benchmark based on the DBpedia disambiguation ...
We measure the impact of retrofitting with subsets from Wikidata and ProBase, scored based on language models.
We use the KGTK [48] toolkit to lexicalize a node, subset the graphs, and create various graph and language model-based embeddings. We use scikit-learn for supervised learning. We use KGTK’s similarity API to obtain scores for the metrics Class, Jiang Conrath, and TopSim.
** GitHub with the notebooks -> https://github.com/usc-isi-i2/wd-similarity **
Results
How well do different algorithms and combinations capture semantic similarity?
The Abstract-based method performs best among all language model variants, and overall. It outperforms the other LMs because DBpedia’s abstracts contain information that is more comprehensive and tailored to entity types than Wikidata labels, descriptions, or static property sets.
These (graph embedding) methods are consistently outperformed by the Lexicalization and Abstract methods, suggesting that the graph embeddings' wealth of information is a double-edged sword: many properties are considered that may not be useful for determining similarity, adding distractions that can decrease performance. The Abstract method has an additional advantage over the graph embeddings in that it is less restricted in the kind of information it can consider, whereas the graph embeddings focus solely on relations and cannot make use of numeric- or string-valued properties. The combination methods that we evaluated generally did not yield improved performance over the best individual method (Abstract).
** Text embedding (LM) methods were better than graph embeddings, and combining them brought no improvement **
What is the impact of retrofitting?
Retrofitting is overall beneficial for estimating similarity. On average across the three benchmarks, it improves the performance of nine out of the eleven methods.
The impact of retrofitting is lower on methods that consider richer information already, like Abstract and Lexicalized. This is because these methods already integrate taxonomic information, and retrofitting might bring concepts that are nearly identical or merely related too close in the embedding space.
These findings indicate that similarity between highly similar and dissimilar concepts is well-understood and captured by current methods, whereas the intermediate spectrum of near-identity and relatedness requires further study and focused evaluation.
** Retrofitting improved the methods, but it was most useful for adjusting graph and text methods with little information. And among those, it improved the results that are not at the most extreme ends **
Conclusions
The experiments revealed that:
- pairing language models with contextualized information found in abstracts led to optimal performance.
- retrofitting with taxonomic information from Wikidata generally improved performance across methods, with the simpler methods benefiting more from retrofitting.
- retrofitting with the ProBase KG yielded consistently negative results, indicating that the impact of retrofitting directly depends on the quality of the underlying data.
- analysis demonstrated that both vanilla models and retrofitted models perform best on identical and dissimilar pairs.
Experiments on three benchmarks reveal that pairing language models with rich information performs best, whereas the impact of retrofitting is most positive on methods which originally do not consider comprehensive information. The performance of retrofitting depends on the source of knowledge and the edge weighting function.
Future work should investigate contextual similarity between concepts, which would characterize partial identity and relatedness of concept pairs.
More on retrofitting in NLP
https://odsc.medium.com/the-promise-of-retrofitting-building-better-models-for-natural-language-processing-20783b19cdcb
To use common sense with deep learning, one must connect the curated, organized information about the world (like ConceptNet) with previously unseen, domain-specific data, such as a set of documents to analyze. The best way to do that is a family of algorithms called 'retrofitting', first published by Manaal Faruqui in 2015. The goal of retrofitting is to combine structured information, like a knowledge graph (ConceptNet or WordNet, for example), with an embedding of word vectors, similar to Word2Vec. By modifying the embedding so that related concepts in the knowledge graph are related in similar ways in the embedding, we apply knowledge-based constraints after training the distributional word vectors. The thinking is that connected terms in the knowledge graph should have vectors that are closer together inside the embedding itself.
** Here WordNet is already treated as a KG. The idea is to compute corpus-based embeddings and then adjust them based on the structure of the KG **
https://krayush.medium.com/retrofitting-word-vectors-to-semantic-lexicons-3f85f4208f4f
Retrofitting Word Vectors to Semantic Lexicons
How to integrate information that comes from existing lexicons to word vectors? A method as published in “Retrofitting Word Vectors to Semantic Lexicons” in NAACL, 2015.
** Here WordNet is just the lexicon (taxonomy, dictionary) and not a KG **
A quick overview: the paper formulates a post-processing method incorporating the idea of belief propagation across a relational graph constructed from the lexicon at hand.
What do they propose? A setup in which a lexicon is represented as a graph whose edges denote relations between nodes (words). Each word then looks at its neighbors, collects information (their word embeddings) from them, and updates itself iteratively.
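The iterative neighbor update described above can be sketched as a few lines of NumPy (a minimal sketch of Faruqui et al.'s update rule; the function name, data layout, and toy vectors are illustrative, not from either paper):

```python
import numpy as np

def retrofit(embeddings, edges, alpha=1.0, iterations=10):
    """Pull each vector toward the weighted average of its graph neighbors
    while staying anchored to its original position.

    embeddings: dict node -> np.ndarray (original vectors, kept fixed)
    edges: dict node -> list of (neighbor, weight) pairs
    alpha: weight of the anchor toward the original vector
    """
    new = {n: v.copy() for n, v in embeddings.items()}
    for _ in range(iterations):
        for node, neighbors in edges.items():
            if not neighbors:
                continue  # nodes without edges keep their original vector
            weight_sum = sum(w for _, w in neighbors)
            neighbor_sum = sum(w * new[nb] for nb, w in neighbors)
            # Closed-form minimizer of the local objective: balance the
            # neighbors' vectors against the node's original embedding.
            new[node] = (neighbor_sum + alpha * embeddings[node]) / (weight_sum + alpha)
    return new

# Toy example: two connected words with initially orthogonal vectors.
orig = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
graph = {"a": [("b", 1.0)], "b": [("a", 1.0)]}
fitted = retrofit(orig, graph)
```

After a few iterations the connected vectors move closer together while each remains anchored near its original position, which is exactly the behavior the blog posts above describe.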