
Evolution of Semantic (Text) Similarity—A Survey: Article Reading Notes

Dhivya Chandrasekaran and Vijay Mago. 2021. Evolution of Semantic Similarity—A Survey. ACM Comput. Surv. 54, 2, Article 41 (February 2021), 37 pages. https://doi.org/10.1145/3440755

In the early days, two text snippets were considered similar if they contained the same words/characters. Techniques such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) were used to represent text as real-valued vectors to aid the calculation of semantic similarity. However, these techniques did not account for the fact that words have different meanings and that different words can be used to represent a similar concept.
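For instance, a minimal sketch of this lexical approach, assuming scikit-learn as the implementation (the survey does not prescribe a specific toolkit):

```python
# Lexical similarity with TF-IDF vectors and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",  # similar concept, partly different words
]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))
# "sat" and "sitting" do not match lexically, so the score is lowered,
# illustrating the drawback noted above.
```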

To address these drawbacks of lexical measures, various semantic similarity techniques have been proposed over the past three decades. Semantic similarity is often used synonymously with semantic relatedness. However, semantic relatedness not only accounts for the semantic similarity between texts but also takes a broader perspective, analyzing the shared semantic properties of two words.

Knowledge-based semantic similarity methods calculate semantic similarity between two terms based on the information derived from one or more underlying knowledge sources, such as ontologies/lexical databases, thesauri, dictionaries, and so on.

Examples of knowledge sources: WordNet (the similarity between two words depends on the path distance between them), Wiktionary, Wikipedia (used as a graph to determine the Information Content of concepts, which in turn aids in calculating semantic similarity), and BabelNet.

Three families of methods: edge-counting methods (path length, common ancestor), feature-based methods (common neighbors, term definitions/glosses), and information content-based methods (more specific vs. more abstract concepts).
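A brief sketch of the edge-counting and information-content families, assuming NLTK's WordNet interface (path similarity and Resnik similarity are just two representatives of the many measures surveyed):

```python
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# Edge-counting: score derived from the shortest path between the synsets.
print('path:', dog.path_similarity(cat))

# Information content: Resnik's measure scores the most specific
# common ancestor using corpus statistics (here, the Brown corpus).
brown_ic = wordnet_ic.ic('ic-brown.dat')
print('resnik:', dog.res_similarity(cat, brown_ic))
```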

Knowledge-based semantic similarity methods are computationally simple; the underlying knowledge base acts as a strong backbone for the models, and common problems of ambiguity, such as synonyms, idioms, and phrases, are handled efficiently. However, knowledge-based systems are highly dependent on the underlying source, resulting in the need to update them frequently, which requires time and high computational resources.

Corpus-based semantic similarity methods measure semantic similarity between terms using information retrieved from large corpora. The underlying principle, called the “distributional hypothesis” [36], exploits the idea that “similar words occur together, frequently”; however, the actual meaning of the words is not taken into consideration. Among the measures used to compare the resulting vectors, cosine similarity has gained the most significance and has been widely used among NLP researchers to date.
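A minimal sketch of the distributional hypothesis, assuming a toy two-sentence corpus and a one-word co-occurrence window (real systems use far larger corpora and weightings such as PPMI):

```python
import numpy as np

corpus = ["he drinks coffee every morning",
          "she drinks tea every morning"]

# Word-by-word co-occurrence matrix with a +/-1 word window.
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                M[idx[w], idx[words[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "coffee" and "tea" never co-occur, but they share the contexts
# "drinks" and "every", so their cosine similarity is high.
print(cosine(M[idx['coffee']], M[idx['tea']]))
```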

Word embeddings are used to measure semantic similarity between texts of different languages by mapping the word embeddings of one language onto the vector space of another. By training on a limited yet sufficient number of translation pairs, a translation matrix can be computed to enable the overlap of embeddings across languages.
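A minimal sketch of learning such a translation matrix as a least-squares linear map (in the spirit of Mikolov et al.'s linear mapping; the random toy embeddings and dimensions below are assumptions for illustration):

```python
import numpy as np

d_src, d_tgt, n_pairs = 300, 300, 5000   # assumed dimensions and pair count
rng = np.random.default_rng(0)

# X holds source-language embeddings, Y target-language embeddings;
# row i of each corresponds to one known translation pair.
X = rng.normal(size=(n_pairs, d_src))
Y = rng.normal(size=(n_pairs, d_tgt))

# Solve min_W ||XW - Y||^2 for the translation matrix W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A source vector is mapped into the target space with x @ W;
# its cosine nearest neighbors there act as candidate translations.
x_mapped = X[0] @ W
```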

One of the major challenges faced when deploying word embeddings to measure similarity is Meaning Conflation Deficiency: a single embedding does not distinguish among the different meanings of a word, which pollutes the semantic space with noise by bringing irrelevant words closer to each other. For example, the words “finance” and “river” may appear close in the same semantic space, since the word “bank” has two different meanings.

Deep Neural Network-based Semantic Similarity Methods

Semantic similarity methods have exploited recent developments in neural networks to enhance performance. The most widely used techniques include Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and Recursive Tree LSTM. Deep neural network models are built on two fundamental operations: convolution and pooling. The convolution operation on text data may be defined as the sum of the element-wise product of a sentence vector and a weight matrix. Convolution operations are used for feature extraction. Pooling operations are used to eliminate features that have a negative impact and keep only those feature values that have a considerable impact on the task at hand. There are different types of pooling operations; the most widely used is max pooling, where only the maximum value in the given filter space is selected.
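A minimal sketch of these two operations on a toy sentence matrix (sizes and values are assumptions), matching the definition above: each convolution output is the sum of the element-wise product of a word window with the weight matrix, and max pooling keeps the strongest response per filter:

```python
import numpy as np

rng = np.random.default_rng(0)
sent = rng.normal(size=(7, 4))   # 7 words, 4-dimensional embeddings
W = rng.normal(size=(3, 4))      # one filter spanning a 3-word window

# Convolution: slide the filter over the sentence; each output value is
# the sum of the element-wise product of a window and the weight matrix.
feature_map = np.array([np.sum(sent[i:i + 3] * W)
                        for i in range(sent.shape[0] - 3 + 1)])

# Max pooling: keep only the maximum value in the filter's feature map.
pooled = feature_map.max()
print(feature_map.shape, pooled)   # (5,) and a single scalar
```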

Measures: cosine similarity, Dice coefficient, Manhattan distance (L1 norm), Jaccard index, Euclidean distance (L2 norm).
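A minimal sketch of these measures (Dice and Jaccard are shown in their set-overlap form over token sets, one common convention; the example vectors are arbitrary):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def manhattan(u, v):   # L1 norm of the difference
    return np.abs(u - v).sum()

def euclidean(u, v):   # L2 norm of the difference
    return np.linalg.norm(u - v)

def dice(a, b):        # 2|A & B| / (|A| + |B|) over token sets
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):     # |A & B| / |A | B| over token sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
print(cosine(u, v), manhattan(u, v), euclidean(u, v))
print(dice("a b c".split(), "b c d".split()),
      jaccard("a b c".split(), "b c d".split()))
```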

Comments

  1. I am preparing a presentation based on this article to better clarify the Semantic Mismatch of Syntactic Search

