Dhivya Chandrasekaran and Vijay Mago. 2021. Evolution of Semantic Similarity—A Survey. ACM Comput. Surv. 54, 2, Article 41 (February 2021), 37 pages.
https://doi.org/10.1145/3440755
In the early days, two text snippets were considered similar if they contained the same words or characters. Techniques such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) were used to represent text as real-valued vectors to aid the calculation of semantic similarity. However, these techniques did not account for the fact that words have different meanings and that different words can be used to represent a similar concept.
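A minimal sketch of this lexical approach, assuming scikit-learn is available (the example snippets are invented for illustration): two snippets are turned into TF-IDF vectors and compared with cosine similarity, so paraphrases that use different words would score low.

```python
# Minimal sketch (not from the survey): representing two snippets with TF-IDF
# vectors and comparing them lexically. Assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "The bank approved the loan application.",
    "The loan application was approved by the bank.",
]

# Fit a TF-IDF vocabulary over both snippets and vectorize them.
vectors = TfidfVectorizer().fit_transform(snippets)

# Cosine similarity of the two real-valued vectors; this captures only
# lexical overlap, not word meaning.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```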
To address these drawbacks of the lexical measures, various semantic similarity techniques have been proposed over the past three decades. Semantic similarity is often used synonymously with semantic relatedness. However, semantic relatedness not only accounts for the semantic similarity between texts but also takes a broader perspective, analyzing the shared semantic properties of two words.
Knowledge-based semantic similarity methods calculate semantic similarity between two terms based on the information derived from one or more underlying knowledge sources, such as ontologies/lexical databases, thesauri, dictionaries, and so on.
Typical knowledge sources include WordNet (where the similarity between two words depends on the path distance between them), Wiktionary, Wikipedia (used as a graph to determine the Information Content of concepts, which in turn aids in calculating the semantic similarity), and BabelNet.
These methods fall into edge-counting methods (path length, common ancestor), feature-based methods (common neighbors, term definitions), and information content-based methods (how specific or abstract a concept is).
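A minimal sketch of the edge-counting style, assuming NLTK and its WordNet data are installed; the concepts "car" and "truck" are chosen only for illustration and these are NLTK's measures, not necessarily the exact ones discussed in the survey.

```python
# Minimal sketch: structural similarity between two WordNet concepts.
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
truck = wn.synset("truck.n.01")

# Path similarity: inverse of the shortest path length between the concepts
# in the hypernym hierarchy (an edge-counting measure).
print(car.path_similarity(truck))

# Wu-Palmer similarity: based on the depth of the most specific common
# ancestor shared by the two concepts.
print(car.wup_similarity(truck))
```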
Knowledge-based semantic similarity methods are computationally simple, the underlying knowledge base acts as a strong backbone for the models, and common sources of ambiguity, such as synonyms, idioms, and phrases, are handled efficiently. However, knowledge-based systems are highly dependent on the underlying source, which therefore needs to be updated frequently, requiring time and substantial computational resources.
Corpus-based semantic similarity methods measure semantic similarity between terms using information retrieved from large corpora. The underlying principle, called the "distributional hypothesis" [36], exploits the idea that "similar words occur together, frequently"; the actual meaning of the words, however, is not taken into consideration. Among all these measures, cosine similarity has gained the most significance and is still widely used by NLP researchers.
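A minimal sketch of the distributional hypothesis (the toy corpus and window size are invented for illustration): each word is represented by its co-occurrence counts with neighboring words, and words that share contexts end up with similar vectors.

```python
# Minimal sketch: count-based word-context vectors compared with cosine.
import numpy as np
from collections import defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks rose on the exchange",
]

window = 2
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[w][tokens[j]] += 1

vocab = sorted({t for s in corpus for t in s.split()})
M = np.array([[cooc[w][c] for c in vocab] for w in vocab], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "cat" and "dog" share contexts, so their rows are more similar than
# "cat" and "stocks".
idx = {w: k for k, w in enumerate(vocab)}
print(cosine(M[idx["cat"]], M[idx["dog"]]), cosine(M[idx["cat"]], M[idx["stocks"]]))
```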
Word embeddings are used to measure semantic similarity between texts of different languages by mapping the word embeddings of one language onto the vector space of another. By training on a limited yet sufficient number of translation pairs, a translation matrix can be computed to align the embeddings across languages, as sketched below.
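A minimal sketch of such a translation matrix learned by least squares; the embeddings here are random placeholders rather than real monolingual models, and the dimensions are arbitrary.

```python
# Minimal sketch: learn a linear map W so that x_source @ W ≈ x_target
# for known translation pairs, then apply it to unseen source words.
import numpy as np

rng = np.random.default_rng(0)
dim_src, dim_tgt, n_pairs = 50, 50, 200

# X: source-language embeddings, Y: embeddings of their translations.
X = rng.normal(size=(n_pairs, dim_src))
Y = rng.normal(size=(n_pairs, dim_tgt))

# Least-squares solution of min_W ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Map a new source embedding into the target vector space, where it can be
# compared (e.g., by cosine similarity) with target-language words.
x_new = rng.normal(size=(dim_src,))
y_mapped = x_new @ W
print(y_mapped.shape)
```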
One of the major challenges faced when deploying word embeddings to measure similarity is Meaning Conflation Deficiency: word embeddings do not distinguish between the different meanings of a word, which pollutes the semantic space with noise by bringing irrelevant words closer to each other. For example, the words "finance" and "river" may appear in the same semantic region, since the word "bank" has two different meanings.
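One way to see this in practice, assuming gensim and an internet connection for its downloader (the model name "glove-wiki-gigaword-50" is one of gensim's standard pretrained downloads, not something used in the survey): the nearest neighbors of "bank" typically mix the financial and river senses, because a single static vector conflates both meanings.

```python
# Minimal sketch: inspect neighbors of an ambiguous word in static embeddings.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# Neighbors usually mix both senses of "bank", illustrating meaning conflation.
print(model.most_similar("bank", topn=10))
```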
Deep Neural Network-based Semantic Similarity Methods
Semantic similarity methods have exploited recent developments in neural networks to enhance performance. The most widely used techniques include Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and Recursive Tree LSTM. Deep neural network models are built on two fundamental operations: convolution and pooling. The convolution operation over text data may be defined as the sum of the element-wise product of a sentence vector and a weight matrix; convolution operations are used for feature extraction. Pooling operations discard features that have a negative impact and keep only the feature values that have a considerable impact on the task at hand. There are different types of pooling operations; the most widely used is max pooling, where only the maximum value in the given filter space is selected.
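A minimal sketch of these two operations, assuming PyTorch; the embedding dimension, filter size, and the random "sentence" are placeholders for illustration.

```python
# Minimal sketch: convolution over a sentence matrix, then max pooling over time.
import torch
import torch.nn as nn

batch, seq_len, emb_dim, n_filters, filter_size = 1, 10, 50, 16, 3

# A sentence as a sequence of word embeddings (random placeholders here).
# Conv1d expects (batch, channels, length), with channels = embedding dim.
sentence = torch.randn(batch, emb_dim, seq_len)

# Convolution: sum of element-wise products of the filter weights with each
# window of word vectors; acts as a feature extractor.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=n_filters, kernel_size=filter_size)
features = torch.relu(conv(sentence))       # (batch, n_filters, seq_len - filter_size + 1)

# Max pooling: keep only the strongest response of each filter over the sentence.
pooled = torch.max(features, dim=2).values  # (batch, n_filters)
print(pooled.shape)
```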
Measures used to compare the resulting vectors: cosine similarity, Dice coefficient, Manhattan distance (L1 norm), Jaccard coefficient, Euclidean distance (L2 norm).
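A minimal sketch of these measures (not taken from the survey's definitions): cosine, Euclidean, and Manhattan are computed on dense vectors, while Jaccard and Dice are shown in their common set-overlap form over token sets; the sample vectors and token sets are invented.

```python
# Minimal sketch of the listed similarity/distance measures.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):  # L2 norm of the difference
    return float(np.linalg.norm(u - v))

def manhattan(u, v):  # L1 norm of the difference
    return float(np.abs(u - v).sum())

def jaccard(a, b):    # |A ∩ B| / |A ∪ B| over token sets
    return len(a & b) / len(a | b)

def dice(a, b):       # 2|A ∩ B| / (|A| + |B|) over token sets
    return 2 * len(a & b) / (len(a) + len(b))

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
a, b = set("the cat sat".split()), set("the cat slept".split())
print(cosine(u, v), euclidean(u, v), manhattan(u, v), jaccard(a, b), dice(a, b))
```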
I am preparing a presentation based on this article to better explain the Semantic Mismatch of syntactic search.