
Evolution of Semantic (Text) Similarity—A Survey: Reading Notes

Dhivya Chandrasekaran and Vijay Mago. 2021. Evolution of Semantic Similarity—A Survey. ACM Comput. Surv. 54, 2, Article 41 (February 2021), 37 pages. 

https://doi.org/10.1145/3440755

In the early days, two text snippets were considered similar if they contained the same words or characters. Techniques such as Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) were used to represent text as real-valued vectors to aid the calculation of semantic similarity. However, these techniques did not account for the fact that words have different meanings and that different words can be used to represent a similar concept.
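A minimal sketch of this limitation, assuming scikit-learn is available (the example sentences are invented): two paraphrases that share no content words come out with zero similarity under a TF-IDF representation.

```python
# Sketch: TF-IDF vectors miss synonymy (assumes scikit-learn; sentences are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the physician healed the child",
    "the doctor cured the kid",
]

# Each document becomes a real-valued vector over the vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(cosine_similarity(X[0], X[1]))  # [[0.]] -- no shared content words, so "not similar"
```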

To address these drawbacks of the lexical measures, various semantic similarity techniques were proposed over the past three decades. Semantic similarity is often used synonymously with semantic relatedness. However, semantic relatedness not only accounts for the semantic similarity between texts but also takes a broader perspective, analyzing the shared semantic properties of two words.

Knowledge-based semantic similarity methods calculate semantic similarity between two terms based on the information derived from one or more underlying knowledge sources, such as ontologies/lexical databases, thesauri, dictionaries, and so on.

WordNet (the similarity between two words depends on the path distance between them), Wiktionary, Wikipedia (used as a graph to determine the Information Content of concepts, which in turn aids in calculating the semantic similarity), BabelNet

edge-counting methods (path, common ancestor), feature-based methods (common neighbors, term definition), and information content-based methods (more specific vs. more abstract concepts).
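As a concrete illustration of the edge-counting and information content families (feature-based measures are not directly exposed here), a small sketch using NLTK's WordNet interface, assuming the wordnet and wordnet_ic corpora have been downloaded:

```python
# Knowledge-based similarity over WordNet (assumes the NLTK data packages are installed).
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")     # Information Content estimated from the Brown corpus

print(dog.path_similarity(cat))              # edge-counting: based on the shortest path between concepts
print(dog.wup_similarity(cat))               # uses the depth of the least common ancestor (Wu-Palmer)
print(dog.lin_similarity(cat, brown_ic))     # information content-based measure (Lin)
```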

Knowledge-based semantic similarity methods are computationally simple, the underlying knowledge base acts as a strong backbone for the models, and the most common problems of ambiguity (synonyms, idioms, and phrases) are handled efficiently. However, knowledge-based systems are highly dependent on the underlying source, which needs to be updated frequently, requiring time and significant computational resources.

Corpus-based semantic similarity methods measure semantic similarity between terms using information retrieved from large corpora. The underlying principle, called the "distributional hypothesis" [36], exploits the idea that "similar words occur together, frequently"; however, the actual meaning of the words is not taken into consideration. Among all these measures, cosine similarity gained significance and has been widely used by NLP researchers to date.
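A toy sketch of the distributional hypothesis (pure NumPy, with an invented mini-corpus): each word is represented by its co-occurrence counts, and words that occur in similar contexts end up with a high cosine similarity even though their meaning is never modeled explicitly.

```python
import numpy as np
from itertools import combinations

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}
co = np.zeros((len(vocab), len(vocab)))

for sent in corpus:                          # count word co-occurrences within each sentence
    for a, b in combinations(sent.split(), 2):
        if a != b:
            co[index[a], index[b]] += 1
            co[index[b], index[a]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "cat" and "dog" appear in similar contexts, so their count vectors point in similar directions.
print(cosine(co[index["cat"]], co[index["dog"]]))
```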

Word embeddings are used to measure semantic similarity between texts in different languages by mapping the word embeddings of one language onto the vector space of another. By training on a limited yet sufficient number of translation pairs, a translation matrix can be computed to enable the overlap of embeddings across languages.
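A rough sketch of the translation-matrix idea, using NumPy with made-up toy embeddings (the dimensions and names are hypothetical): a linear map W is fitted on a small set of translation pairs and then used to project source-language vectors into the target space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 50, 200                        # hypothetical embedding size and number of translation pairs

X_src = rng.normal(size=(n_pairs, dim))       # embeddings of the source words in the pairs
Y_tgt = rng.normal(size=(n_pairs, dim))       # embeddings of their translations in the target space

# Least-squares fit of W so that X_src @ W ≈ Y_tgt (the "translation matrix").
W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)

def translate(v_src: np.ndarray) -> np.ndarray:
    """Project a source-language word vector into the target embedding space."""
    return v_src @ W

# A query word is then matched to the target word whose embedding is most similar
# (e.g., by cosine) to translate(query_vector).
```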

One of the major challenges faced when deploying word embeddings to measure similarity is Meaning Conflation Deficiency: word embeddings do not capture the different meanings of a word, which pollutes the semantic space with noise by bringing irrelevant words closer to each other. For example, the words "finance" and "river" may appear in the same semantic space, since the word "bank" has two different meanings.

Deep Neural Network-based Semantic Similarity Methods

Semantic similarity methods have exploited recent developments in neural networks to enhance performance. The most widely used techniques include Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and Recursive Tree LSTM. Deep neural network models are built on two fundamental operations: convolution and pooling. The convolution operation on text data may be defined as the sum of the element-wise product of a sentence vector and a weight matrix; convolution operations are used for feature extraction. Pooling operations are used to eliminate features that have a negative impact and retain only those feature values that have a considerable impact on the task at hand. There are different types of pooling operations; the most widely used is max pooling, where only the maximum value in the given filter space is selected.
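A minimal NumPy sketch of these two operations on a toy sentence matrix (all values are invented): the convolution is the sum of the element-wise product of each word window with a weight matrix, and max pooling keeps only the strongest response.

```python
import numpy as np

rng = np.random.default_rng(1)
sent_len, emb_dim, window = 7, 4, 3

sentence = rng.normal(size=(sent_len, emb_dim))   # one embedding per word
weights = rng.normal(size=(window, emb_dim))      # one convolution filter

# Convolution: slide the filter over word windows; each step is a sum of element-wise products.
feature_map = np.array([
    np.sum(sentence[i:i + window] * weights)
    for i in range(sent_len - window + 1)
])

pooled = feature_map.max()                        # max pooling: keep only the largest feature value
print(feature_map, pooled)
```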

Measures: cosine similarity, Dice coefficient, Manhattan distance (L1 norm), Jaccard coefficient, Euclidean distance (L2 norm)
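For reference, a compact sketch of these measures (NumPy only; the vectors and token sets are illustrative):

```python
import numpy as np

a = np.array([1.0, 0.0, 2.0, 1.0])
b = np.array([0.0, 1.0, 2.0, 1.0])
A = {"semantic", "text", "similarity"}
B = {"semantic", "similarity", "survey"}

cosine    = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
manhattan = np.abs(a - b).sum()                   # L1 norm of the difference
euclidean = np.linalg.norm(a - b)                 # L2 norm of the difference
jaccard   = len(A & B) / len(A | B)               # overlap of token sets
dice      = 2 * len(A & B) / (len(A) + len(B))

print(cosine, manhattan, euclidean, jaccard, dice)
```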

