
Literature Review III - Advanced Techniques used for Semantic Search & The Future of Semantic Search

 5.1 Ranking

 ***standard ranking techniques for document-centric keyword search on text, such as: BM25 scoring, language models***

BM25 replaces TF/IDF as the default scoring function in Elasticsearch.

BM25 is a ranking function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. It is used by search engines to rank matching documents according to their relevance to a given search query and is often referred to as “Okapi BM25,” since the Okapi information retrieval system was the first system implementing this function. The BM25 retrieval formula belongs to the BM family of retrieval models (BM stands for Best Match).

Source: https://doi.org/10.1007/978-0-387-39940-9_921
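To make the description above concrete, here is a minimal sketch of one common member of the BM25 family. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy corpus are assumptions chosen for illustration; they do not come from the source.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (always positive); one of several variants in use.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (made up for this example).
docs = [
    "semantic search on rdf data".split(),
    "keyword search on text documents".split(),
    "cooking recipes for pasta".split(),
]
scores = bm25_scores("semantic search".split(), docs)
```

Note that each query term contributes independently, which matches the remark that BM25 ignores the proximity of query terms within a document.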

BM25F for ad-hoc entity retrieval on RDF data

BM25F is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance, term relevance saturation and length normalization.

Source: https://en.wikipedia.org/wiki/Okapi_BM25
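The field-based idea can be sketched as follows. This is a simplified version of BM25F under stated assumptions: the field names, the per-field weights and `b` values, and the smoothed IDF are all hypothetical choices for illustration, not values from the source.

```python
import math
from collections import Counter

def bm25f_scores(query, docs, weights, b, k1=1.2):
    """docs: list of dicts mapping field name -> list of tokens."""
    n_docs = len(docs)
    fields = list(weights)
    # Average length of each field over the whole collection.
    avg_len = {f: sum(len(d.get(f, [])) for d in docs) / n_docs for f in fields}
    # A term counts for a document if it occurs in any of its fields.
    df = Counter()
    for d in docs:
        df.update({t for f in fields for t in d.get(f, [])})
    scores = []
    for d in docs:
        score = 0.0
        for term in query:
            pseudo_tf = 0.0
            for f in fields:
                tokens = d.get(f, [])
                tf = tokens.count(term)
                if tf == 0 or avg_len[f] == 0:
                    continue
                # Per-field length normalization, then per-field weighting.
                norm = 1 - b[f] + b[f] * len(tokens) / avg_len[f]
                pseudo_tf += weights[f] * tf / norm
            if pseudo_tf > 0:
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                # Saturation is applied once, to the combined frequency.
                score += idf * pseudo_tf / (k1 + pseudo_tf)
        scores.append(score)
    return scores

# Toy documents with two fields (made up for this example).
docs = [
    {"title": ["berlin"], "body": "capital of germany".split()},
    {"title": ["germany"], "body": "berlin is the capital".split()},
]
scores = bm25f_scores(
    ["berlin"], docs,
    weights={"title": 3.0, "body": 1.0},
    b={"title": 0.5, "body": 0.75},
)
```

With these weights, a match in the title counts more than a match in the body, so the first document ranks higher.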

The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

Most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text.  

Under the unigram language model the order of words is irrelevant, and so such models are often called "bag of words" models.

Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/language-models-for-information-retrieval-1.html
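A minimal sketch of the query likelihood model with unigram document models follows. The linear interpolation with a collection model (Jelinek-Mercer smoothing, `lam=0.5`) and the toy corpus are assumptions added for illustration. The sketch also assumes that every query term occurs somewhere in the collection; otherwise the logarithm would be undefined.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """log P(query | document model) under a smoothed unigram model."""
    tf = Counter(doc)
    log_p = 0.0
    for term in query:
        p_doc = tf[term] / len(doc)                    # maximum-likelihood estimate
        p_coll = collection_tf[term] / collection_len  # background model
        # Interpolating avoids zero probability for terms the document lacks.
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

# Toy corpus (made up for this example).
docs = [
    "semantic search on rdf data".split(),
    "keyword search on text documents".split(),
    "cooking recipes for pasta".split(),
]
collection_tf = Counter(t for d in docs for t in d)
collection_len = sum(len(d) for d in docs)

query = "semantic search".split()
ranked = sorted(
    range(len(docs)),
    key=lambda i: query_log_likelihood(query, docs[i], collection_tf, collection_len),
    reverse=True,
)
```

Because the model is a unigram model, reordering the query words does not change the score.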

5.1.1 and 5.1.2: I understood very little of these.

5.1.3

ObjectRank adapts PageRank to keyword search on databases. The computed scores depend on the query. Intuitively, a random surfer starts at a database object that matches the keyword and then follows links pertaining to foreign keys. Edge weights are, again, based on types and assigned manually. For example, in a bibliographic database, citations are followed with high probability. Like this, the approach allows relevant objects to be found even if they do not directly mention the query keyword.
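The random-surfer intuition can be sketched as a personalized PageRank in which the surfer restarts only at objects matching the keyword. This is a simplified illustration, not the published ObjectRank algorithm: the edge types, their weights, the damping factor, and the tiny bibliographic graph are all hypothetical.

```python
def object_rank(edges, base_set, damping=0.85, iterations=50):
    """edges: list of (source, target, weight); base_set: objects matching the keyword."""
    nodes = {n for s, t, _ in edges for n in (s, t)} | set(base_set)
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    # The surfer (re)starts only at objects that match the keyword.
    restart = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for s, t, w in edges:
            # Follow an edge with probability proportional to its type weight.
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        # Simplification: score reaching nodes without outgoing edges is dropped.
        rank = new_rank
    return rank

# Hypothetical, manually assigned weights per edge type.
edge_type_weight = {"cites": 0.7, "written_by": 0.2}
links = [("p1", "cites", "p2"), ("p2", "cites", "p3"), ("p1", "written_by", "a1")]
edges = [(s, t, edge_type_weight[kind]) for s, kind, t in links]

# Only paper p1 mentions the query keyword.
rank = object_rank(edges, base_set={"p1"})
```

Papers `p2` and `p3` receive a non-zero score even though only `p1` matches the keyword, and the cited paper `p2` scores higher than the author `a1` because citation edges carry the larger weight.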

TripleRank extends the HITS algorithm to semantic-web data. HITS is a variant of PageRank, which computes hub and authority scores for each node of a sub-graph constructed from the given query.
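For reference, here is a minimal sketch of plain HITS on a small directed graph. It is not TripleRank itself, which works on semantic-web data; the graph and the iteration count are made up for the example.

```python
import math

def hits(edges, iterations=50):
    """edges: list of (source, target) pairs. Returns (hub, authority) score dicts."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # A good hub points to good authorities.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```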

5.2 Indexing

The inverted index is a well-researched data structure and important for information retrieval in general.

In the simplest realization, a virtual document is constructed for each entity, consisting of (all or a subset of) the words from the triples with that entity as subject. ... Keyword matches for object and predicate names (e.g., find triples where the predicate matches author) at the price of a larger query time compared to vanilla BM25 indexing. In an alternative variant, there is a field for each distinct predicate. This still allows matches to be restricted to a certain predicate (e.g., foaf:author), but keyword matches for predicates are no longer possible.
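The variant with one field per predicate can be sketched as a small fielded inverted index. The triples, the whitespace tokenization, and the helper name `build_fielded_index` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Toy triples (made up for this example).
triples = [
    ("ex:Book1", "foaf:author", "Jane Doe"),
    ("ex:Book1", "dc:title", "Semantic Search Basics"),
    ("ex:Book2", "dc:title", "Jane Eyre"),
]

def build_fielded_index(triples):
    """Map (predicate, word) -> set of subject entities whose object contains the word."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        for word in obj.lower().split():
            index[(predicate, word)].add(subject)
    return index

index = build_fielded_index(triples)

# Restrict the keyword "jane" to the foaf:author field.
authors_named_jane = index[("foaf:author", "jane")]

# Match "jane" in any field, as in a single virtual document per entity.
anywhere = set().union(*(ents for (pred, word), ents in index.items() if word == "jane"))
```

Since the predicate is part of the index key rather than indexed text, a keyword such as "author" cannot match the predicate name itself, which is the trade-off the paragraph describes.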

Star-shaped SPARQL queries (with one entity at the center), where predicate and relation names can be matched via keyword queries.

5.3 Ontology Matching and Merging

To cover the data relevant for a given application, often several different knowledge bases need to be considered. .... A problem is that these knowledge bases may contain different representations of the same real-world entity. ... To make proper use of the data, their ontologies (their classes/concepts, properties, relations) as well as their actual population (instances) should either be linked or merged.

Identifying links between classes and properties, which is referred to as ontology matching.

Approaches to ontology matching mainly make use of matching strategies that use terminological and structural data.

Identifying links between instances is known as instance matching. <owl:sameAs>

Similar to ontology matching, to match two instances, their attribute values are compared. This involves using string similarity (e.g., edit distance and extensions, and common q-grams), phonetic similarity (similar sounding field names are similar, even if they are spelled differently) or numerical similarity (difference) depending on the data type.
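Two of the string similarity measures mentioned above can be sketched as follows: the Levenshtein edit distance, and a similarity based on shared q-grams. Using the Jaccard coefficient over q-gram sets with `q=2` is an assumption made for this example; other ways of comparing q-grams exist.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def qgrams(s, q=2):
    """Set of all substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Jaccard coefficient of the two q-gram sets (1.0 means identical sets)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

distance = edit_distance("kitten", "sitting")
similarity = qgram_similarity("night", "nacht")
```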

Ontology Alignment Evaluation Initiative (OAEI)

Merge ontologies using their alignments into a single coherent knowledge base. Merging these involves merging their schema/ontology (concepts, relations etc.) as well as merging duplicate instances, resolving conflicting names and attribute values. 
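The instance side of merging can be sketched as follows: instances linked by an alignment are grouped, and their attribute values are pooled so that conflicts stay visible. The identifiers, the attribute values, and the choice to keep all conflicting values in a set are hypothetical; a real system would need an explicit conflict resolution policy.

```python
from collections import defaultdict

def merge_instances(same_as, attributes):
    """same_as: pairs of equivalent instance ids; attributes: (instance, attr, value)."""
    parent = {}

    def find(x):
        # Union-find lookup with path compression.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Use the lexicographically smaller id as the canonical name.
            parent[max(ra, rb)] = min(ra, rb)

    merged = defaultdict(lambda: defaultdict(set))
    for entity, attr, value in attributes:
        merged[find(entity)][attr].add(value)
    return merged

# Made-up data from two hypothetical knowledge bases.
same_as = [("kb1:Berlin", "kb2:berlin_city")]
attributes = [
    ("kb1:Berlin", "population", "3.6M"),
    ("kb2:berlin_city", "population", "3.7M"),
    ("kb2:berlin_city", "country", "Germany"),
]
merged = merge_instances(same_as, attributes)
```

Here both instances collapse into `kb1:Berlin`, whose `population` attribute holds two conflicting values that still have to be resolved.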

5.4 Inference

** this could be an interesting research problem **

Surprisingly, only few systems make use of inference as an integral part of their approach to semantic search. Nonetheless, inference will certainly play a more important role in the future.

A lot of triple stores include an inference engine. In addition to triples, these require as input a set of inference rules, for example, that the facts A is ancestor of B, and B is ancestor of C imply that A is ancestor of C. First, we introduce some languages that can be used to express these rules. ... [the survey goes as far as SWRL but does not mention SPIN or SHACL] ... We then describe some triple stores and engines (also referred to as reasoners) that allow inference over triples.
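The ancestor rule can be sketched as naive forward chaining: the rule is applied repeatedly until no new triple is derived. The predicate name `ancestorOf` and the triples are hypothetical, and real reasoners use far more efficient strategies than this loop.

```python
def infer_transitive(triples, predicate):
    """Add every triple implied by: (a p b) and (b p c) => (a p c)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in facts if p == predicate]
        for a, b in pairs:
            for b2, c in pairs:
                if b == b2 and (a, predicate, c) not in facts:
                    facts.add((a, predicate, c))
                    changed = True
    return facts

triples = {
    ("A", "ancestorOf", "B"),
    ("B", "ancestorOf", "C"),
    ("C", "ancestorOf", "D"),
}
inferred = infer_transitive(triples, "ancestorOf")
new_facts = inferred - triples
```

A query for all descendants of `A` can then be answered by a plain lookup over `inferred`, without reasoning at query time.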

https://www.w3.org/wiki/RdfStoreBenchmarking

The Future of Semantic Search

It appears that using the latest invention for a particular sub-problem does not necessarily improve the overall quality for a complex task; very careful engineering is more important.

Machine learning plays an important role in making the current systems better, mostly by learning the best combination of features (many of which have been used in rule-based or manually tuned systems before) automatically. The results achieved with these approaches consistently outperform previous rule-based or manually tuned approaches by a few percent, but also not more.

The development as described so far is bound to hit a barrier. That barrier is an actual understanding of the meaning of the information that is being sought. We said in our introduction that semantic search is search with meaning. But somewhat ironically, all the techniques that are in use today (and which we described in this survey) merely simulate an understanding of this meaning, and they simulate it rather primitively.


Note: an online book on IR that may cover more of the basic concepts

https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Comments

  1. Using inference for search could be interesting, but does it solve the problem of delimiting the searcher's context?
