Literature Review III - Advanced Techniques used for Semantic Search & The Future of Semantic Search

 5.1 Ranking

 ***standard ranking techniques for document-centric keyword search on text, such as: BM25 scoring, language models***

BM25 replaced TF/IDF as the default scoring function in Elasticsearch.

BM25 is a ranking function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. It is used by search engines to rank matching documents according to their relevance to a given search query and is often referred to as “Okapi BM25,” since the Okapi information retrieval system was the first system implementing this function. The BM25 retrieval formula belongs to the BM family of retrieval models (BM stands for Best Match).

Source: https://doi.org/10.1007/978-0-387-39940-9_921
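To make the scoring concrete, here is a minimal sketch of one common member of the BM25 family. The function name `bm25_scores`, the toy documents, the parameter values `k1=1.2` and `b=0.75`, and the smoothed IDF variant are my own assumptions for illustration, not something taken from the source above.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    k1 controls term-frequency saturation, b controls length normalization
    (both values are assumed defaults, tune them for real data).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative (an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    "semantic search on text".split(),
    "keyword search on databases".split(),
    "ranking entities in a knowledge base".split(),
]
scores = bm25_scores("semantic search".split(), docs)
```

Note that each query term contributes independently, which matches the remark that BM25 ignores the proximity of query terms within a document.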

BM25F for ad-hoc entity retrieval on RDF data

BM25F is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance, term relevance saturation and length normalization.

Source: https://en.wikipedia.org/wiki/Okapi_BM25
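The field-based idea can be sketched as follows: term frequencies are length-normalized and weighted per field, summed into one pseudo-frequency, and only then saturated. This is a simplified version of BM25F under my own assumptions; the names `bm25f_score`, `field_weights`, `field_b`, the field names, and all numeric values are hypothetical.

```python
def bm25f_score(query, doc, field_weights, field_b, avg_field_len, idf, k1=1.2):
    """Simplified BM25F score of one document.

    `doc` maps a field name to its list of tokens. Each field has its own
    weight and its own length-normalization parameter b.
    """
    score = 0.0
    for term in query:
        pseudo_tf = 0.0
        for field, tokens in doc.items():
            if not tokens:
                continue
            tf = tokens.count(term)
            norm = 1 - field_b[field] + field_b[field] * len(tokens) / avg_field_len[field]
            pseudo_tf += field_weights[field] * tf / norm
        # Saturation is applied once, on the combined pseudo-frequency.
        score += idf.get(term, 0.0) * pseudo_tf / (k1 + pseudo_tf)
    return score

# Hypothetical setup: a match in the title counts three times as much as in the body.
field_weights = {"title": 3.0, "body": 1.0}
field_b = {"title": 0.5, "body": 0.5}
avg_field_len = {"title": 2.0, "body": 3.0}
idf = {"bm25": 1.0}  # assumed precomputed IDF value

doc_a = {"title": ["bm25", "ranking"], "body": ["notes", "on", "search"]}
doc_b = {"title": ["search", "notes"], "body": ["bm25", "on", "ranking"]}
score_a = bm25f_score(["bm25"], doc_a, field_weights, field_b, avg_field_len, idf)
score_b = bm25f_score(["bm25"], doc_b, field_weights, field_b, avg_field_len, idf)
```

Here `doc_a` outranks `doc_b` only because the match sits in the more heavily weighted field.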

The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.

Most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text.  

Under the unigram language model the order of words is irrelevant, and so such models are often called "bag of words" models.

Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model.

Source: https://nlp.stanford.edu/IR-book/html/htmledition/language-models-for-information-retrieval-1.html
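A minimal sketch of the query likelihood model with a unigram document model could look like this. The linear interpolation with the collection model (Jelinek-Mercer smoothing), the value `lam=0.5`, the function name, and the toy documents are assumptions on my part.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """log P(query | document model) under a smoothed unigram model."""
    doc_tf = Counter(doc)
    log_p = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = collection_tf[term] / collection_len
        # Mix the document model with the collection model so that a
        # query word missing from the document does not zero the score.
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_p += math.log(p)
    return log_p

# Hypothetical toy collection.
docs = [
    "semantic search uses semantic models".split(),
    "keyword search on databases".split(),
]
collection_tf = Counter(token for d in docs for token in d)
collection_len = sum(collection_tf.values())
query = ["semantic", "search"]
ll = [query_log_likelihood(query, d, collection_tf, collection_len) for d in docs]
```

Word order plays no role in the computation, which is the "bag of words" property, and the document that contains the query words more often gets the higher likelihood.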

5.1.1 and 5.1.2: I understood very little of these.

5.1.3

ObjectRank adapts PageRank to keyword search on databases. The computed scores depend on the query. Intuitively, a random surfer starts at a database object that matches the keyword and then follows links pertaining to foreign keys. Edge weights are, again, based on types and assigned manually. For example, in a bibliographic database, citations are followed with high probability. Like this, the approach allows relevant objects to be found even if they do not directly mention the query keyword.
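The random-surfer intuition can be sketched as a personalized PageRank in which the surfer restarts at the objects matching the keyword and follows typed edges with manually assigned weights. This is a simplification and not the actual ObjectRank implementation; the function name `keyword_rank`, the edge types, the weights, and the damping factor are all assumptions.

```python
def keyword_rank(edges, type_weight, base_set, damping=0.85, iters=50):
    """Personalized PageRank over typed edges.

    edges: list of (source, edge_type, target) tuples.
    type_weight: manually assigned weight per edge type.
    base_set: objects that directly match the query keyword.
    """
    nodes = {n for s, _, t in edges for n in (s, t)} | set(base_set)
    out = {n: [] for n in nodes}
    for s, etype, t in edges:
        out[s].append((t, type_weight[etype]))
    restart = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for s in nodes:
            total = sum(w for _, w in out[s])
            if total == 0:
                # Dangling object: send its mass back to the keyword matches.
                for n in nodes:
                    nxt[n] += damping * rank[s] * restart[n]
                continue
            for t, w in out[s]:
                nxt[t] += damping * rank[s] * w / total
        rank = nxt
    return rank

# Hypothetical bibliographic database: only p1 mentions the keyword.
edges = [
    ("p1", "cites", "p2"),
    ("p1", "written_by", "a1"),
    ("p3", "cites", "p1"),
]
type_weight = {"cites": 0.7, "written_by": 0.3}  # citations followed more often
rank = keyword_rank(edges, type_weight, base_set={"p1"})
```

In this toy graph `p2` gets a positive score although it never mentions the keyword, because it is cited by a matching paper, while `p3` stays at zero since nothing reachable from the keyword matches links to it.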

TripleRank extends the HITS algorithm to semantic-web data. HITS is a variant of PageRank, which computes hub and authority scores for each node of a sub-graph constructed from the given query.
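For reference, the hub and authority computation of plain HITS can be sketched as below. This is only the basic algorithm on an untyped link graph, not TripleRank itself; the function name and the toy graph are assumptions.

```python
import math

def hits(edges, iters=50):
    """Plain HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of the hub scores of nodes pointing to it.
        auth = {n: 0.0 for n in nodes}
        for s, t in edges:
            auth[t] += hub[s]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of the authority scores of the nodes it points to.
        hub = {n: 0.0 for n in nodes}
        for s, t in edges:
            hub[s] += auth[t]
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical query sub-graph.
edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
```

In the toy graph, `c` ends up as the strongest authority (two incoming links) and `a` as the strongest hub (it points to both authorities).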

5.2 Indexing

The inverted index is a well-researched data structure and important for information retrieval in general.

In the simplest realization, a virtual document is constructed for each entity, consisting of (all or a subset of) the words from the triples with that entity as subject. ... Keyword matches for object and predicate names (e.g., find triples where the predicate matches author) at the price of a larger query time compared to vanilla BM25 indexing. In an alternative variant, there is a field for each distinct predicate. This still allows matches to be restricted to a certain predicate (e.g., foaf:author), but keyword matches for predicates are no longer possible.
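The simplest realization can be sketched as follows: build one virtual document per subject entity and index its words in an inverted index. Including the predicate words next to the object words is my assumption, as are the tokenizer, the function names, and the example triples.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_virtual_documents(triples):
    """One bag of words per entity, taken from the triples where it is the subject."""
    docs = defaultdict(list)
    for subject, predicate, obj in triples:
        docs[subject].extend(tokenize(predicate) + tokenize(obj))
    return docs

def build_inverted_index(docs):
    """Map each word to the set of entities whose virtual document contains it."""
    index = defaultdict(set)
    for entity, tokens in docs.items():
        for token in tokens:
            index[token].add(entity)
    return index

# Hypothetical triples.
triples = [
    ("ex:paper1", "foaf:author", "Alice Smith"),
    ("ex:paper1", "dc:title", "Semantic Search"),
    ("ex:paper2", "dc:title", "Keyword Search on Databases"),
]
index = build_inverted_index(build_virtual_documents(triples))
```

With this layout a keyword such as `author` matches the predicate name, but the index can no longer tell whether a word came from a predicate or from an object. The per-predicate field variant trades that the other way around.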

Star-shaped SPARQL queries (with one entity at the center), where predicate and relation names can be matched via keyword queries.

5.3 Ontology Matching and Merging

To cover the data relevant for a given application, often several different knowledge bases need to be considered. ... A problem is that these knowledge bases may contain different representations of the same real-world entity. ... To make proper use of the data, their ontologies (their classes/concepts, properties, relations) as well as their actual population (instances) should either be linked or merged.

Identifying links between classes and properties, which is referred to as ontology matching.

Approaches to ontology matching mainly make use of matching strategies that use terminological and structural data.

Identifying links between instances is known as instance matching. <owl:sameAs>

Similar to ontology matching, to match two instances, their attribute values are compared. This involves using string similarity (e.g., edit distance and extensions, and common q-grams), phonetic similarity (similar sounding field names are similar, even if they are spelled differently) or numerical similarity (difference) depending on the data type.
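Two of the string similarity measures mentioned above can be sketched like this. The padding character `#`, the default `q=2`, and the use of Jaccard overlap for the q-gram measure are my own choices for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

def qgrams(s, q=2):
    """Set of q-grams of s, padded so that prefixes and suffixes count too."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Jaccard overlap of the q-gram sets, between 0.0 and 1.0."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0
```

For example, `edit_distance("kitten", "sitting")` is 3, and `qgram_similarity` of a string with itself is 1.0.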

Ontology Alignment Evaluation Initiative (OAEI)

Merge ontologies using their alignments into a single coherent knowledge base. Merging these involves merging their schema/ontology (concepts, relations etc.) as well as merging duplicate instances, resolving conflicting names and attribute values. 
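The instance side of merging can be sketched with a union-find structure: instances connected by sameAs links collapse into one representative, and their attribute values are pooled. Keeping conflicting values side by side in a set instead of resolving them is a simplification of mine; the function name and the knowledge base identifiers are hypothetical.

```python
def merge_instances(same_as_links, attributes):
    """Merge instances linked by sameAs and pool their attribute values.

    same_as_links: list of (instance_a, instance_b) pairs.
    attributes: dict mapping an instance to a dict of attribute -> value.
    Returns a dict: representative -> attribute -> set of values.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Pick the lexicographically smaller name as representative.
            parent[max(ra, rb)] = min(ra, rb)

    merged = {}
    for instance, attrs in attributes.items():
        rep = find(instance)
        for key, value in attrs.items():
            merged.setdefault(rep, {}).setdefault(key, set()).add(value)
    return merged

# Hypothetical instances from two knowledge bases.
links = [("kb1:Berlin", "kb2:Berlin_DE")]
attributes = {
    "kb1:Berlin": {"label": "Berlin"},
    "kb2:Berlin_DE": {"label": "Berlin, Germany", "country": "Germany"},
}
merged = merge_instances(links, attributes)
```

The two `label` values end up in the same set, which makes the conflict visible. Choosing one of them is the resolution step that this sketch leaves out.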

5.4 Inference

** this could be an interesting research problem **

Surprisingly, only few systems make use of inference as an integral part of their approach to semantic search. Nonetheless, inference will certainly play a more important role in the future.

A lot of triple stores include an inference engine. In addition to triples, these require as input a set of inference rules, for example, that the facts A is ancestor of B, and B is ancestor of C imply that A is ancestor of C. First, we introduce some languages that can be used to express these rules. ... [the survey goes as far as SWRL but does not mention SPIN or SHACL] ... We then describe some triple stores and engines (also referred to as reasoners) that allow inference over triples.
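The ancestor rule can be sketched as naive forward chaining: apply the rule repeatedly until no new triple is derived. Real reasoners are far more general and efficient; the function name `infer_transitive` and the predicate name `ancestorOf` are assumptions.

```python
def infer_transitive(triples, predicate):
    """Return the triples plus everything implied by transitivity of `predicate`."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in facts if p == predicate]
        for a, b in pairs:
            for b2, c in pairs:
                # Rule: (a p b) and (b p c) imply (a p c).
                if b == b2 and (a, predicate, c) not in facts:
                    facts.add((a, predicate, c))
                    changed = True
    return facts

# Hypothetical input triples.
triples = [
    ("A", "ancestorOf", "B"),
    ("B", "ancestorOf", "C"),
    ("C", "ancestorOf", "D"),
]
closure = infer_transitive(triples, "ancestorOf")
```

From the three stated facts the loop derives three more: A is ancestor of C, B is ancestor of D, and A is ancestor of D.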

https://www.w3.org/wiki/RdfStoreBenchmarking

The Future of Semantic Search

It appears that using the latest invention for a particular sub-problem does not necessarily improve the overall quality for a complex task; very careful engineering is more important.

Machine learning plays an important role in making the current systems better, mostly by learning the best combination of features (many of which have been used in rule-based or manually tuned systems before) automatically. The results achieved with these approaches consistently outperform previous rule-based or manually tuned approaches by a few percent, but also not more.

The development as described so far is bound to hit a barrier. That barrier is an actual understanding of the meaning of the information that is being sought. We said in our introduction that semantic search is search with meaning. But somewhat ironically, all the techniques that are in use today (and which we described in this survey) merely simulate an understanding of this meaning, and they simulate it rather primitively.


Note: an online book on IR that may cover more of the basic concepts

https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Comments

  1. Using inference for search could be interesting, but does it solve the problem of delimiting the searcher's context?
