Pesquisa Bibliográfica III - Advanced Techniques used for Semantic Search & The Future of Semantic Search
5.1 Ranking
***standard ranking techniques for document-centric keyword search on text, such as: BM25 scoring, language models***
BM25 has replaced TF/IDF as the default scoring in Elasticsearch
BM25 is a ranking function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. It is used by search engines to rank matching documents according to their relevance to a given search query and is often referred to as "Okapi BM25," since the Okapi information retrieval system was the first system implementing this function. The BM25 retrieval formula belongs to the BM family of retrieval models (BM stands for Best Match).
Source: https://doi.org/10.1007/978-0-387-39940-9_921
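A minimal sketch of BM25 scoring, to make the formula concrete. The parameter values (k1=1.2, b=0.75) and the smoothed IDF variant are illustrative choices, not something the survey prescribes:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    tfs = [Counter(d) for d in docs]
    # document frequency of each query term
    df = {t: sum(1 for tf in tfs if t in tf) for t in query_terms}
    scores = []
    for d, tf in zip(docs, tfs):
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            # smoothed IDF (kept non-negative for rare/frequent terms)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term-frequency saturation plus document-length normalization
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores
```

Note how the score is a sum over query terms only: the relative proximity of the terms inside the document plays no role, exactly as described above.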
BM25F for ad-hoc entity retrieval on RDF data
BM25F is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance, term relevance saturation and length normalization.
Source: https://en.wikipedia.org/wiki/Okapi_BM25
The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often.
Most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text.
Under the unigram language model the order of words is irrelevant, and so such models are often called "bag of words" models.
Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model.
Source: https://nlp.stanford.edu/IR-book/html/htmledition/language-models-for-information-retrieval-1.html
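The query likelihood model fits in a few lines: score a document by the probability that its unigram model generates the query. The Jelinek-Mercer smoothing with λ=0.5 is an illustrative choice (the IR book also discusses Dirichlet smoothing):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc, collection, lam=0.5):
    """Log P(q|d) under a unigram model, smoothed with the collection model."""
    tf = Counter(doc)          # term counts in this document
    cf = Counter(collection)   # term counts in the whole collection
    dl, cl = len(doc), len(collection)
    logp = 0.0
    for t in query_terms:
        # mix the document model with the collection model (Jelinek-Mercer)
        p = lam * tf[t] / dl + (1 - lam) * cf[t] / cl
        if p == 0:
            return float("-inf")  # term never seen anywhere
        logp += math.log(p)
    return logp
```

Documents that contain the query words often get a higher likelihood, which is exactly the intuition stated above; word order never enters the computation (bag of words).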
5.1.1 and 5.1.2: I understood very little of these
5.1.3
ObjectRank adapts PageRank to keyword search on databases. The computed scores depend on the query. Intuitively, a random surfer starts at a database object that matches the keyword and then follows links pertaining to foreign keys. Edge weights are, again, based on types and assigned manually. For example, in a bibliographic database, citations are followed with high probability. In this way, the approach allows relevant objects to be found even if they do not directly mention the query keyword.
TripleRank extends the HITS algorithm to semantic-web data. HITS is a variant of PageRank, which computes hub and authority scores for each node of a sub-graph constructed from the given query.
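The basic HITS iteration that TripleRank builds on can be sketched as follows (the dict-of-lists graph representation and the fixed iteration count are assumptions for illustration; TripleRank itself works on a tensor of predicate-specific adjacency matrices):

```python
import math

def hits(graph, iterations=20):
    """Hub and authority scores for a directed graph {node: [out-neighbors]}."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority of a node: sum of hub scores of nodes pointing to it
        new_auth = {n: 0.0 for n in nodes}
        for u, vs in graph.items():
            for v in vs:
                new_auth[v] += hub[u]
        norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        auth = {n: x / norm for n, x in new_auth.items()}
        # hub of a node: sum of authority scores of the nodes it points to
        new_hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        hub = {n: x / norm for n, x in new_hub.items()}
    return hub, auth
```

A node pointed to by many good hubs gets a high authority score, and a node pointing to many good authorities gets a high hub score; the two reinforce each other across iterations.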
5.2 Indexing
The inverted index is a well-researched data structure and important for information retrieval in general.
In the simplest realization, a virtual document is constructed for each entity, consisting of (all or a subset of) the words from the triples with that entity as subject. ... This allows keyword matches for object and predicate names (e.g., find triples where the predicate matches author), at the price of a larger query time compared to vanilla BM25 indexing. In an alternative variant, there is a field for each distinct predicate. This still allows matches to be restricted to a certain predicate (e.g., foaf:author), but keyword matches for predicates are no longer possible.
This supports star-shaped SPARQL queries (with one entity at the center), where predicate and relation names can be matched via keyword queries.
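The "one field per distinct predicate" variant can be sketched with a toy triple set (the entity names and predicates below are made up for illustration; a real engine would of course use a proper inverted-index implementation, not Python sets):

```python
from collections import defaultdict

# Toy triples: (subject, predicate, object), with literal objects as text.
triples = [
    ("ex:Paper1", "foaf:author", "Alice Smith"),
    ("ex:Paper1", "dc:title", "Semantic Search on RDF"),
    ("ex:Paper2", "foaf:author", "Bob Jones"),
]

# One "virtual document" per entity; one field per distinct predicate.
index = defaultdict(lambda: defaultdict(set))  # term -> field -> {entities}
for s, p, o in triples:
    for term in o.lower().split():
        index[term][p].add(s)

def search(term, field=None):
    """Entities whose given field (or any field, if None) contains the term."""
    postings = index.get(term.lower(), {})
    if field is not None:
        return postings.get(field, set())
    return set().union(*postings.values()) if postings else set()
```

Restricting a match to a predicate is a cheap field lookup (`search("alice", "foaf:author")`), but the predicate names themselves are field labels, not indexed terms — which is exactly why keyword matches for predicates are no longer possible in this variant.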
5.3 Ontology Matching and Merging
To cover the data relevant for a given application, often several different knowledge bases need to be considered. ... A problem is that these knowledge bases may contain different representations of the same real-world entity. ... To make proper use of the data, their ontologies (their classes/concepts, properties, relations) as well as their actual population (instances) should either be linked or merged.
Identifying links between classes and properties is referred to as ontology matching.
Approaches to ontology matching mainly make use of matching strategies that use terminological and structural data.
Identifying links between instances is known as instance matching (such links are typically expressed with owl:sameAs).
Similar to ontology matching, to match two instances their attribute values are compared. This involves string similarity (e.g., edit distance and its extensions, or common q-grams), phonetic similarity (similar-sounding names are considered similar even if they are spelled differently), or numerical similarity (difference), depending on the data type.
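Two of the string-similarity measures mentioned, sketched in plain Python (using Jaccard similarity over q-gram sets is one common choice; the survey does not fix a particular combination):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def qgram_similarity(a, b, q=2):
    """Jaccard similarity over the sets of q-grams of the two strings."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```

Attribute values with a small edit distance or many common q-grams are candidate matches; a matcher would combine such scores across attributes before deciding on an owl:sameAs link.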
Ontology Alignment Evaluation Initiative (OAEI)
Merge ontologies using their alignments into a single coherent knowledge base. Merging these involves merging their schemas/ontologies (concepts, relations, etc.) as well as merging duplicate instances and resolving conflicting names and attribute values.
5.4 Inference
** could be an interesting research problem **
Surprisingly, only a few systems make use of inference as an integral part of their approach to semantic search. Nonetheless, inference will certainly play a more important role in the future.
A lot of triple stores include an inference engine. In addition to triples, these require as input a set of inference rules, for example, that the facts "A is ancestor of B" and "B is ancestor of C" imply that A is ancestor of C. First, we introduce some languages that can be used to express these rules. ... it goes as far as SWRL but does not mention SPIN or SHACL ... We then describe some triple stores and engines (also referred to as reasoners) that allow inference over triples.
https://www.w3.org/wiki/RdfStoreBenchmarking
The Future of Semantic Search
It appears that using the latest invention for a particular sub-problem does not necessarily improve the overall quality for a complex task; very careful engineering is more important.
Machine learning plays an important role in making the current systems better, mostly by automatically learning the best combination of features (many of which were used before in rule-based or manually tuned systems). The results achieved with these approaches consistently outperform previous rule-based or manually tuned approaches, but only by a few percent.
The development as described so far is bound to hit a barrier. That barrier is an actual understanding of the meaning of the information that is being sought. We said in our introduction that semantic search is search with meaning. But somewhat ironically, all the techniques that are in use today (and which we described in this survey) merely simulate an understanding of this meaning, and they simulate it rather primitively.
Note: online book on IR that may cover more of the basic concepts
https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Using inference for search may be interesting, but does it actually solve the problem of delimiting the search engine's context?