
Literature Review II - Approaches and Systems for Semantic Search

Continuing the post on the Semantic Search survey ... Section 4

Each subsection has a topic on benchmarks because in Information Retrieval, as in databases, it is common practice to use benchmarks to evaluate systems and proposals.

Each subsection begins with a table that characterizes the group of systems under analysis, containing the following items: Data and Search (the two dimensions of the classification), Approach (a description of the steps or guidelines common to the group), Strengths, and Limitations.

4.1 Keyword Search in Text

Basic techniques in matching are: lemmatization or stemming (houses -> house or hous) ... we still have no Portuguese stemmer solution in busc@NIMA ... , synonyms (search -> retrieval) ... the Portuguese WordNet may help (synsets) ..., error correction (algoritm -> algorithm), relevance feedback (given some relevant documents, enhance the query to find more relevant documents), proximity (of some or all of the query words) and concept models (matching the topic of a document, instead of or in addition to its words).
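A minimal sketch of three of the matching techniques listed above (stemming, synonyms, error correction), using only the Python standard library. The suffix rules, the vocabulary, and the synonym table are toy assumptions for illustration; a real system would use a proper stemmer and a lexical resource such as WordNet.

```python
import difflib

# Hypothetical toy data, for illustration only.
VOCABULARY = ["algorithm", "house", "search", "retrieval"]
SYNONYMS = {"search": ["retrieval"]}  # keyed by stem


def naive_stem(word):
    """Strip a few English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


STEMMED_VOCABULARY = sorted({naive_stem(w) for w in VOCABULARY})


def correct(stem):
    """Map a possibly misspelled stem to the closest known stem, if any."""
    if stem in STEMMED_VOCABULARY:
        return stem
    matches = difflib.get_close_matches(stem, STEMMED_VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else stem


def expand_query(query):
    """Return the set of stems to match: corrected query stems plus synonyms."""
    terms = set()
    for token in query.lower().split():
        stem = correct(naive_stem(token))
        terms.add(stem)
        for synonym in SYNONYMS.get(stem, []):
            terms.add(naive_stem(synonym))
    return terms


print(expand_query("algoritm houses search"))
```

Here both "houses" and "house" reduce to the stem "hous", the misspelled "algoritm" is mapped to "algorithm", and "search" also brings in "retrieval".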

4.2 Structured Search in Knowledge Bases

When the knowledge base is stored in an RDBMS and the query language is SPARQL, queries can be translated to equivalent SQL queries. When the data is stored in an RDBMS using a non-trivial schema (that is, not just one big table of triples), a mapping is needed to specify how to make triples out of this data. For this mapping, R2RML [2012] has emerged as a standard. ... a W3C standard that D2RQ uses and that I studied in 2019.2 .... Given such a mapping, generating a SQL query that can be executed as efficiently as possible becomes a non-trivial problem.
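To make the idea of a relational-to-triples mapping concrete, here is a minimal sketch using an in-memory SQLite table. The `professor` table, the `http://example.org/` URIs, and the `MAPPING` dictionary are all assumptions for illustration; the dictionary only mimics the spirit of R2RML and is not its actual syntax.

```python
import sqlite3

# Hypothetical relational schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE professor (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO professor VALUES (?, ?, ?)",
    [(1, "Ana", "CS"), (2, "Rui", "Math")],
)

# Hypothetical mapping: one subject URI per row, one predicate per column.
MAPPING = {
    "table": "professor",
    "subject_template": "http://example.org/professor/{id}",
    "predicates": {
        "name": "http://example.org/name",
        "dept": "http://example.org/dept",
    },
}


def rows_to_triples(conn, mapping):
    """Materialize (subject, predicate, object) triples from the mapped table."""
    columns = ["id"] + list(mapping["predicates"])
    sql = f"SELECT {', '.join(columns)} FROM {mapping['table']} ORDER BY id"
    triples = []
    for row in conn.execute(sql):
        record = dict(zip(columns, row))
        subject = mapping["subject_template"].format(id=record["id"])
        for column, predicate in mapping["predicates"].items():
            triples.append((subject, predicate, record[column]))
    return triples


def pattern_to_sql(mapping, predicate):
    """Translate the triple pattern (?s, predicate, ?o) into SQL on the mapped table."""
    column = next(c for c, p in mapping["predicates"].items() if p == predicate)
    return f"SELECT id, {column} FROM {mapping['table']} ORDER BY id"


for triple in rows_to_triples(conn, MAPPING):
    print(triple)
print(pattern_to_sql(MAPPING, "http://example.org/dept"))
```

The second function hints at why query translation is non-trivial: even a single triple pattern must be rewritten against the underlying columns, and a realistic SPARQL query combines many patterns into joins.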

Dedicated triple stores can use index data structures that are tailored to sets of triples (in particular, exploiting the high repetitiveness and hence compressibility involved). ... an advantage over an RDBMS ....

RDF-3X builds an index for each of the six possible permutations of a triple (SPO, SOP, OPS, OSP, POS, PSO, where S = subject, P = predicate, O = object) ... AllegroGraph also has these 6 combinations as the default indexes of every repository ... . This enables fast retrieval of the matching subset for each part of a SPARQL query. Join orders are optimized for typical SPARQL queries, including star-shaped (all triples have the same variable as their subject) and paths (the object of one triple is the subject of the next). Query plans are ranked using standard database techniques, like estimating the cost via histogram counts. ... there is a plan generator, as in an RDBMS ....
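The six-permutation idea can be sketched with sorted lists and binary search. This is only an illustration of why having every ordering helps (any combination of bound components becomes a prefix lookup); it is not RDF-3X's actual implementation, and the class name and sample triples are made up.

```python
import bisect
from itertools import permutations

POSITIONS = {"S": 0, "P": 1, "O": 2}


class SixIndexStore:
    def __init__(self, triples):
        # One sorted list per ordering: SPO, SOP, PSO, POS, OSP, OPS.
        self.indexes = {}
        for order in permutations("SPO"):
            key = "".join(order)
            self.indexes[key] = sorted(
                tuple(t[POSITIONS[c]] for c in order) for t in triples
            )

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None means 'variable'."""
        bound = {"S": s, "P": p, "O": o}
        fixed = [c for c in "SPO" if bound[c] is not None]
        free = [c for c in "SPO" if bound[c] is None]
        # Pick the ordering in which the bound components form a prefix.
        order = "".join(fixed + free)
        index = self.indexes[order]
        prefix = tuple(bound[c] for c in fixed)
        start = bisect.bisect_left(index, prefix)
        results = []
        for entry in index[start:]:
            if entry[: len(prefix)] != prefix:
                break
            # Restore the (S, P, O) component order.
            results.append(tuple(entry[order.index(c)] for c in "SPO"))
        return results


store = SixIndexStore([
    ("tarantino", "directed", "pulp_fiction"),
    ("tarantino", "acted_in", "pulp_fiction"),
    ("travolta", "acted_in", "pulp_fiction"),
])
print(store.match(p="acted_in", o="pulp_fiction"))
```

With predicate and object bound, the lookup uses the POS ordering and scans only the contiguous run of entries sharing that prefix.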

Triple Stores: Jena, Sesame, and Virtuoso

4.3 Structured Data Extraction from Text

Relationship extraction aims at extracting subject-predicate-object tuples from a given collection of natural language text. 

There is also vast literature on domain-specific extraction, in particular, for the life sciences. For example, extract all pairs of proteins (subject and object) that interact in a certain way (predicate) from a large collection of pertinent publications. The main challenge for such systems is domain-specific knowledge (e.g., the many variants in which protein names are expressed in text), which is beyond the scope of this survey. ... Cristóvão commented on this approach when discussing the INCA workflow ...

For Knowledge Base Construction: (1) entity resolution, sometimes also called entity de-duplication: strings referring to the same entity must be mapped to a unique identifier for that entity ... Entity Linkage is another name for this technique; we use it to match course instructors to their CVs by name ... and (2) knowledge fusion: different triples might contain conflicting or complementary information, which needs to be resolved or unified.
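As a rough illustration of entity resolution by name, the sketch below maps name variants to one identifier after normalizing case, accents, punctuation, and whitespace. The normalization rules, the `entity:N` identifier scheme, and the sample names are assumptions; real entity resolution also has to handle abbreviations, reordered names, and homonyms.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in without_accents.lower())
    return " ".join(cleaned.split())


def resolve_entities(mentions):
    """Assign the same (hypothetical) identifier to mentions with equal normalized names."""
    ids = {}
    assignment = {}
    for mention in mentions:
        key = normalize_name(mention)
        if key not in ids:
            ids[key] = f"entity:{len(ids) + 1}"
        assignment[mention] = ids[key]
    return assignment


mentions = ["José da Silva", "JOSE DA SILVA", "Jose  da Silva.", "Maria Souza"]
for mention, entity_id in resolve_entities(mentions).items():
    print(repr(mention), "->", entity_id)
```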

    4.3.3 Systems for Specialized Extraction 

WebKB [Craven et al., 1998] was one of the first systems to extract triples from hyperlinked documents, namely the website of a computer science department. In their approach, web pages stand for entities (for example, the homepage of a person stands for that person) and links between web pages indicate relations (for example, a link between a person’s homepage and the department homepage is a strong indicator that that person works in that department).... If we did not have the professors' department information, this could be a way to obtain it, but that requires the professor's homepage field to be filled in .... The correspondence between web pages and entities is learned in a supervised fashion using a Naive Bayes classifier with standard word features. Relations are also learned using FOIL (rule-based, supervised learning) with link paths (for example, a link from a person to a department) and anchor text (for example, the word department in the anchor text) as features.

4.4 Keyword Search on Knowledge Bases

  1. Match keywords to entities from the knowledge base; -> Keywords are mapped to nodes in
    this graph. Typically, an inverted index over the (words of the) entity, class, or relation names is used.
  2. Generate candidates for SPARQL queries from these matching entities; -> It is
    challenging to formulate these queries.
  3. Rank candidate queries using graph, lexical, and IR measures; 
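The three steps above can be sketched as follows. Everything here is a toy assumption: the `ex:` resources, their labels, the per-relation triple counts used as a stand-in for a graph-based ranking measure, and the single query template. Real systems explore many more query shapes and combine several ranking signals.

```python
from collections import defaultdict

# Hypothetical knowledge-base vocabulary.
ENTITIES = {"ex:Quentin_Tarantino": "quentin tarantino", "ex:Neil_Armstrong": "neil armstrong"}
CLASSES = {"ex:Film": "film", "ex:Astronaut": "astronaut"}
RELATION_COUNTS = {"ex:directed": 10, "ex:actedIn": 25}  # made-up triple counts


def build_inverted_index(labels):
    """Step 1 support: map each label word to the resources that contain it."""
    index = defaultdict(set)
    for resource, label in labels.items():
        for word in label.split():
            index[word].add(resource)
    return index


def normalize(keyword):
    """Lowercase and strip a plural 's' (a crude normalization)."""
    keyword = keyword.lower()
    return keyword[:-1] if keyword.endswith("s") and len(keyword) > 3 else keyword


def candidate_queries(keywords):
    entity_index = build_inverted_index(ENTITIES)
    class_index = build_inverted_index(CLASSES)
    entities, classes = set(), set()
    # Step 1: match keywords to entities and classes.
    for keyword in map(normalize, keywords.split()):
        entities |= entity_index.get(keyword, set())
        classes |= class_index.get(keyword, set())
    # Step 2: generate one candidate SPARQL query per connecting relation.
    candidates = []
    for entity in sorted(entities):
        for cls in sorted(classes):
            for relation, count in RELATION_COUNTS.items():
                query = f"SELECT ?x WHERE {{ ?x a {cls} . {entity} {relation} ?x }}"
                candidates.append((count, query))
    # Step 3: rank candidates, most frequent relation first.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [query for _, query in candidates]


for query in candidate_queries("tarantino films"):
    print(query)
```

The example also shows why step 2 is hard: the keywords alone do not say whether "directed" or "acted in" is meant, so both candidates are produced and the ranking decides.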

Note: there is some overlap with the techniques of 4.8 Question Answering on Knowledge Bases

To improve usability, some systems also include user feedback in the translation process. This is done, for example, by suggesting keyword completions that lead to results .... query expansion ..., or by allowing the user to select the correct interpretation for each keyword (when several are possible) .... disambiguation ....

Systems for keyword search on relational databases: DBXplorer and DISCOVER use the number of joins to rank answers, while BANKS tries to find the smallest matching subgraph.

Evaluation metrics are MAP (mean average precision) and precision at 10. ... that is, if precision is high in the top K (K = 10), a drop in recall would not be a problem, but the ranking criterion for the results must suit the context ...
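For reference, a minimal sketch of how these two metrics are usually computed from a ranked result list and a set of relevant documents. The document identifiers are made up, and this average precision variant divides by the total number of relevant documents, which is the common convention.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```

In this toy run the relevant documents appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, and precision at 2 is 1/2.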

4.5 Keyword Search on Combined Data

Log analysis: on a large query log from a commercial web-search engine, 40% of queries are for a particular entity (e.g., neil armstrong), 12% are for a particular list of entities (e.g., astronauts who walked on the moon), and 5% are asking for a particular attribute of a particular entity (e.g., birth date neil armstrong). ... in busc@NIMA we have particular entities (by researcher name) and lists of entities (researchers who have published or taught on <search words>) ....

The Semantic Web allows users to provide explicit links between such entities, notably via relations such as owl:sameAs ... we use it in busc@NIMA to link the enrollment ID and the Lattes CV .... or dbpedia:redirect/disambiguate. Not surprisingly, making use of such links can considerably improve result quality.

Recall that "combined data" = text linked to a knowledge base, multiple knowledge bases, or semantic web data

4.6 Semi-Structured Search on Combined Data

Store data in an inverted index or extensions of it; use separate indexes for the text and the knowledge base or use tailor-made combined indexes; provide special-purpose user interfaces adapted for the particular kind of search .... NoSQL polystores combine graph/triples with text (JSON) ....

4.7 Question Answering on Text

4.8 Question Answering on Knowledge Bases

There is some overlap with Section 4.4 on Keyword Search on Knowledge Bases, which is discussed at the beginning of that section.

Natural language questions are often longer and provide more information than keyword queries. For example, compare in what films did quentin tarantino play to quentin tarantino films. The natural language question is more explicit about the expected type of result (films) and more precise about the relation (films in which Quentin Tarantino acted, not films which he directed). At the same time, natural language questions can also be more complex.

Facebook Graph Search supports personalized searches on the relations between persons, places, tags, pictures, etc. An example query is photos of my friends taken at national parks. Results are based on the relationships between the user and her friends and their interests expressed on Facebook. Graph Search was introduced by Facebook in March 2013. It was reduced to a much more restricted version (eliminating most search patterns) in December 2014, mainly due to privacy issues.

Google Search answers an increasing fraction of natural language queries from its internal knowledge base, called Knowledge Graph. As of this writing, the Knowledge Graph is based on Freebase (and not on the much larger Knowledge Vault described in Section 4.3.4) and there is no published work on how this search works.

QALD - annual benchmark of manually selected natural language queries with their SPARQL equivalent

4.9 Question Answering on Combined Data

IBM’s Watson - special issue of the IBM Journal by Pickover [2012] consisting of a series of twelve papers (each about 10 pages) solely about Watson




 


Comments

  1. Using WordNet to expand Busc@/Quem@ queries with synonyms in multiple languages is promising, but to avoid reducing the precision of the first results (Top K) we need to think about a ranking criterion.
