
Automatic Question-Answer Generation for Long-Tail Knowledge - Leitura de Artigo

https://knowledge-nlp.github.io/kdd2023/papers/Kumar5.pdf

https://github.com/isunitha98selvan/odqa-tail

ABSTRACT
Pretrained Large Language Models (LLMs) have gained significant attention for addressing open-domain Question Answering (QA). While they exhibit high accuracy in answering questions related to common knowledge, LLMs encounter difficulties in learning about uncommon long-tail knowledge (tail entities).  

[Entities with little available information, not very popular or common in the general public's interest]

1 INTRODUCTION

However, the impressive achievements of LLMs in QA tasks are primarily observed with regard to common concepts that frequently appear on the internet (referred to as "head entities"), which are thus more likely to be learned effectively by LLMs during pretraining. Conversely, when it comes to long-tail knowledge, which encompasses rarely occurring entities (referred to as "tail entities"), LLMs struggle to provide accurate answers and often exhibit hallucination issues [5]. Due to the predominant focus of most QA datasets on head entities [3, 6, 10], research investigating the performance of LLMs on long-tail knowledge has been limited.

[The long-tail concept and its impact on LLMs. KGs can cover both tail and head entities, and can also represent claims in recurring (default) contexts and in specific contexts]

In this study, we propose a novel approach to defining tail entities based on their degree information in Wikidata, as opposed to [7] relying on Wikipedia. By doing so, we generate QA datasets with distinct distributions from previous works [7], thus fostering diversity within tail-knowledge QA datasets. Within the context of Wikidata, the degrees of entities reflect their level of engagement with general knowledge. Hence, we leverage this degree information to define tail entities.

[Metric for defining tail entities]

 Moreover, we investigate strategies to enhance the performance of pretrained LLMs by incorporating external resources, such as external documents or knowledge graphs, during inference time on our automatically-generated long-tail QA datasets.

[Integrating LLM and KG]

Introduction of novel tail knowledge QA datasets derived from the Wikidata knowledge graph 

[Would this dataset have examples of claims with context?]

2 RELATED WORK

Kandpal et al. [7] show that an LLM’s ability to answer a question is affected by how many times it has
seen relevant documents related to the question in its pre-training data. They show that LLMs struggle to reason accurately over rarer entities in the pre-training data.

In this work, instead of using the pre-training corpus, we define tail entities using Wikidata knowledge
graphs and construct a long-tail knowledge dataset that can be used to study the open-domain QA performance of LLMs.

3 AUTOMATIC GENERATION OF QA DATASETS FOR LONG-TAIL KNOWLEDGE

 

We define tail entities based on each entity's node degree (i.e., the number of triplets that have the target entity as the subject node s) in the knowledge graph. We first sample tail entities based on their degree information and extract from Wikidata all triplets that have the tail entities as the subject entity (proper degree bounds for tail entities are discussed in the following section). Then we generate factoid questions by prompting LLMs with the triplets.
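The degree-based sampling described above can be sketched as follows. This is a minimal sketch over a toy triplet list; the real pipeline reads triplets from a full Wikidata dump, and all entity IDs here are illustrative:

```python
from collections import Counter

# Toy triplet list standing in for Wikidata (subject, property, object).
triplets = [
    ("Q1", "P31", "Q2"),
    ("Q1", "P17", "Q3"),
    ("Q4", "P31", "Q2"),
]

# Node degree = number of triplets with the entity as the subject node.
degree = Counter(s for s, _, _ in triplets)

def entities_with_degree(max_degree):
    """Entities whose subject-degree is at most max_degree."""
    return [e for e, d in degree.items() if d <= max_degree]

def triplets_for(entities):
    """All triplets whose subject is one of the given entities."""
    keep = set(entities)
    return [t for t in triplets if t[0] in keep]

tail = entities_with_degree(1)
print(triplets_for(tail))  # [('Q4', 'P31', 'Q2')]
```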

Prompt

3.2.1 Degree bounds for tail entities. There are no strictly formulated, widely accepted definitions of tail entities. Degree bounds that immediately translate into differences in model performance are also hard to decide in advance. As a result, degree bounds for tail entities must be chosen somewhat arbitrarily. In our experiments, we classify entities with node degrees between 15 and 100 as coarse-tail entities and entities with node degrees below 3 as fine-tail entities, and compare the LLM performance on them.
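The stated bounds can be expressed directly. The band names and cutoffs come from the paper; the helper itself is illustrative:

```python
def tail_class(node_degree):
    """Bucket an entity by the paper's degree bounds:
    fine-tail: degree < 3; coarse-tail: 15 <= degree <= 100."""
    if node_degree < 3:
        return "fine-tail"
    if 15 <= node_degree <= 100:
        return "coarse-tail"
    return None  # outside both bands; not sampled

print(tail_class(2))    # fine-tail
print(tail_class(40))   # coarse-tail
print(tail_class(500))  # None
```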

[Degree does not take into account the qualifiers or references of the statements linked to the subject node]

Ambiguous entities: Multiple entities can have the same surface forms. 

[Differentiate by the Entity's Identity, which would not be the QNode, since that is an artificial key]

Ambiguous properties: In Wikidata, a large number of properties cannot be used to generate sensible questions. For instance, subclass of, instance of, or part of would generate questions that are too vague to answer even for humans.

[Part of, for spatial objects, can be a Location/Locality Context]
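Filtering out such properties could look like this. The blocklist is hypothetical, but the three property IDs are the real Wikidata IDs for the properties named above:

```python
# Properties the paper says yield questions too vague to answer.
AMBIGUOUS_PROPERTIES = {
    "P31",   # instance of
    "P279",  # subclass of
    "P361",  # part of
}

def filter_triplets(triplets):
    """Drop triplets whose property would generate an unanswerable question."""
    return [(s, p, o) for s, p, o in triplets if p not in AMBIGUOUS_PROPERTIES]

sample = [("Q42", "P31", "Q5"), ("Q42", "P800", "Q25169")]
print(filter_triplets(sample))  # [('Q42', 'P800', 'Q25169')]
```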

3.2.3 Difficulty control. Questions generated from different properties can have different levels of difficulty. 

[Number of possible answers]

3.2.4 LLM prompt for question generation. While the answer entity of a triplet is not part of the generated question, we find that the quality of generated questions improves when the complete triplet is provided in the prompt, instead of the first two elements (i.e., subject entity and property). For instance, given a triplet [david peel yates, conflict, world war ii], we get "What conflict was David Peel Yates involved in?" from GPT3 when using just the subject entity and property in prompt. On the contrary, when we use all subject, property, and object entities, the generated question becomes "What conflict did David Peel Yates serve in?".
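The difference can be sketched as follows. The exact GPT3 prompt wording is not given in this excerpt, so the template below is an assumption; the point it illustrates is that the object entity (the answer) is included in the prompt:

```python
def question_prompt(subject, prop, obj):
    # Hypothetical template: prompt with the full triplet (answer included)
    # rather than only the subject entity and property.
    return (
        f"Generate a factoid question whose answer is '{obj}', "
        f"using the fact: [{subject}, {prop}, {obj}].\nQuestion:"
    )

print(question_prompt("david peel yates", "conflict", "world war ii"))
```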

[The LLM needs to know the answer in order to formulate the best question. How could a person in an exploration process do this with little domain knowledge? Only through successive refinements based on what they learn from previous answers]

3.2.5 Granularity of questions. Given a question, there could be several correct answers with different granularity. Unless the question specifies the granularity of the answer (e.g., which country or which city), QA datasets and models could easily pick different granularity of answers. For instance, when asked Where was Lovelyz formed?, a model could answer South Korea while the QA dataset has Seoul (the capital of South Korea) as the correct answer and marks the predicted answer wrong.

4 EVALUATION WITH LLMS AND EXTERNAL RESOURCES

Wikidata: Wikidata knowledge graph consists of 103,305,143 entities and 11,007 properties. We access Wikidata using the Sling tool [17] in a triplet format (subject, property, object).

[They did not use qualifiers or references]

Tail-entity datasets: We sample triplets from Wikidata to create Coarse-tail and Fine-tail datasets. Each dataset has 27,691 triplets and 422 unique properties after the difficulty control (details in Section 3.2.3). One question-answer pair consists of a GPT3-generated question, an answer (i.e., the object entity in the original triplet), and associated aliases for the answer.
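Scoring a prediction against the answer plus its aliases can be sketched as below (a hypothetical helper, not the paper's exact matching code):

```python
def is_correct(prediction, answer, aliases=()):
    """A prediction counts as correct if it matches the gold object
    entity or any of its Wikidata aliases (case-insensitive)."""
    gold = {answer.casefold(), *(a.casefold() for a in aliases)}
    return prediction.strip().casefold() in gold

print(is_correct("WWII", "world war ii", aliases=["WW2", "WWII"]))  # True
print(is_correct("Seoul", "south korea"))                           # False
```

Note that this exact-match style of scoring is what makes the granularity problem from Section 3.2.5 bite: "South Korea" would be marked wrong against a gold answer of "Seoul" unless it appears among the aliases.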

4.4 LLM prompting with DPR and knowledge graphs

Knowledge graphs (KG) have been widely used to augment LLMs [19, 25]. In this section, we examine how external knowledge graphs can cooperate with another external resource, Wikipedia, to improve LLM performance for tail entities. We use Wikidata as our external knowledge graph after removing all triplets used for the QA generation.
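A minimal sketch of such an augmented prompt, assuming the KG triplets and DPR-retrieved Wikipedia passages have already been fetched (the function name and template are illustrative, not the paper's):

```python
def augmented_prompt(question, kg_triplets, passages):
    """Hypothetical inference-time prompt combining two external
    resources: Wikidata triplets about the question entity and
    DPR-retrieved Wikipedia passages."""
    facts = "\n".join(f"- {s} | {p} | {o}" for s, p, o in kg_triplets)
    context = "\n".join(passages)
    return (
        f"Facts:\n{facts}\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = augmented_prompt(
    "What conflict did David Peel Yates serve in?",
    [("david peel yates", "military rank", "general")],
    ["David Peel Yates was a British Army officer ..."],
)
print(prompt)
```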

5 CONCLUSION

Our work highlights the limitations of pre-trained LLMs in handling long-tail knowledge in open-domain Question Answering. To investigate this limitation, we first propose to generate QA datasets specialized for tail entities automatically using degree information from the Wikidata knowledge graph. Our automatic QA generation approach aims to overcome the resource-intensive nature of manual dataset construction, allowing for the creation of diverse long-tail QA datasets.

[They did not use WD qualifiers and references. Possible future work: consider context in the metric for selecting the long-tail entities]

