Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users’ Questions - Article Reading Notes 1
Abstract
In particular, there remain many open questions regarding how best to address the diverse information needs of users, incorporating varying facets and levels of difficulty. This paper introduces a taxonomy of user information needs ...
Discussion from the perspective of users' information needs.
1 INTRODUCTION
These scenarios, and many other observations discussed in the literature, highlight several fundamental limitations of LLMs:
(1) Hallucination: LLMs are vulnerable to “inventing” situations, facts, events, etc., that are not based on reality and the underlying data.
(2) Opaqueness: It is often unclear what sources are involved in an answer, and how they were combined.
(3) Staleness: Due to the computational and energy costs of training LLMs, it is prohibitive to keep them up to date; they are bound to lag behind on many topics.
(4) Incompleteness: By their probabilistic nature, LLMs give “best-effort” replies, but, in many cases, they are unable to enumerate complete lists of answers.
Limitations still found in LLMs, to the point that they cannot be considered a source of TRUTH.
An approach gaining traction is to combine LLMs with other complementary technologies. Specifically, combining the Information Retrieval (IR) techniques that power search engines (SEs) with LLMs gives rise to the idea of Retrieval Augmented Generation (RAG) [11, 24]. Given a prompt, SEs are used to find relevant documents or data excerpts (e.g., from Web tables) that can be passed to the LLM as additional inference-time context, often improving the answer quality. Despite such improvements, per the long-tail example of Figure 2, these limitations sometimes persist even with RAG enabled, due either to being unable to retrieve relevant information, or due to the information retrieved being represented in a manner not conducive to in-context learning.
Combine LLMs with search using the RAG approach.
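The RAG pipeline described above can be sketched in a few lines. This is a toy illustration, not the paper's system: retrieval is approximated by keyword overlap over a tiny in-memory corpus, and the generation step is shown only as prompt assembly for a hypothetical LLM call.

```python
def retrieve(query, corpus, k=1):
    """Rank documents by word overlap with the query (toy stand-in for an SE)."""
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, context_docs):
    """Pass retrieved excerpts to the LLM as inference-time context."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Wikidata is a collaboratively edited knowledge graph.",
    "Retrieval Augmented Generation combines search with language models.",
]
docs = retrieve("What is Retrieval Augmented Generation?", corpus)
prompt = build_prompt("What is Retrieval Augmented Generation?", docs)
```

A real deployment would replace `retrieve` with an actual search engine or dense retriever and send `prompt` to an LLM; the failure modes the paper notes (nothing relevant retrieved, or retrieved text in an unhelpful form) occur precisely in these two steps.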
KGs use graph-based representations in order to structure, integrate, query and reason about diverse collections of data and knowledge.
Another definition of KG, since there is no consensus.
2 SE VS. KG VS. LLM
Dimensions of interest for the analysis.
Correctness refers to the extent to which the information returned is correct. SEs and KGs store explicit representations of their (indexed) contents: full text documents, tables and nodes/edges. LLMs, on the other hand, only capture statistical patterns from the input corpus that can be used to generate text following patterns similar to the input.
LLMs do not produce deterministic results.
Coverage refers to the broadness and comprehensiveness of the content covered. ... SEs and LLMs dominate in terms of the broadness of their coverage, which is delimited by the wider domain of information available in natural language, but there exist domain-specific KGs for verticals like finance, health, energy and more, which can be key assets for applications where data quality is crucial.
Coverage in KGs requires human effort for curation in knowledge engineering.
Completeness relates to the ability to return all relevant information from the base corpus. ....
A gold standard is needed to define what counts as relevant information.
Freshness captures the degree to which the information returned is up-to-date. ... For example, while the Wikidata KG has tens of thousands of users active in the past month, this pales in comparison to the number of users posting about current events on social media, meaning that Wikidata can lag behind, particularly for long-tail events.
Even though updating a KG is computationally cheaper.
Generation encapsulates the ability to derive custom, novel content from the base corpus in order to better address a user’s information need. LLMs are generative by design, allowing text to be recursively generated from a starting prompt. KGs are not generative per se, though ontologies and rules can generate new knowledge via deductive reasoning, while knowledge graph embeddings and graph neural networks can generate new knowledge via inductive reasoning; techniques such as summarization are also generative in nature.
Two ways of generating KNOWLEDGE in KGs; LLMs generate TEXT.
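The deductive route can be made concrete with a minimal forward-chaining sketch over hypothetical triples: a single transitivity rule, locatedIn(x, y) ∧ locatedIn(y, z) ⇒ locatedIn(x, z), is applied to a fixpoint, deriving facts not explicit in the base graph.

```python
def forward_chain(triples):
    """Apply the locatedIn-transitivity rule until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(facts):
            for (s2, p2, o2) in list(facts):
                if p1 == "locatedIn" and p2 == "locatedIn" and o1 == s2:
                    new = (s1, "locatedIn", o2)
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

# Base KG: two explicit triples (illustrative data).
kg = {("Santiago", "locatedIn", "Chile"),
      ("Chile", "locatedIn", "SouthAmerica")}
inferred = forward_chain(kg)
# inferred now also contains ("Santiago", "locatedIn", "SouthAmerica")
```

Production reasoners (e.g., for OWL or Datalog) use far more efficient algorithms, but the output is of the same kind: new graph edges, not free text.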
Transparency speaks of the ability to reproduce or otherwise understand how and why information is returned with respect to the base corpus. KGs are transparent, where the results provided for a query can be explained in terms of their derivation steps and (where available) underlying data provenance.
The ability to be explainable and to indicate the original sources of the information (context).
Coherency refers to the ability to produce logically-consistent responses across repetitions of the same request or invocations of logically-related requests over the same base corpus. Repeating the same request to SEs and KGs will typically yield deterministic results, ..... Conversely, LLMs are often made non-deterministic by design, which enables more diverse – but less reproducible – responses.
How can non-deterministic answers be tested and validated?
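Why repeated LLM calls can diverge is easy to see with a toy next-token model (the probabilities below are made up): greedy decoding is deterministic, while temperature-style sampling is not, which is exactly the coherency trade-off the paper describes.

```python
import random

# Hypothetical next-token distribution for a one-step "model".
NEXT = {"the": [("cat", 0.6), ("dog", 0.4)]}

def greedy(token):
    """Deterministic decoding: always pick the most probable continuation."""
    return max(NEXT[token], key=lambda pair: pair[1])[0]

def sample(token, rng):
    """Stochastic decoding: draw a continuation proportionally to probability."""
    words, probs = zip(*NEXT[token])
    return rng.choices(words, weights=probs, k=1)[0]

# greedy("the") returns the same word on every call; sample("the", rng)
# with an unseeded rng may return either word across repetitions.
```

Fixing the random seed (or using greedy decoding) restores reproducibility at the cost of the response diversity that sampling provides.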
Fairness refers to returning unbiased information. SEs, KGs and LLMs can all suffer from bias. .... LLMs can then regurgitate and even amplify biases, prejudices and other harmful tropes found in their training corpora.
All of them can carry bias; it depends on the source and on who the knowledge engineer is.
Usability references the ease with which users can request the information they need. SEs and LLMs are based on natural language, which greatly boosts their usability; they “speak” our language. Conversely, to exploit the full power of KGs, users must write structured queries in a formal language: a high barrier for many. Thus, while users often interact directly with SEs and LLMs, users most often interact indirectly with KGs via end-user applications. On the positive side, natural language interfaces that map user questions (e.g., “Which Turing Award winners were born in Latin America?”) into structured queries (e.g., Cypher, SPARQL, etc.) are being advanced to improve KG usability.
Natural Language Interfaces for Databases
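The mapping step can be sketched as a template fill: the question pattern "Which winners of AWARD were born in REGION?" becomes a SPARQL query. This is a toy sketch; real NL-to-SPARQL systems do full semantic parsing, and the Wikidata identifiers in the comments (P166 = award received, P19 = place of birth, P131 = located in) are my assumption and should be verified against Wikidata.

```python
# Fixed SPARQL template for the "award winners born in region" question shape.
TEMPLATE = """SELECT ?person WHERE {{
  ?person wdt:P166 wd:{award} .   # award received (assumed property ID)
  ?person wdt:P19 ?place .        # place of birth (assumed property ID)
  ?place wdt:P131* wd:{region} .  # located in, transitively (assumed)
}}"""

def question_to_sparql(award_qid, region_qid):
    """Fill the template with entity identifiers (toy mapping step)."""
    return TEMPLATE.format(award=award_qid, region=region_qid)

# Hypothetical entity IDs for Turing Award and Latin America.
query = question_to_sparql("Q185667", "Q12585")
```

A full system would first link "Turing Award" and "Latin America" to their KG identifiers (entity linking) before filling the template; that linking step is where much of the difficulty lies.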
Expressivity refers to the ability to express and respond to potentially complex requests.
Multilingualism captures the ability to seamlessly handle base information, requests and responses in different languages with comparable performance. SEs and LLMs can provide divergent responses for analogous requests in different languages..... Conversely, KGs apply a graph-based abstraction of data and knowledge that is largely agnostic to natural language, though nodes and edges need to be (often manually) associated with human-readable names and descriptions in different languages in order for the KG to be applicable in a multilingual scenario
Wikidata (WD) in multiple languages.