
Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users’ Questions - BACKGROUND

Hogan, A., Dong, X.L., Vrandečić, D., & Weikum, G. (2025). Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users' Questions.

A BACKGROUND

A.1 Large Language Models

LLMs capture contextual probabilities of tokens in the parameters of a large neural network, often following the Transformer architecture [44]. The model parameters are computed by two stages of
training: unsupervised pre-training and supervised fine-tuning. LLMs can also benefit from inference-time (i.e., post-training) techniques, most notably, prompt engineering [26] and in-context learning [9].

The usual training objective is to predict the next token in a text sequence, repeatedly in an auto-regressive manner. As the original text is available, the ground-truth is known and this entire training
process is completely unsupervised (or self-supervised, as it is sometimes phrased).
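This self-supervised setup can be sketched as follows: the raw text alone yields (context, next-token) training pairs, with no annotation required. Tokenization here is naive whitespace splitting, an illustrative assumption rather than what production LLMs use.

```python
# Derive self-supervised next-token training pairs from raw text:
# the corpus itself supplies the labels, so no human annotation is needed.
def make_training_pairs(text: str):
    tokens = text.split()  # naive tokenization, for illustration only
    # Each prefix becomes an input; the following token is its
    # ground-truth label, known directly from the text.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_training_pairs("the cat sat on the mat")
# e.g. pairs[2] is (["the", "cat", "sat"], "on")
```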

Fine-tuning adopts a pre-trained LLM as a foundational model and adapts it for a suite of specific tasks (such as question answering, chatbot dialog, summarization etc.) by adapting its internal parameters. Supervised fine-tuning (SFT) provides the model with labeled examples of desired input–output pairs specific to the task.
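A minimal sketch of the labeled data that SFT consumes: task-specific input–output pairs rendered into a simple prompt–response template. The example pairs and the template are illustrative placeholders, not a particular system's format.

```python
# Illustrative SFT examples: labeled input–output pairs for specific tasks.
sft_examples = [
    {"input": "Who proposed the Transformer architecture?",
     "output": "Vaswani et al., in the 2017 paper 'Attention Is All You Need'."},
    {"input": "Summarize: 'LLMs predict tokens.'",
     "output": "LLMs are token predictors."},
]

def to_training_text(example: dict) -> str:
    # Concatenate input and target in a simple prompt–response template
    # (the template markers are an assumption for illustration).
    return f"### Input:\n{example['input']}\n### Output:\n{example['output']}"
```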

Unlike pre-training, this stage critically depends on labeled data, typically with human annotators/contributors in the loop. Another related technique is reinforcement learning from human feedback (RLHF), whereby a set of ranked preferences, collected from human users over alternative outputs to the same model input, is used to train a separate reward model to predict such preferences; the language model is then fine-tuned against the reward model to better fit its predicted human preferences.

In-context learning enables the model to learn from the user’s input at inference time. A key technique is prompt engineering [26]: phrasing the prompt(s) issued to the model in order to elicit a more favorable response. This can take the form of giving background context about the user or the task at hand, or giving specific instructions to the system. A common technique is to include examples of desired inputs and outputs in the prompt, enabling few-shot learning [9]. Another technique, called chain-of-thought [50], breaks down a complex request into a sequence of intermediate steps. In some use-cases, these techniques lead to impressive behavior, but they are also brittle in the sense that slightly different prompts can lead to inconsistent outputs.
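These prompting techniques can be combined in one template: a few-shot prompt embeds worked input–output examples, and a chain-of-thought cue asks the model to reason in intermediate steps. The wording and examples below are illustrative assumptions, not a canonical prompt.

```python
# Build a few-shot prompt with a chain-of-thought cue.
def build_prompt(examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n\n".join(
        f"Q: {q}\nA: Let's think step by step. {a}" for q, a in examples
    )
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."

prompt = build_prompt(
    [("What is 2 + 3?", "2 plus 3 is 5. The answer is 5.")],
    "What is 4 + 7?",
)
```

Slightly different templates can elicit very different completions, which is exactly the brittleness noted above.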

Retrieval-augmented generation combines LLMs with search engines or other information-retrieval technologies to enable the ingestion of fresh Web contents for in-context learning, thus helping to mitigate the effects of stale and hallucinated outputs.
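The RAG loop can be sketched in a few lines: retrieve the documents most relevant to the question and prepend them to the prompt, so the model answers from fresh evidence rather than stale parameters. Retrieval here is naive word overlap; real systems use a search engine or a dense-vector index.

```python
# Minimal RAG sketch: retrieve, then augment the prompt with context.
def retrieve(corpus: list[str], question: str, k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    # Score by word overlap with the question (a toy relevance measure).
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def rag_prompt(corpus: list[str], question: str) -> str:
    context = "\n".join(retrieve(corpus, question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = ["Paris is the capital of France.",
        "The Transformer architecture uses attention.",
        "France borders Spain."]
p = rag_prompt(docs, "What is the capital of France?")
```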

A.2 Search Engines

Search Engines (SEs) are powered by Information Retrieval (IR) techniques [3], classically geared towards matching keywords in documents, but the last decade has brought many innovations [28], improving user experience in finding answers.

Collection involves the acquisition and continuous updating of a corpus of web-pages over which users can search. This is largely based on crawling the Web, but also involves subscribing to feeds (e.g., news, social media) and importing information from databases (e.g., product catalogs).

Indexing organizes tokens, words and phrases of documents into inverted-index lists: highly optimized data structures that enable efficiently retrieving documents that match a search request. In addition
to indexing surface tokens, the SE also prepares for similarity search, aka soft matching, by computing an embedding vector for each document (or part of a document).
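An inverted index can be sketched as a mapping from each token to the set of documents containing it, so that a conjunctive keyword query is answered by intersecting posting lists. Tokenization is naive lowercased splitting, an assumption for illustration.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict:
    # Map each token to the set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index: dict, *keywords: str) -> set:
    # Intersect the posting lists of all query keywords.
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()

docs = {1: "knowledge graphs and search",
        2: "search engines rank pages",
        3: "graphs of knowledge"}
index = build_index(docs)
hits = search(index, "knowledge", "graphs")  # documents 1 and 3
```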

Query understanding is a key component of modern SEs, which boast a sophisticated suite of techniques for interpreting and enriching users’ queries [5], ideally inferring the user intent rather than staying at the string-matching level. This includes techniques for auto-completion, suggestions for reformulations and topically related queries (e.g., based on query-and-click logs), contextualization with user history and other situative context (e.g., location, time of day, etc.) [23].
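One of these techniques, auto-completion, can be sketched as ranking past queries (e.g., from a query-and-click log) that extend the user's partial input by their frequency. The toy log below is an illustrative assumption.

```python
from collections import Counter

def autocomplete(query_log: Counter, prefix: str, k: int = 3) -> list[str]:
    # Suggest the most frequent logged queries extending the prefix.
    matches = [q for q in query_log if q.startswith(prefix)]
    return sorted(matches, key=lambda q: -query_log[q])[:k]

log = Counter({"knowledge graphs": 50, "knowledge panels": 20,
               "llm training": 90})
suggestions = autocomplete(log, "knowledge")
```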

Matching involves locating occurrences of the user's keywords and short phrases as terms in the index, and retrieving all exact matches or possibly partial matches for subsequent ranking. When users pose full-fledged questions or give other kinds of longer inputs, this keyword-matching paradigm falls short.

Ranking scores matching documents in terms of their likelihood to be of interest to the user. This is based on various factors such as relevance to the search (as determined during the matching phase), prominence, recency, click-through rates, and others.
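The combination of factors can be sketched as a weighted score per candidate document; the specific weights and signal values below are illustrative assumptions, not a real engine's formula.

```python
# Rank matching documents by a weighted combination of signals.
def rank(candidates: list[dict]) -> list[dict]:
    def score(doc: dict) -> float:
        return (0.5 * doc["relevance"]    # from the matching phase
                + 0.2 * doc["prominence"]
                + 0.15 * doc["recency"]
                + 0.15 * doc["ctr"])      # click-through rate
    return sorted(candidates, key=score, reverse=True)

results = rank([
    {"url": "a", "relevance": 0.9, "prominence": 0.2, "recency": 0.5, "ctr": 0.1},
    {"url": "b", "relevance": 0.6, "prominence": 0.9, "recency": 0.9, "ctr": 0.8},
])
```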

Result presentation involves the layout of relevant documents, and other related information, retrieved by the engine. The classical SE result page (SERP) is an ordered list of ten blue links, each with a URL and a brief preview snippet. As leading SEs often expand a user query into several vertical-domain searches, including product or event search, sponsored ads and also KG excerpts (so-called knowledge panels), the modern presentation is more elaborate and faceted.

A.3 Knowledge Graphs

The modern notion of Knowledge Graphs (KGs) arose in the context of improving search engines [14]. In 2012, the Google Knowledge Graph was proposed to boost search by “things, not strings” [40]:

This requires a structured repository of real-world entities, their types (e.g., computer scientist), their attributes (e.g., birth date), and the relations between entities (e.g., Blum has birthplace Caracas).
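The structured repository described above can be sketched as a set of subject–predicate–object triples covering entities, their types, their attributes, and their relations; the facts used here are the examples from the text.

```python
# A tiny knowledge graph as a set of (subject, predicate, object) triples.
kg = {
    ("Blum", "type", "computer scientist"),
    ("Blum", "birthplace", "Caracas"),
    ("Caracas", "type", "city"),
}

def objects(subject: str, predicate: str) -> set:
    # Look up all objects linked from a subject via a given predicate.
    return {o for s, p, o in kg if s == subject and p == predicate}
```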

Construction involves the initial creation of a KG from base sources of data. Initially this can involve a mix of information extraction (for text and other unstructured sources) and mapping (for structured sources) techniques. Key to constructing a high-quality KG is the disambiguation process, which assigns unique identifiers to distinct entities and relation types in the presence of synonyms (different strings, same entity) and homonyms (same string, different entities).

Completion aims to “fill in the gaps” inevitably left in KGs by integrating data from diverse sources. We consider two paradigms for KG completion: deductive and inductive. Deductive reasoning enables automated inference of new edges in the knowledge graph when complemented by rules or ontologies. ... On the other hand, inductive reasoning does not require the specification of rules/ontologies, but rather learns abstract patterns from the KG and applies these patterns for completion. Knowledge graph embeddings [49] learn tensor representations for nodes and edges in the graph that can be used for link prediction, e.g., to predict citizenships based on other available information in the knowledge graph.
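Link prediction with embeddings can be illustrated with a TransE-style score: entities and relations become vectors, and a candidate edge (h, r, t) is plausible when h + r lies close to t. The tiny hand-set embeddings below are assumptions for illustration; real embeddings are learned from the graph.

```python
# TransE-style plausibility score: negative L1 distance between h + r and t.
def score(h: tuple, r: tuple, t: tuple) -> float:
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

# Hand-set toy embeddings (illustrative assumption, not learned values).
emb = {
    "Blum":       (0.9, 0.1),
    "Caracas":    (1.0, 1.0),
    "Berlin":     (0.0, 0.2),
    "birthplace": (0.1, 0.9),
}

# Higher score means a more plausible edge.
plausible = score(emb["Blum"], emb["birthplace"], emb["Caracas"])
implausible = score(emb["Blum"], emb["birthplace"], emb["Berlin"])
```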

Refinement involves the detection and resolution of issues affecting a KG’s quality (i.e., fitness for purpose). While similar methods as used for completion can be adapted for refinement – e.g.,
rules/ontologies can be used to identify logical inconsistencies in a KG, while embeddings can be used to identify edges with low plausibility – dedicated methods also exist for refinement. A key such technique is to leverage shapes, which encode constraints applicable to the KG that can validate and improve its quality [13].
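Shape-based validation can be sketched as follows: a shape declares constraints that nodes of a target type must satisfy (here, every person must have exactly one birthplace). Realistic KGs use languages such as SHACL or ShEx; this simplified checker is an assumption for illustration.

```python
# Validate a KG of (subject, predicate, object) triples against one shape.
def validate(kg: set, shape: dict) -> list:
    violations = []
    targets = {s for s, p, o in kg
               if p == "type" and o == shape["target_type"]}
    for node in targets:
        values = [o for s, p, o in kg
                  if s == node and p == shape["property"]]
        # Cardinality constraint: between min and max values allowed.
        if not (shape["min"] <= len(values) <= shape["max"]):
            violations.append(node)
    return violations

kg = {
    ("Blum", "type", "person"),
    ("Blum", "birthplace", "Caracas"),
    ("Turing", "type", "person"),  # missing a birthplace edge
}
shape = {"target_type": "person", "property": "birthplace",
         "min": 1, "max": 1}
bad = validate(kg, shape)  # flags the node violating the shape
```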

Search & querying involves extracting information from the KG relevant for a particular user or a particular task. KGs often contain text, which can support keyword search in a manner similar to SEs. However, the structure of KGs further permits evaluating database-style queries expressed in graph query languages, which include many features from relational databases (joins, aggregation, etc.) as well as querying arbitrary-length paths.
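The path-querying feature that distinguishes graph query languages from plain keyword search can be sketched as a reachability computation: find all entities reachable from a start node via any number of edges with a given label, analogous to a property path such as `locatedIn+` in SPARQL. The toy edges are illustrative assumptions.

```python
from collections import deque

# All nodes reachable from `start` via one or more edges labeled `label`.
def reachable(edges: set, start: str, label: str) -> set:
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in edges:
            if s == node and p == label and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

edges = {
    ("Caracas", "locatedIn", "Venezuela"),
    ("Venezuela", "locatedIn", "South America"),
}
places = reachable(edges, "Caracas", "locatedIn")
```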
