Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users’ Questions - BACKGROUND
Hogan, A., Dong, X.L., Vrandečić, D., & Weikum, G. (2025). Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users' Questions.
A BACKGROUND
A.1 Large Language Models
LLMs capture contextual probabilities of tokens in the parameters of a large neural network, often following the Transformer architecture [44]. The model parameters are computed by two stages of
training: unsupervised pre-training and supervised fine-tuning. LLMs can also benefit from inference-time (i.e., post-training) techniques, most notably, prompt engineering [26] and in-context learning [9].
The usual training objective is to predict the next token in a text sequence, repeatedly in an auto-regressive manner. As the original text is available, the ground-truth is known and this entire training
process is completely unsupervised (or self-supervised, as it is sometimes phrased).
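To make this objective concrete, the following Python sketch computes the autoregressive next-token log-likelihood of a short text; a toy count-based bigram model stands in for a real Transformer, an assumption made purely for brevity.

# Minimal sketch of the self-supervised next-token objective; a toy bigram
# model plays the role of the LLM, and the ground-truth next token always
# comes from the text itself (no labels required).
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# "Pre-training": estimate P(next | current) directly from the raw text.
bigram_counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    bigram_counts[cur][nxt] += 1

def next_token_prob(cur: str, nxt: str) -> float:
    counts = bigram_counts[cur]
    return counts[nxt] / sum(counts.values()) if counts else 0.0

# The training objective minimises the negative log-likelihood of each
# observed next token given its prefix, summed over the sequence.
nll = -sum(math.log(next_token_prob(cur, nxt) or 1e-9)
           for cur, nxt in zip(corpus, corpus[1:]))
print(f"negative log-likelihood: {nll:.2f}")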
Fine-tuning takes a pre-trained LLM as a foundation model and adapts it to a suite of specific tasks (such as question answering, chatbot dialogue, summarization, etc.) by updating its internal parameters. Supervised fine-tuning (SFT) provides the model with labeled examples of desired input–output pairs specific to the task.
Unlike pre-training, this stage critically depends on labeled data, typically with human annotators or contributors in the loop. A related technique is reinforcement learning from human feedback (RLHF), whereby ranked preferences collected from human users over alternative outputs to the same model input are used to train a separate reward model that predicts such preferences; the language model is then fine-tuned against the reward model's signal so that its outputs better fit the predicted human preferences.
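The reward-model step of RLHF can be illustrated with a standard pairwise preference loss; the sketch below uses hypothetical scalar rewards, whereas a real reward model would be a neural network scoring (prompt, response) pairs.

# Minimal sketch of a pairwise (Bradley-Terry style) preference loss for a
# reward model: the human-preferred output should receive the higher reward,
# so we minimise -log sigmoid(r_chosen - r_rejected). Scores are hypothetical.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The preferred answer is currently scored lower than the rejected one, so the
# loss is large and training would push the two scores apart.
print(preference_loss(reward_chosen=0.2, reward_rejected=1.1))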
In-context learning enables the model to learn from the user’s input at inference time. A key technique is prompt engineering [26]: phrasing the prompt(s) issued to the model in order to elicit a more favorable response. This can take the form of giving background context about the user or the task at hand, or giving specific instructions to the system. A common technique is to include examples of desired inputs and outputs in the prompt, enabling few-shot learning [9]. Another technique, called chain-of-thought [50], breaks down a complex request into a sequence of intermediate steps. In some use-cases, these techniques lead to impressive behavior, but they are also brittle in the sense that slightly different prompts can lead to inconsistent outputs.
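As a small illustration of prompt engineering with few-shot examples, the sketch below builds a prompt that combines an instruction, in-context demonstrations, and the actual request; the examples and wording are hypothetical.

# Minimal sketch of a few-shot prompt: demonstrations are placed directly in
# the input, so the model adapts at inference time without parameter updates.
few_shot_examples = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
]

def build_prompt(task_input: str) -> str:
    lines = ["You are a helpful translator. Think step by step."]  # instruction (with a chain-of-thought cue)
    for question, answer in few_shot_examples:
        lines.append(f"Q: {question}\nA: {answer}")                # few-shot demonstrations
    lines.append(f"Q: {task_input}\nA:")                           # the actual request
    return "\n\n".join(lines)

print(build_prompt("Translate to French: water"))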
Retrieval-augmented generation (RAG) combines LLMs with search engines or other information-retrieval technologies to enable the ingestion of fresh Web content for in-context learning, thus helping to mitigate stale knowledge and hallucinated outputs.
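A minimal retrieval-augmented generation loop can be sketched as follows; search and llm_complete are hypothetical stand-ins for a retriever and an LLM, and the toy corpus replaces a live Web index.

# Minimal RAG sketch: retrieved passages are injected into the prompt at
# inference time, so the model can ground its answer in fresh content.
def answer_with_rag(question: str, search, llm_complete, k: int = 3) -> str:
    passages = search(question)[:k]                      # top-k documents from an IR system
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = ("Answer the question using only the passages below.\n"
              f"{context}\n\nQuestion: {question}\nAnswer:")
    return llm_complete(prompt)

# Toy stand-ins so the sketch runs end to end.
toy_corpus = ["Caracas is the capital of Venezuela.", "Blum was born in Caracas."]
print(answer_with_rag("Where was Blum born?",
                      search=lambda q: toy_corpus,
                      llm_complete=lambda p: "(model output would appear here)"))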
A.2 Search Engines
Search Engines (SEs) are powered by Information Retrieval (IR) techniques [3], classically geared towards matching keywords in documents, but the last decade has brought many innovations [28], improving user experience in finding answers.
Collection involves the acquisition and continuous updating of a corpus of web-pages over which users can search. This is largely based on crawling the Web, but also involves subscribing to feeds (e.g., news, social media) and importing information from databases (e.g., product catalogs).
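The core of the crawling process can be sketched as a simple frontier queue; fetch and extract_links are hypothetical helpers, and real crawlers additionally respect robots.txt, politeness delays, and recrawl schedules to keep the corpus fresh.

# Minimal sketch of a breadth-first crawl frontier over a bounded corpus.
from collections import deque

def crawl(seed_urls, fetch, extract_links, limit: int = 100) -> dict:
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    corpus = {}
    while frontier and len(corpus) < limit:
        url = frontier.popleft()
        page = fetch(url)                 # download (or refresh) the page
        corpus[url] = page
        for link in extract_links(page):  # discover new pages to visit
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus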
Indexing organizes tokens, words and phrases of documents into inverted-index lists: highly optimized data structures that enable efficiently retrieving documents that match a search request. In addition
to indexing surface tokens, the SE also prepares for similarity search, aka soft matching, by computing an embedding vector for each document (or part of a document).
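The following sketch builds a toy inverted index; real indexes additionally store positions and frequencies, are heavily compressed, and would be complemented by a dense embedding per document for soft matching.

# Minimal inverted index: each token maps to the set of documents containing it.
from collections import defaultdict

docs = {
    0: "knowledge graphs power entity search",
    1: "large language models generate text",
    2: "search engines rank matching documents",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        inverted_index[token].add(doc_id)

print(sorted(inverted_index["search"]))   # -> [0, 2]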
Query understanding is a key component of modern SEs, which boast a sophisticated suite of techniques for interpreting and enriching users’ queries [5], ideally inferring the user intent rather than staying at the string-matching level. This includes techniques for auto-completion, suggestions for reformulations and topically related queries (e.g., based on query-and-click logs), contextualization with user history and other situative context (e.g., location, time of day, etc.) [23].
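As one small example of such techniques, auto-completion over a query log can be sketched as a frequency-ranked prefix lookup; the log entries and counts below are hypothetical.

# Minimal auto-completion sketch: suggest the most frequent past queries that
# extend the user's current prefix.
from collections import Counter

query_log = Counter({
    "knowledge graph": 120,
    "knowledge graph embeddings": 45,
    "large language models": 300,
})

def autocomplete(prefix: str, k: int = 2) -> list:
    hits = [(q, n) for q, n in query_log.items() if q.startswith(prefix)]
    return [q for q, _ in sorted(hits, key=lambda x: -x[1])[:k]]

print(autocomplete("knowledge"))   # -> ['knowledge graph', 'knowledge graph embeddings']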
Matching involves locating occurrences of the user's keywords and short phrases as terms in the index, and retrieving all exact matches, or possibly partial matches, for subsequent ranking. When users pose full-fledged questions or provide other kinds of longer inputs, this keyword-matching paradigm breaks down.
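The matching step can be sketched as set operations over the postings of an inverted index like the one above; the toy index below is hypothetical.

# Minimal matching sketch: documents containing all query terms are exact
# matches; documents containing only some terms are partial matches handed on
# to ranking.
def match(query: str, index: dict) -> tuple:
    postings = [index.get(token, set()) for token in query.split()]
    exact = set.intersection(*postings) if postings else set()
    partial = set.union(*postings) - exact if postings else set()
    return exact, partial

toy_index = {"search": {0, 2}, "documents": {2}, "models": {1}}
print(match("search documents", toy_index))   # -> ({2}, {0})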
Ranking scores matching documents in terms of their likelihood to be of interest to the user. This is based on various factors such as relevance to the search (as determined during the matching phase), prominence, recency, click-through rates, and others.
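Such multi-factor scoring can be sketched as a weighted combination of signals; the weights and signal values below are hypothetical, and production systems typically learn the combination (learning to rank) rather than hand-picking it.

# Minimal ranking sketch: combine several per-document signals into one score
# and sort the matched candidates by it.
def score(doc: dict) -> float:
    return (0.6 * doc["relevance"]           # match quality from the matching phase
            + 0.2 * doc["prominence"]        # e.g., link-based authority
            + 0.1 * doc["recency"]
            + 0.1 * doc["click_through_rate"])

candidates = [
    {"url": "a.example", "relevance": 0.9, "prominence": 0.4, "recency": 0.2, "click_through_rate": 0.5},
    {"url": "b.example", "relevance": 0.7, "prominence": 0.9, "recency": 0.8, "click_through_rate": 0.6},
]

# b.example ranks first here despite lower relevance, due to the other signals.
for doc in sorted(candidates, key=score, reverse=True):
    print(doc["url"], round(score(doc), 3))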
Result presentation involves the layout of relevant documents, and other related information, retrieved by the engine. The classical SE result page (SERP) is an ordered list of ten blue links, each with a URL and a brief preview snippet. As leading SEs often expand a user query into several vertical-domain searches, including product or event search, sponsored ads and also KG excerpts (so-called knowledge panels), the modern presentation is more elaborate and faceted.
A.3 Knowledge Graphs
The modern notion of Knowledge Graphs (KGs) arose in the context of improving search engines [14]. In 2012, the Google Knowledge Graph was introduced to boost search with “things, not strings” [40].
This requires a structured repository of real-world entities, their types (e.g., computer scientist), their attributes (e.g., birth date), and the relations between entities (e.g., Blum has birthplace Caracas).
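Such a repository is commonly modelled as a set of subject-predicate-object triples; the sketch below uses human-readable strings and an illustrative birth year, whereas real KGs use opaque identifiers (e.g., IRIs) for entities and relations.

# Minimal KG sketch as a set of (subject, predicate, object) triples.
triples = {
    ("Blum", "type", "computer scientist"),
    ("Blum", "birth year", "1938"),            # attribute (illustrative value)
    ("Blum", "birthplace", "Caracas"),         # relation between two entities
    ("Caracas", "capital of", "Venezuela"),
}

# All facts stored about the entity "Blum".
print({(p, o) for (s, p, o) in triples if s == "Blum"})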
Construction involves the initial creation of a KG from base sources of data. This can involve a mix of information extraction techniques (for text and other unstructured sources) and mapping techniques (for structured sources). Key to constructing a high-quality KG is the disambiguation process, which assigns unique identifiers to distinct entities and relation types in the presence of synonyms (different strings referring to the same entity) and homonyms (the same string referring to different entities).
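Disambiguation can be pictured as a mapping from surface strings to entity identifiers; the identifiers below (E1, E2, E3) are hypothetical placeholders.

# Minimal disambiguation sketch: synonyms map to one shared identifier, while
# a homonym maps to several candidates that must be resolved from context.
entity_links = {
    "Manuel Blum": ["E1"],
    "M. Blum":     ["E1"],          # synonym: different string, same entity
    "Jaguar":      ["E2", "E3"],    # homonym: same string, animal vs. car maker
}

print(entity_links["Jaguar"])       # ambiguous: two candidate entities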
Completion aims to “fill in the gaps” that are inevitably left in KGs built by integrating data from diverse sources. We consider two paradigms for KG completion: deductive and inductive. Deductive reasoning enables the automated inference of new edges in the knowledge graph when it is complemented by rules or ontologies. Inductive reasoning, on the other hand, does not require the specification of rules/ontologies, but rather learns abstract patterns from the KG and applies these patterns for completion. Knowledge graph embeddings [49] learn tensor representations for nodes and edges in the graph that can be used for link prediction, e.g., to predict citizenships based on other available information in the knowledge graph.
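Both paradigms can be illustrated on toy data: a hand-written rule infers a new edge deductively, while a TransE-style embedding score rates the plausibility of a missing edge inductively; all vectors and facts below are toy values, not learned embeddings.

# Deductive completion: birthplace(x, y) AND city_in(y, z) => born_in_country(x, z).
triples = {("Blum", "birthplace", "Caracas"), ("Caracas", "city in", "Venezuela")}
inferred = {(x, "born in country", z)
            for (x, p1, y1) in triples if p1 == "birthplace"
            for (y2, p2, z) in triples if p2 == "city in" and y1 == y2}
print(inferred)   # -> {('Blum', 'born in country', 'Venezuela')}

# Inductive completion (TransE intuition): a triple (h, r, t) is plausible when
# the vector h + r lies close to t; scores closer to 0 indicate higher plausibility.
emb = {"Blum": (0.1, 0.2), "Venezuela": (0.9, 0.8), "citizen of": (0.8, 0.6)}

def transe_score(h: str, r: str, t: str) -> float:
    return -sum(abs(emb[h][i] + emb[r][i] - emb[t][i]) for i in range(2))

print(transe_score("Blum", "citizen of", "Venezuela"))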
Refinement involves the detection and resolution of issues affecting a KG's quality (i.e., its fitness for purpose). While methods similar to those used for completion can be adapted for refinement – e.g., rules/ontologies can be used to identify logical inconsistencies in a KG, while embeddings can be used to identify edges with low plausibility – dedicated methods also exist for refinement. A key such technique is to leverage shapes, which encode constraints applicable to the KG that can be used to validate and improve its quality [13].
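A shape can be thought of as a constraint attached to a class of nodes; the sketch below hard-codes one hypothetical constraint (every Person must have exactly one birthplace), whereas languages such as SHACL or ShEx express shapes declaratively.

# Minimal shape-validation sketch over a toy KG.
triples = [
    ("Blum", "type", "Person"), ("Blum", "birthplace", "Caracas"),
    ("Turing", "type", "Person"),           # violates the shape: no birthplace
]

def validate_person_shape(triples: list) -> list:
    persons = {s for (s, p, o) in triples if p == "type" and o == "Person"}
    violations = []
    for person in persons:
        birthplaces = [o for (s, p, o) in triples if s == person and p == "birthplace"]
        if len(birthplaces) != 1:
            violations.append((person, "expected exactly one birthplace", len(birthplaces)))
    return violations

print(validate_person_shape(triples))   # -> [('Turing', 'expected exactly one birthplace', 0)]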
Search & querying involves extracting information from the KG relevant for a particular user or a particular task. KGs often contain text, over which search can be supported in a manner similar to SEs. However, the structure of KGs further permits the evaluation of database-style queries expressed in graph query languages, which offer many features found in relational databases (joins, aggregation, etc.) as well as the ability to query paths of arbitrary length.
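The kind of query such languages express can be sketched as a join of triple patterns; the toy triples below are illustrative, and a graph query language such as SPARQL or Cypher would state the same pattern declaratively.

# Minimal graph-query sketch: join two triple patterns,
#   ?person --birthplace--> ?city --capital of--> ?country
triples = {
    ("Blum", "birthplace", "Caracas"),
    ("Caracas", "capital of", "Venezuela"),
    ("Ada", "birthplace", "London"),
    ("London", "capital of", "UK"),
}

answers = {(person, country)
           for (person, p1, city1) in triples if p1 == "birthplace"
           for (city2, p2, country) in triples if p2 == "capital of" and city1 == city2}
print(sorted(answers))   # -> [('Ada', 'UK'), ('Blum', 'Venezuela')]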