
Report from Dagstuhl Seminar - Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century

September 2022

Official site -> https://www.dagstuhl.de/22372

General Comments

The definition of knowledge used in the report seems to me tied to an Absolute Truth (rather than a Relative or Dual one):
"If we define knowledge as a set of beliefs that are “(i) true, (ii) certain, [and] (iii) obtained by a reliable process”"

The importance of provenance in KGs for establishing the origin of information and trust
3.6 Human-Centric Knowledge Engineering: Making Knowledge Engineering Trustworthy
5.5 Knowledge Graphs vs. Other Forms of Knowledge Representation

The need for human intervention in the automated KG construction process
3.2.3.3 Human Curation of Automatically Generated Knowledge is Needed for Trust
3.4 Automated Knowledge Graph Construction
3.6 Human-Centric Knowledge Engineering: Making Knowledge Engineering Trustworthy

The presence of bias and controversy in KGs, and the absence of context (or implicit context) as a type of bias
3.9 Social and Technical Biases in Knowledge Graphs
5.6 Bias in Knowledge Graph Systems

Representation of n-ary relations
4.3 Triples are not Enough
4.7 Shifting from a Triple-centric View to a Knowledge Components View in KGs

Different types of knowledge can be represented in KGs, and the report highlights the need for context for some of them.
Table 1 separates Factual knowledge from Arguments, Claims, Common Sense, and Perspectives/Narratives.
Table 2 relates these types of knowledge to some domains.
5.4 Construction with Modalities and Types
5.5 Knowledge Graphs vs. Other Forms of Knowledge Representation

Several points about the integration of Language Models and KGs reminded me of GraphGPT.

Topic 4.10 Modelling Complex Concepts covers the project on smells that Daniel mentioned.

Highlights

3.2.1 An Approach to the History of Knowledge Engineering

Knowledge engineering as a discipline has changed considerably since its initial flowering during the period associated with expert systems development during the nineteen-eighties. If we define knowledge as a set of beliefs that are “(i) true, (ii) certain, [and] (iii) obtained by a reliable process” [2], we can further define knowledge engineering as the discipline of building and maintaining processes that produce knowledge.

3.2.3.1 Manually Authored Knowledge from Subject Matter Experts is Precious 

The digital library community has long argued that manually-created metadata is of vital importance in the creation of robust search resources, and much of the development of the World Wide Web (and continuing on to the Open Linked Data cloud) was informed by that assumption [15]. The effort of designing ontologies, taxonomies, and entity and relationship data has historically depended on expensive, labor-intensive manual effort.

3.2.3.3 Human Curation of Automatically Generated Knowledge is Needed for Trust

While automated systems can produce large knowledge graphs, they are limited in their ability to interpret and contextualize this output (though with the advent of language models this may be changing). Human curation is needed to verify that the knowledge graph production process is accurate. This process of verification is a necessary condition in many applications for users to be able to trust the knowledge and use it effectively. Additionally, human curation can provide insights into the data that automated systems may miss, such as potential ethical implications, biases, and areas for improvement.

3.4 Automated Knowledge Graph Construction

Over the past decade, many methods have been proposed for KGC: human-based collaborative or curated approaches in which experts work together to create and curate knowledge graphs, but also automated approaches, classified broadly into approaches that use a predefined schema for extraction versus open information extraction (IE) [2, 3]. Tasks become increasingly harder (i) with less data available for training, (ii) when relationships are increasingly complicated to extract (binary vs. n-ary relations), and (iii) with the openness of the task: schema-driven vs. open IE.

Open challenges that were proposed:
1. how to automatically construct “higher-order or higher-ary knowledge”, such as scopes, context, degrees of belief, confidence, and how to evaluate these;
2. how to deal with n-to-m relations;
3. how do we integrate LMs in the knowledge engineering pipeline;
4. how to deal with bias, trust and control in LMs as KGs; how to add provenance to statements in LMs;
5. how to deal with explainability of answers from prompts;
6. how to update facts in LLMs as KGs;
7. what types of knowledge representations do we extract.     

3.6 Human-Centric Knowledge Engineering: Making Knowledge Engineering Trustworthy

Do we know how good the data in the knowledge graph is?
Do we know where the data comes from?
Do we know how to audit our data to make it less biased?

Do we know how the data came about? 
Do we know how the data is used?

3.7 Everything is Expensive

The trouble with triples: single triples cannot express complex statements (n-ary statements, but also frames or events), so patterns of triples are required to represent such complex statements. But for users of a triple store, these are atomic statements.
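To make the point concrete, here is a minimal sketch (mine, not from the report) of how a single n-ary statement ends up as a pattern of plain triples around an intermediate node; the rdflib usage, the ex: namespace, and the example sentence are my own choices for illustration.

```python
# Minimal sketch (not from the report): an n-ary statement expressed as a
# pattern of triples around an intermediate "employment" node.
# Assumes the rdflib library; the ex: namespace is invented for illustration.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

# "Alice worked at Acme as CTO from 2015 to 2019" -- a statement with four
# arguments that no single triple can hold, so it becomes six triples.
g.add((EX.emp1, RDF.type, EX.Employment))
g.add((EX.emp1, EX.employee, EX.Alice))
g.add((EX.emp1, EX.employer, EX.Acme))
g.add((EX.emp1, EX.role, Literal("CTO")))
g.add((EX.emp1, EX.startYear, Literal(2015, datatype=XSD.gYear)))
g.add((EX.emp1, EX.endYear, Literal(2019, datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```

To the triple store these are just six independent triples; the fact that they jointly encode one statement lives only in the modelling convention, which is exactly the usability problem described above.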

3.8 Tools and User Experience for KG Engineering

1) An end user might use knowledge graphs to explore knowledge, browse answers to their questions, or develop new ideas. These tasks can be supported by browsers, visualization tools, or tools for textual and faceted search. Key pain points from an end-user perspective are the lack of streamed workflow from high- to micro-level, the lack of user studies, the ambiguity of interface semantics, and issues with compositionality and data quality.

3.9 Social and Technical Biases in Knowledge Graphs

Biases in knowledge graphs may originate in the very design of the knowledge graph, in the source data from which it is created (semi-)automatically, and in the algorithms used to sample, aggregate, and process that data.
These source biases typically appear in expressions, utterances, and text sources, and can carry over into downstream representations such as knowledge graphs and knowledge graph embeddings. Furthermore, we also have to consider a large variety of human biases, e.g. reporting bias, selection bias, confirmation bias, overgeneralization, etc.

Collaboratively built knowledge graphs, such as DBpedia or GeoNames, also exhibit social bias, often arising from the western-centered world view of their main contributors [2]. In addition, some “truths” represented in those knowledge graphs might be considered controversial or opinionated, which underlines the importance of provenance information.

4.1 Organizing Scientific Contributions in the Open Research Knowledge Graph

We argue for representing scholarly contributions in a structured and semantic way as a knowledge graph. The advantage is that information represented in a knowledge graph is readable by machines and humans. As an example, we give an overview of the Open Research Knowledge Graph (ORKG), a service implementing this approach. ... As a result, a scholarly knowledge graph such as the ORKG can be used to give a condensed overview of the state of the art addressing a particular research question, for example as a tabular comparison of contributions according to various characteristics of the approaches.

4.2 dblp as a Knowledge Graph

In its initial release, the dblp knowledge graph forms a simple person-publication graph, consisting (as of October 2022) of more than 3 million person entities, 6.3 million publication entities, and 340 million RDF triples in total. More than 15 million external resource URIs are linked in the data set. Numerous metadata aspects, like journals/conference series or the affiliation of an author, are currently provided only as string literals. 

4.3 Triples are not Enough

Abstract Wikipedia aims to cover the whole breadth of knowledge that is in a usual Wikipedia article. Wikidata cannot comfortably represent the kind of knowledge necessary for the natural language text of such a Wikipedia article. We decided to work with two knowledge representations beyond triples: functions, in order to generate natural language text, and frames, in order to capture n-aries and other complex statements.

4.4 Making Knowledge Graph Embeddings a First Class Citizen

4.5 Knowledge Graph Completion using Embeddings

4.6 Knowledge Engineering for Semantic Web Machine Learning Systems

4.7 Shifting from a Triple-centric View to a Knowledge Components View in KGs

Tool support and partial automation is essential in today’s Knowledge Engineering (KE) practices. This is true both for creating schemas, e.g. ontologies, and corresponding knowledge graphs. It is rarely the case that a single triple in a KG answers a user’s query, rather, users of knowledge intensive systems most often need much more complex knowledge structures.

Hence, it becomes essential that the knowledge engineering process captures all these end-user-relevant levels of granularity, i.e. not only the triple level but also more complex knowledge components. Some previous work on ontology design patterns, and recently conceptual components, points in this direction. However, this has not yet been fully brought into KE methodologies, tools, visualisations, and reasoning methods. Even further, when automating parts of the KE methodologies, such as the population of KGs, there is a need for knowledge extraction not only at the triple level, but at the level of detecting and extracting such complex components, e.g. from natural language text, where many open challenges exist.

4.8 A Normative Knowledge Graph for Verified Identity Applications

The Merit Graph maintains metadata about the provenance of statements about relations and entities, and uses that information to establish access control over data in the graph. This metadata supports verifiable and fine-grained policies that are meant to ensure the trustworthiness of the data, as well as to prevent improper sharing of personal data with third parties. 

The Merit Graph is formally defined in a way that it can be transformed into a set of logical statements which, combined at processing time with rules, can be used to perform automated reasoning about the data in the graph. Rules are managed as part of the schema associated with the graph, through user interfaces used by system administrators to establish policies and provide domain expertise for specific use cases.
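The report does not give the Merit Graph's concrete data model, so the following is only a hypothetical Python sketch of the general mechanism it describes: statements carry provenance metadata, and rules evaluated over that metadata decide what a requester may see. All names (Statement, AccessRule, the example rules) are invented for illustration.

```python
# Hypothetical sketch of provenance-based access control over graph statements.
# None of these names come from the Merit Graph; they only illustrate the idea
# that rules evaluated against provenance metadata gate access to data.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str          # who asserted the statement (provenance)
    verified: bool       # whether the source was verified
    personal_data: bool  # whether it contains personal data

# A rule inspects a statement and the requesting party and returns True to allow.
AccessRule = Callable[[Statement, str], bool]

def allow_verified_only(stmt: Statement, requester: str) -> bool:
    return stmt.verified

def block_personal_data_for_third_parties(stmt: Statement, requester: str) -> bool:
    return not (stmt.personal_data and requester == "third_party")

def visible_statements(graph: List[Statement], requester: str,
                       rules: List[AccessRule]) -> List[Statement]:
    # A statement is visible only if every rule allows it.
    return [s for s in graph if all(rule(s, requester) for rule in rules)]

graph = [
    Statement("alice", "hasDegree", "MSc", source="university_x",
              verified=True, personal_data=False),
    Statement("alice", "homeAddress", "...", source="self_reported",
              verified=False, personal_data=True),
]
rules = [allow_verified_only, block_personal_data_for_third_parties]
print(visible_statements(graph, "third_party", rules))
```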

4.9 Semantic Interoperability at Conceptual Level: Not Easy but Necessary

4.10 Modelling Complex Concepts

The success of AI technologies on standardised benchmark datasets invites us to move towards more difficult and more complex concepts and tasks. The digital humanities domain presents many opportunities for investigating the recognition and modelling of complex concepts thanks to massive digitisation efforts that have made available large and varied datasets, in multiple modalities. My work now specifically highlights the complexities in modelling a concept such as smell, dealing with its representations in various media, and how the temporal dimension of historical and linguistic research forces us to deal with issues such as changing social norms and our colonial history.

[Could this be the project on smells that Daniel collaborates on?]

4.11 KG Magic Requires KE Magic

5.1 Integration of Language Models and Structured Data

This group focused on how Large Language Models (LLMs) can be integrated with or used for structured data.

5.1.3 Open Research Questions
The high-level questions to consider are:

How do we automatically construct a knowledge graph from structured data?
How do we automatically construct mappings from structured data to a knowledge graph?
Do we even need LLMs for this problem?

[GraphGPT]  ... 


Site https://medium.com/@vespinozag/graphgpt-convert-unstructured-natural-language-into-a-knowledge-graph-cccbee19abdf
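On the last open question in 5.1.3 (“Do we even need LLMs for this problem?”), the obvious non-LLM baseline is a hand-written declarative mapping from structured records to triples. The sketch below is my own illustration; the column names, URIs, and mapping table are invented.

```python
# Hypothetical baseline for "structured data -> knowledge graph" without an LLM:
# a hand-written, declarative mapping from column names to predicates.
# Column names and URIs are invented for illustration.
rows = [
    {"name": "Ada Lovelace", "born": "1815", "country": "United Kingdom"},
    {"name": "Grace Hopper", "born": "1906", "country": "United States"},
]

# Declarative mapping: column -> predicate URI.
COLUMN_TO_PREDICATE = {
    "born": "http://example.org/birthYear",
    "country": "http://example.org/citizenOf",
}

def row_to_triples(row: dict) -> list[tuple[str, str, str]]:
    subject = "http://example.org/person/" + row["name"].replace(" ", "_")
    return [(subject, predicate, row[column])
            for column, predicate in COLUMN_TO_PREDICATE.items()
            if column in row]

triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

An LLM would only become interesting where such a mapping cannot be written by hand, for instance when the semantics of the columns are unknown – which is precisely what the open questions above are about.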

5.2 Knowledge Engineering with Language Models and Neural Methods

5.3 Explainability of Knowledge Graph Engineering Pipelines

5.4 Construction with Modalities and Types

 

This working group focused on investigating the gap between what is currently captured in Knowledge Graphs and what information is contained in other sources and modalities. The impetus for this break-out group came from the observation that most current KGs focus on factoid information that is easily captured in triples, such as information about entities and properties.
More freely structured data such as text and images contain information that may be difficult to capture in triple format. Procedural knowledge, or other knowledge that has a clear sequence (e.g. word order in text), does not naturally fit into KGs. Solutions such as the NLP Interchange Format (NIF) have been proposed but lead to bulky modelling.
Information concerning more abstract concepts such as opinions or perspectives is often implicit and has a social and contextual dimension – what is acceptable in one context may not be acceptable in another. This type of information intersects with commonsense knowledge as well as social norms.

 

5.4.3 Next Steps
The discussion in our group was only a first attempt at charting the landscape of the various types of knowledge that can appear in KGs. This effort needs to be continued, especially on:
  • Surveying of cases and the types of knowledge that can or should be relevant for them.
  • Working on one or several “spectra” of expressiveness and other dimensions, for the knowledge that can be represented in KGs.
This work, which could be progressed in a workshop-like setting (especially in order to agree on types and dimensions) and long-term community outreach effort (especially for the surveying), should be eventually presented in a written form that can benefit researchers and practitioners on the longer term – either as a separate paper or part of a wider book on knowledge engineering methodology.

5.5 Knowledge Graphs vs. Other Forms of Knowledge Representation

5.5.1 Discussed Problems
How do knowledge graphs relate to other types of knowledge representation?
What kinds of knowledge are knowledge graphs good at representing?
Will knowledge graphs still be needed given the advancements in large language models?
Are knowledge graphs the best target for knowledge extraction from large language models?

With respect to modalities of knowledge representation, we can identify, for example, the following:

  • Textual: books, literature, rich text, emails
  • Lexicographical: thesauri, lexemes, vocabulary, dictionaries
  • Tabular: CSV, spreadsheets, relational tables
  • Temporal: edit histories, chronologies, stock tickers, temporal databases
  • Graph: (social/transport/biological) networks, knowledge graphs
  • Hierarchical: taxonomies, classifications, XML, JSON
  • Logical: rules, ontologies, first-order logic, frames, scripts, schemas
  • Procedural: code, instructions, workflows, tutorials
  • Multimedia: video, audio, images
  • Diagrammatic: UML, ER, pie charts, Sankey diagrams
  • Numeric: embeddings, language models, matrices
  • Mental: human memory, epigenetic memory
  • Social: word of mouth, gossip, stories, songs, institutional memory

To understand this in more detail, we identify some different types of knowledge:
Factual: expressing declarative statements representing claims of truth (e.g., the capital of Nigeria is Abuja).
Quantified: expressing statements for existential or universally quantified elements (e.g., all countries have a capital).
Contextual: expressing statements that are claimed to be true within a certain context, such as a probability or fuzzy quantification of truth (e.g., a country probably only has one capital); a temporal context (the capital of Nigeria has been Abuja since 1991), etc.
Procedural: expressing ways of doing things, often involving a sequence of actions and their effects (e.g., how to prepare the Nigerian dish Tuwo shinkafa).
Narrative: expressing a series of statements building a model and working within that model to communicate knowledge.
Tacit: implicit knowledge often gained through lived experience; may involve qualia, such as taste, smell, touch, sight (e.g., what Tuwo shinkafa tastes like); socially-acquired knowledge relating to customs, values, etc. (e.g., that it would be strange to eat Tuwo shinkafa with marmalade), and so forth.
Counterfactual: expressing statements of possible world states, representing what would be true under varying circumstances and often including modal terms such as “possibly” (if I would take the bike, I would possibly not be on time).

Moreover, it is nontrivial to represent hypothetical knowledge – such as counterfactual knowledge – for which statements can be equally likely depending on different world states. RDF* does allow for contextualised statements without any truth value assigned to them.

These are quoted triples, which are statements not asserted and thus not evaluated in the knowledge graph. However, keeping track of the epistemic status of a contextualised statement, given situational facts, is not yet supported.
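As a purely hypothetical sketch (mine, not from the report) of what “keeping track of the epistemic status of a contextualised statement” could look like: a quoted triple that is never asserted by itself, wrapped with a context and a truth status that can be revised as situational facts arrive. All class and field names are invented; RDF* itself only provides the quoted triple, not the status tracking.

```python
# Hypothetical sketch: a quoted (non-asserted) triple plus an epistemic wrapper.
# All class and field names are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class QuotedTriple:
    # Quoted triples are referred to, not asserted, so they carry no truth value.
    subject: str
    predicate: str
    obj: str

@dataclass
class ContextualStatement:
    quoted: QuotedTriple
    context: str                    # e.g. "if I take the bike"
    status: Optional[bool] = None   # unknown until situational facts arrive
    confidence: float = 0.5

stmt = ContextualStatement(
    quoted=QuotedTriple("me", "arrivesOnTime", "meeting"),
    context="if I take the bike",
)

# A situational fact arrives (e.g. heavy rain): revise the epistemic status.
stmt.status = False
stmt.confidence = 0.2
print(stmt)
```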

Knowledge graphs can be more transparent and can have clearer provenance as they can contain references, sources, or other ways to establish trust in the knowledge in the graph. Conversely, language models do not currently capture the connection between the weights and the textual sources used to learn these weights in a fine-grained way. Relatedly, knowledge graphs allow for more explainability than large language models. With a symbolic system we can display the involved ground statements, and the inferences that took place, whereas with language models, generating explanations is a very popular and challenging topic of active research [4].
Knowledge graphs also cover the long tail better, and can be more easily extended to cover the long tail. A naive approach to increasing coverage for a language model is to retrain or refine it with more text about the topics to be covered; in a structured knowledge base you can just explicitly add the required structure. Anecdotal experience indicates that if we want to increase coverage of, e.g. different file types, we can either write or search for documents about these file types – and writing a new document may take dozens of minutes if not hours – or we can create a new item in a knowledge base, which may take half a minute.

5.6 Bias in Knowledge Graph Systems

The starting point of this discussion was the overview lecture on social and technical bias in knowledge graphs presented by Harald Sack. Bias is often characterised as a disproportionate weight in favour of or against a person, group, idea, or thing, usually in a way that is considered closed-minded, prejudicial, or unfair, especially one that is preconceived or unreasoned. Biases in Knowledge Graphs (KGs), as well as potential means to address them, are different from those in other AI systems, such as large language models or image classification. KGs store human knowledge about the world in structured format, e.g., triples of facts or graphs of entities and relations, to be processed by AI systems. In the past decade, extensive research efforts have gone into constructing and utilising KGs for tasks in natural language processing, information retrieval, recommender systems, and many more. In contrast to language models and image classification systems, KGs are sparse, i.e. typically only a small number of triples exist per entity. Once constructed, KGs are often considered objective and neutral reference data sources that safeguard the correctness of other systems. In reality this is often not the case, since KGs are created with a specific application context in mind. This has the undesirable effect that biases inherent to KGs may become magnified and spread through KG-based systems (Bias Network Effect).

Data Bias: Bias may already be inherent in the source data from which the KG is created in an automated or semi-automated way. For KGs that are collaboratively created or based on collaboratively collected information, all forms of human biases might already be incorporated. Furthermore, bias can also be introduced by the algorithms used to sample, aggregate, and process that data.

Schema Bias: Bias may be introduced via the chosen ontology as the basis for a KG, or simply be embedded within ontologies. Most times, ontologies are developed in a top-down manner with application needs or certain philosophical paradigms in mind. Typically defined by a group of knowledge engineers in collaboration with domain experts, ontologies consequently (though often implicitly) reflect the worldviews and biases of the development team (human bias and anthropocentric thinking). In addition, the ontology and its modelling often depend on the chosen representation language, i.e. typically a fragment of DL, and not the other way around.

Inferential Bias: Inferential biases in KGs arise at inferencing level, such as reasoning, querying, or rule learning.

5.6.1 Discussed Problems
Bias as a signal problem
One way in which representation bias might surface in knowledge graphs is that information which can be inferred is not explicitly represented in a knowledge graph. For example, the relation is married to is symmetric, and from A is married to B, one can infer that B is married to A also holds. From a logical standpoint, it is therefore sufficient to encode one of the two statements in the knowledge graph.
In [2], it was reported that a vast majority of is married to relations in DBpedia are only present in one direction, and there are far more statements where the subject is female and the object is male than vice versa [3]. This can be considered a gender-related representation bias in the knowledge graph, since the editors (of Wikipedia infoboxes, which DBpedia is created from) find this information more noteworthy for females than for males.
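As a concrete illustration of how such a representation bias can be measured, here is a minimal sketch with invented toy data (not DBpedia) that counts how many statements of a symmetric relation such as is married to lack their inverse, and which direction dominates.

```python
# Minimal sketch with toy data: measure how often a symmetric relation is
# stated in only one direction. The gender lookup is invented for illustration.
triples = [
    ("Ada", "is_married_to", "William"),
    ("Marie", "is_married_to", "Pierre"),
    ("Pierre", "is_married_to", "Marie"),   # only this pair is stated both ways
    ("Frida", "is_married_to", "Diego"),
]
gender = {"Ada": "f", "Marie": "f", "Pierre": "m",
          "William": "m", "Frida": "f", "Diego": "m"}

married = {(s, o) for s, p, o in triples if p == "is_married_to"}
one_directional = {(s, o) for (s, o) in married if (o, s) not in married}

female_subject = sum(1 for s, o in one_directional if gender[s] == "f")
print(f"{len(one_directional)} of {len(married)} statements lack the inverse; "
      f"{female_subject} of them have a female subject.")
```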

Bias as a context problem
Bias as an ethical and societal problem is another important aspect, rooted in the context of the knowledge graph, since the knowledge graph cannot be generated without context, which is usually implicit. Typical examples include political and cultural statements. The serious issue of such ethical/societal bias can be exacerbated by the naive use of a knowledge graph, and may even cause the denial of a whole knowledge graph (for example, some nations forbid the use of Wikipedia). Data and knowledge graph quality methodologies and methods are also typically context-driven. Questions to be explored include understanding the relationships between bias and quality of knowledge graphs.

Handling bias
Documenting bias is a first step to handling bias, but it is not the end of the line. Depending on the requirements and task at hand, different ways of further handling bias are possible. Applying negotiation protocols is an option for dealing with conflicting information, but may not be possible for truly controversial information. In such cases, the authors of [6] suggest allowing controversial information with additional metadata. Depending on the task at hand, bias may also be removed or handled by means of resampling methods. However, as the experiment reported above shows, this might not always be an efficient method.
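One hedged reading of “resampling” in this setting, sketched below on the same kind of toy data as above: when sampling training triples (e.g. for embeddings), oversample the under-represented direction so both appear equally often, rather than editing the graph itself. The code is my own illustration, not a method from the report.

```python
# Hypothetical sketch: oversample the under-represented direction of a relation
# when sampling training triples, so both directions appear equally often.
import random

random.seed(0)
triples = [
    ("Ada", "is_married_to", "William"),
    ("Frida", "is_married_to", "Diego"),
    ("Marie", "is_married_to", "Pierre"),
    ("Pierre", "is_married_to", "Marie"),
]
gender = {"Ada": "f", "Marie": "f", "Pierre": "m",
          "William": "m", "Frida": "f", "Diego": "m"}

# Group statements by the gender of the subject (the biased dimension above).
by_direction = {"f": [], "m": []}
for s, p, o in triples:
    by_direction[gender[s]].append((s, p, o))

# Oversample the smaller group until both groups are equally represented.
target = max(len(v) for v in by_direction.values())
balanced = []
for group in by_direction.values():
    balanced.extend(group)
    balanced.extend(random.choices(group, k=target - len(group)))

print(balanced)
```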

5.7 Generating User and Developer Buy-in

The following pain points were identified:
Getting incorrect answers to queries
Dislike of identifiers, especially opaque identifiers
Schema discovery: The same data can often be modeled in different ways in a knowledge graph. To write queries that give them the answers they need, developers need to first get an understanding of how the data they are interested in is modeled. This can be challenging, especially if exploratory tools are not at hand.
Adapting to new interfaces:
Unclear and unhelpful error messages:

5.8 A Core Knowledge Engineering Methodology for Knowledge Graphs

Conclusions and Open Questions

The knowledge graph life cycle was a focal point of discussion. There was consensus that we need a sustained effort to update and upgrade classical ontology engineering methodologies [2] and develop end-to-end open-source infrastructure to make the most of the latest neurosymbolic technologies and tools, hence taking knowledge engineering and knowledge graphs beyond structured and semi-structured data to other modalities.

In conjunction with machine learning, knowledge graphs are also used in semantic search, zero-shot learning, dialogue systems and recommender systems as a source of knowledge and explanations. Some of the best known knowledge graphs today, for instance in web search (Google, Microsoft), social networks (LinkedIn), and intelligent assistants (Siri, Alexa) achieve scales that were inconceivable decades ago – this is possible only with the help of automation, in particular using the latest developments in machine learning including generative models pre-trained on huge amounts of online data.

To continue the conversation, we provided the organizers of EKAW 2022, the 23rd International Conference on Knowledge Engineering and Knowledge Management, with input for a walkshop.

6.2 Open Questions

 
        

                 

 

Comments

  1. On “types/categories” of knowledge (conversation with Hermann and Sérgio on 13/07/23)

    (1) Factual (which should be associated at least with the provenance context), (2) Contextual (which should be associated with specific contexts beyond provenance), and (3) Counterfactual (which should be associated with specific contexts beyond provenance, making it clear that there are different perspectives and circumstances in which the truth and usefulness of the information can be evaluated, leaving room for arbitration).

    The examples of Flat Earth, electronic voting machines, vaccines, and other topics that end up being targets of Fake News seem to me to fit the Narrative category, which would be supported by a set of claims. But in this case I am not sure whether making the Context explicit would be enough to confront the narratives and allow deciding what “the Truth” is in Fact Checking approaches.

    Daniel considers that this classification can be used as a reference, but context would apply to any type of knowledge.


