Keynote on YouTube -> https://youtu.be/DZ6NlcW4YV8?si=4Z5zDA1Vx_D10GKz
No Intelligence Without Knowledge
Katja Hose
TU Wien, Austria
Abstract. Knowledge graphs and graph data in general are becoming more and more essential components of intelligent systems. This goes beyond native graph data, such as social networks or Linked Data on the Web. The flexibility of the graph model and its ability to store data relationships explicitly enables the integration and exploitation of data from very diverse sources. However, to truly exploit their potential, it becomes crucial to provide intelligent systems with verifiable knowledge, reliable facts, patterns, and a deeper understanding of the underlying domains. This talk will therefore chart a number of challenges for exploiting graphs to manage and bring meaning to large amounts of heterogeneous data and discuss opportunities with, without, and for artificial intelligence emerging from research situated at the confluence of data management, knowledge engineering, and machine learning.
Knowledge Engineering in the Era of Artificial Intelligence
Katja Hose
TU Wien, Vienna, Austria
katja.hose@tuwien.ac.at
Abstract. Knowledge engineering with respect to knowledge graphs and graph data in general is becoming a more and more essential component of intelligent systems. Such systems benefit from the wealth of structured knowledge, which includes not only native graph data, such as social networks or Linked Data on the Web, but also general knowledge describing particular topics of interest. Furthermore, the flexibility of the graph model and its ability to store data relationships explicitly enables the integration and exploitation of data from very diverse sources. Hence, to truly exploit their potential, it becomes crucial to provide intelligent systems with verifiable knowledge, reliable facts, patterns, and a deeper understanding of the underlying domains. This paper will therefore chart a number of current challenges in knowledge engineering and discuss opportunities.
1 Introduction
Most recently, large language models, and in particular ChatGPT, have gained a lot of attention. Obviously, it is very appealing to simply formulate questions in natural language and receive elaborate and detailed replies that explain an extremely broad range of complex topics. While such a system seems to be intelligent, it suffers from a similar problem as other large language models and machine learning approaches in general: the answer it returns is the most probable one, but the system cannot be certain about its correctness. In the context of ChatGPT, the latter is commonly referred to as hallucinations [11], i.e., the answer does not necessarily reflect reality but can be “made up”.
[LLMs guess plausible answers]
2 Modeling and Storing Knowledge
But there are also interoperability issues between the different graph models, query languages, and standards that hamper efficient use of graph data.
[But there are also issues of model expressiveness]
When all data is converted directly into an integrated knowledge graph, it can be queried in a single system – not only with standard queries. There are some works [23,51] on setting up semantic data warehouses, including spatio-temporal extensions.
[Location and temporal context are important in KGs because knowledge changes]
An interesting observation here is that publishing data and making it available in this way is very easy, as the publishers do not need to conform to a common integrated schema. However, this comes at the expense of query formulation and optimization, which then is considerably more complex. To formulate a query, users themselves have to know how the information in the different sources is connected – whereas this would typically be resolved when defining a common schema or table in a traditional relational database scenario.
[Schema on read ... handling the inconsistencies and incompleteness]
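To make the integration scenario concrete, here is a minimal sketch using Python and rdflib: two sources with different vocabularies are merged into a single graph, and the query only works because the user knows which predicates connect them. All URIs and property names are illustrative, not from the paper.

```python
# Minimal integration sketch with rdflib (illustrative URIs):
# two sources are merged into one graph and queried together.
from rdflib import Graph

SOURCE_A = """
@prefix ex: <http://example.org/> .
ex:alice ex:worksAt ex:tuwien .
"""

SOURCE_B = """
@prefix ex: <http://example.org/> .
ex:tuwien ex:locatedIn ex:vienna .
"""

g = Graph()
g.parse(data=SOURCE_A, format="turtle")
g.parse(data=SOURCE_B, format="turtle")

# Without a common schema, the user has to know that ex:worksAt and
# ex:locatedIn are the predicates linking the two sources.
results = g.query("""
PREFIX ex: <http://example.org/>
SELECT ?person ?city WHERE {
  ?person ex:worksAt ?org .
  ?org    ex:locatedIn ?city .
}
""")
for person, city in results:
    print(person, city)
```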
3 Querying Knowledge
The way in which knowledge is queried very much depends on the chosen data model and the way the data is physically stored.
However, many users are not familiar with the details, content, and schema of a knowledge graph and therefore have difficulties formulating structured queries.
[The KG does not always have a known schema or a single schema]
To help such users, the literature has proposed exploratory query techniques and the query-by-example paradigm [44,45]. In this case, users do not formulate structured queries directly but provide the system with examples of potential answers – the system then tries to reverse engineer a query from the desired output, executes it, and presents the results to the user, who can then iteratively refine the query until the information need is met. This is even possible for complex setups, including analytical queries over statistical knowledge graphs [43].
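The reverse-engineering step can be illustrated with a toy sketch: intersect the (predicate, object) pairs shared by all example entities and turn them into a basic graph pattern. Actual query-by-example systems [44,45] are far more sophisticated; the data and names below are made up.

```python
# Toy query-by-example: reverse engineer a SPARQL query from examples.
from rdflib import Graph, URIRef

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:vienna  ex:type ex:City ; ex:country ex:austria .
ex:graz    ex:type ex:City ; ex:country ex:austria .
ex:berlin  ex:type ex:City ; ex:country ex:germany .
""", format="turtle")

examples = [URIRef("http://example.org/vienna"),
            URIRef("http://example.org/graz")]

# Intersect the (predicate, object) pairs of all example entities.
common = set(g.predicate_objects(examples[0]))
for e in examples[1:]:
    common &= set(g.predicate_objects(e))

# Turn the shared pairs into a basic graph pattern and execute it.
patterns = " .\n  ".join(f"?x <{p}> <{o}>" for p, o in common)
query = f"SELECT ?x WHERE {{\n  {patterns} .\n}}"
print(query)
for (x,) in g.query(query):
    print(x)  # returns ex:vienna and ex:graz, but not ex:berlin
```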
Exploratory techniques for knowledge graphs cover a broad range of methods that include data profiling [1] as well as skyline queries [39].
Keles, I., Hose, K.: Skyline queries over knowledge graphs. In: Ghidini, C., et al. (eds.) ISWC 2019. LNCS, vol. 11778, pp. 293–310. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30793-6
[Would a skyline query be the same as a star query?]
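Regarding the question above: no – a star query describes the shape of a graph pattern (one central node joined to several others), while a skyline query [39] returns the answers that are not dominated under several preference criteria, i.e., the Pareto-optimal set. A minimal dominance check, with made-up (price, distance) pairs where lower is better:

```python
# Skyline = the set of answers not dominated by any other answer.
hotels = {"h1": (80, 2.0), "h2": (95, 1.0), "h3": (100, 2.5)}

def dominates(a, b):
    # a dominates b if it is at least as good in all criteria
    # and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and a != b

skyline = [h for h, v in hotels.items()
           if not any(dominates(w, v) for w in hotels.values())]
print(skyline)  # ['h1', 'h2'] -- h3 is dominated by h1
```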
Assuming that the user was able to formulate a structured query that expresses the information need ...
The OneGraph vision [41], for instance, sketches a scenario where the data model no longer determines the query languages and would allow formulating Cypher queries over an RDF store.
[Amazon Neptune's 1G model -> https://versant-pesquisadedoutorado.blogspot.com/2022/01/graph-yes-which-one-help-leitura-de.html]
4 Knowledge Quality and Metadata
Nevertheless, while OWL and RDFS have been developed to capture the meaning of data by defining proper classes, hierarchies, and constraints, SHACL has been proposed more recently as a standard to define constraints on the structure of knowledge graphs – without the need to define a proper full-fledged ontology and capture the meaning of the data. SHACL allows defining graph patterns, referred to as shapes, along with constraints that subgraphs matching the patterns/shapes should fulfill. While SHACL is being adopted more and more by the community, it still remains a challenge to avoid having to define shapes manually [56] and instead be offered semi-automatic solutions for creating them given a knowledge graph as input.
[SHACL is used when inserting/updating information or when verifying/validating the KG's content. How could SHACL be used at query time? Trigger reasoning to complete the answers?]
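A minimal validation sketch with pySHACL (shape and data are illustrative): a shape targeting ex:Person requires at least one ex:name, and the report flags the violating instance.

```python
# SHACL validation with pySHACL: shapes are graph patterns plus
# constraints; validation produces a machine-readable report.
from rdflib import Graph
from pyshacl import validate

data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:alice a ex:Person .   # violates the shape below: no ex:name
""", format="turtle")

shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ; sh:minCount 1 ] .
""", format="turtle")

conforms, report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)     # False
print(report_text)  # human-readable validation report
```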
While mining shapes from large knowledge graphs faces scalability issues, it is also important to mine meaningful shapes [57] and avoid spurious ones, i.e., those that do not occur frequently or are fulfilled by only a small proportion of matching subgraphs. Once determined, such shapes can not only be used to create validation reports but can also be used in a more interactive fashion, in a similar way as mined association rules [20], e.g., to help experts find outliers and erroneous information so that the data can be corrected and the quality can be improved [58].
[Rule and pattern mining in KG profiling -> https://versant-pesquisadedoutorado.blogspot.com/2023/02/rule-mining-with-amie-trabalho.html]
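The support intuition behind avoiding spurious shapes [57] can be sketched in a few lines: for a given class, count the fraction of instances using each property and keep only candidates above a threshold. Data and threshold below are made up.

```python
# Toy support-based shape mining: which properties are frequent enough
# among the instances of a class to justify a shape constraint?
from collections import Counter
from rdflib import Graph, URIRef
from rdflib.namespace import RDF

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:a a ex:Person ; ex:name "A" ; ex:email "a@x.org" .
ex:b a ex:Person ; ex:name "B" .
ex:c a ex:Person ; ex:name "C" .
""", format="turtle")

person = URIRef("http://example.org/Person")
instances = list(g.subjects(RDF.type, person))

# Count in how many instances each property occurs.
support = Counter()
for s in instances:
    for p in set(g.predicates(s, None)):
        support[p] += 1

THRESHOLD = 0.9  # keep properties used by >= 90% of instances
for p, n in support.items():
    if p != RDF.type and n / len(instances) >= THRESHOLD:
        print(f"candidate constraint: sh:path <{p}> , sh:minCount 1")
# ex:name qualifies (3/3); ex:email (1/3) is pruned as spurious
```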
Another way of improving quality and trust in knowledge is to provide metadata. While metadata in property graphs can be expressed by adding attributes to nodes and edges, this is not straightforward for RDF-based knowledge graphs. The latter require special constructs, such as reification, singleton properties [52], named graphs [13], or RDF-star.
While reification leads to a large increase in the number of triples (because subject, predicate, and object of the original triple are separated into their own triples), singleton properties (instantiating a unique subproperty for each triple with metadata) and named graph solutions (in the worst case creating a separate named graph for each single triple) typically also suffer from scalability issues and require verbose query constructs since existing engines are not designed to efficiently support such use cases.
[Difficulties in representing context with reification]
On the other hand, RDF-star proposes nesting triples, i.e., using a complete triple in the subject or object position of another triple. While this is very elegant from a modeling perspective, it poses several challenges for data organization and querying since nesting has not been a typical requirement so far. Still, many triple stores already support RDF-star, so it can already be used in practice.
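To make the triple blow-up concrete, here is a small rdflib sketch contrasting classic reification with the RDF-star form of the same statement. All names are illustrative; the RDF-star version is shown as Turtle-star text only, since library and store support still varies.

```python
# Reification vs. RDF-star for one statement with one piece of metadata.
from rdflib import Graph, Literal, Namespace, BNode
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# The original statement: ex:alice ex:worksAt ex:tuwien .
g.add((EX.alice, EX.worksAt, EX.tuwien))

# Classic reification: a statement resource plus subject/predicate/object
# triples, to which the metadata can finally be attached.
st = BNode()
g.add((st, RDF.type, RDF.Statement))
g.add((st, RDF.subject, EX.alice))
g.add((st, RDF.predicate, EX.worksAt))
g.add((st, RDF.object, EX.tuwien))
g.add((st, EX.since, Literal(2022)))
print(len(g))  # 6: the original + 4 reification triples + the metadata

# The RDF-star version nests the triple directly (Turtle-star text):
RDF_STAR = '<< ex:alice ex:worksAt ex:tuwien >> ex:since 2022 .'
```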
Provenance, in the sense of explaining the origin of data, is an important kind of metadata. In this sense, it is often desired to capture information about who created the data, how and when it was obtained, how it was processed, etc. In RDF, such workflow provenance [19,24] can, for instance, be encoded using the PROV-O ontology, which offers several classes with well-defined meaning for this purpose. Another type of provenance, how-provenance [21,28], describes how an answer to a particular query was derived from a given input dataset. This approach makes it possible to directly trace the input tuples/triples/edges that were combined to derive a particular answer to a query – in addition, how-provenance also returns a polynomial describing how these tuples/triples/edges have been combined for a given query answer. In general, all flavors of provenance help explain answers to structured queries and in doing so increase the trust users can have in a system. To the best of our knowledge, however, there is currently no system for knowledge graphs combining workflow provenance with how-provenance.
[Trust would be directly tied to provenance and to the ability to explain how the results were generated]
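A toy sketch of the how-provenance idea: every input edge carries an annotation variable, a join multiplies annotations, and alternative derivations of the same answer are added, yielding the provenance polynomial mentioned above [21,28]. The data and the hypothetical query are made up.

```python
# How-provenance polynomials: join = product, union = sum.
from collections import defaultdict

# (subject, predicate, object) edges, each with a provenance variable.
edges = [
    ("alice", "worksAt", "tuwien", "t1"),
    ("tuwien", "locatedIn", "vienna", "t2"),
    ("alice", "livesIn", "vienna", "t3"),  # alternative derivation
]

# Hypothetical query: which ?x are connected to vienna, either via the
# worksAt/locatedIn join or directly via livesIn?
polynomials = defaultdict(list)
for s1, p1, o1, a1 in edges:
    if p1 == "livesIn" and o1 == "vienna":
        polynomials[s1].append(a1)                    # single-edge branch
    if p1 == "worksAt":
        for s2, p2, o2, a2 in edges:
            if p2 == "locatedIn" and s2 == o1 and o2 == "vienna":
                polynomials[s1].append(f"{a1}*{a2}")  # join: product

for answer, terms in polynomials.items():
    print(answer, "->", " + ".join(terms))
# alice -> t1*t2 + t3
```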
Incompleteness -> https://youtu.be/DZ6NlcW4YV8?si=mpJoAroHpvEBjTOq&t=2377