Hofer, M., Obraczka, D., Saeedi, A., Köpcke, H., & Rahm, E. (2023). Construction of Knowledge Graphs: State and Challenges. arXiv preprint arXiv:2302.11509.
Abstract. ... In this work we first discuss the main graph models for KGs and introduce the major requirements for future KG construction pipelines. We then provide an overview of the necessary steps to build high-quality KGs, including cross-cutting topics such as metadata management, ontology development and quality assurance. We then evaluate the state of the art of KG construction w.r.t. the introduced requirements for specific popular KGs as well as some recent tools and strategies for KG construction. Finally, we identify areas in need of further research and improvement.
1. Introduction
2. KG background and requirements for KG construction
2.1. Knowledge Graph
KGs realize a physical data integration where the information from different sources is combined in a new logically centralized graph-like representation. KGs are schema-flexible and the graph structure allows a relatively easy addition of new entities and their interlinking with other entities. This is in contrast to the use of data warehouses as a popular approach for physical data integration. Data warehouses focus on the integration of data within a structured (relational) database with a relatively static schema that is optimized for certain multi-dimensional data analysis. Schema evolution is a manual and tedious process making it difficult to add new data sources or new kinds of information not conforming to the schema.
[Schema-on-read vs. schema-on-write; full schema vs. schemaless]
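A minimal sketch (my own toy example, not from the paper) of this schema flexibility: a new entity type and a new link can be added to a dictionary-based graph at any time, whereas a warehouse schema would require an explicit migration.

```python
# Toy sketch: a graph-like KG accepts new entity types and links without schema migration.
kg_nodes = {"Q1": {"type": "Person", "name": "Ada Lovelace"}}
kg_edges = []

# A new source later contributes a kind of entity the KG has never seen;
# it is simply added and interlinked -- no ALTER TABLE, no schema redesign.
kg_nodes["Q2"] = {"type": "HistoricalDocument", "title": "Note G"}
kg_edges.append(("Q1", "authored", "Q2", {"source": "sourceB"}))

print(kg_nodes)
print(kg_edges)
```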
Ehrlinger et al. [16] give a comprehensive overview of KG definitions and provide their own: "A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge." Hogan et al. [18] argue that this definition is very specific and excludes various industrial KGs which helped to popularize the concept.
[There is no consensus on the definition of a KG]
2.2. Graph Models
Support for integrity constraints is also desirable to automatically control the consistency and therefore quality of graph data to some extent.
[Wikidata has support for this, but it is not enforced]
Furthermore, it should be possible to represent annotating metadata of KG entities, e.g., about their origin and transformation during KG construction.
[Provenance context]
Additionally, it is desirable to reflect the development of the KG over time so that a temporal KG analysis is supported. This can be achieved by a temporal graph data model with time metadata for every entity and relation and temporal query possibilities, e.g., to determine previous states of the KG or to find out what has been changed in a certain time interval. The temporal development of a KG might alternatively be reflected with a versioning concept where new KG versions are periodically released.
[Temporal context ... or, in part, provenance as well]
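A minimal sketch, assuming a hypothetical edge structure, of how per-relation time metadata enables such temporal queries (the facts and dates are illustrative):

```python
from datetime import date

# Sketch: each edge carries validity-time and transaction-time metadata,
# so earlier states of the KG can be reconstructed.
edges = [
    {"s": "BerlinWall", "p": "locatedIn", "o": "EastGermany",
     "valid_from": date(1961, 8, 13), "valid_to": date(1989, 11, 9),
     "added_at": date(2020, 1, 1)},
    {"s": "BerlinWall", "p": "instanceOf", "o": "Memorial",
     "valid_from": date(1990, 1, 1), "valid_to": None,
     "added_at": date(2021, 6, 1)},
]

def kg_as_of(edges, when):
    """Return the facts that were valid at a given point in time."""
    return [e for e in edges
            if e["valid_from"] <= when and (e["valid_to"] is None or when <= e["valid_to"])]

print(kg_as_of(edges, date(1975, 1, 1)))
```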
Although RDF-Star greatly improves the formal meta expressiveness of RDF, specific cases are still not representable as in the PGM without utilizing support constructs. In the PGM we can have two equally named but independently addressable relations between two entities, both with individually resolvable edge properties and without the issue of interference. However, in RDF-Star, triples (relations) are always identified by their constituent elements, and it is not possible to attach distinguishable sets of additional data to equally named relations without overlapping or utilizing support constructs [46].
[Limitation of RDF-Star compared with the LPG]
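The difference can be made concrete with a small sketch (my own example, using networkx as a stand-in for a property graph store): the PGM keeps two equally named edges between the same nodes apart, while an RDF-Star quoted triple is identified solely by its three elements.

```python
import networkx as nx  # assumes networkx is installed; illustrative sketch only

g = nx.MultiDiGraph()
# Two "worksFor" edges between the same pair of nodes, each with its own
# independently addressable properties -- natural in the PGM.
g.add_edge("alice", "acme", key="e1", label="worksFor", role="engineer", since=2015)
g.add_edge("alice", "acme", key="e2", label="worksFor", role="consultant", since=2021)
print(g.get_edge_data("alice", "acme"))

# In RDF-Star, the quoted triple <<:alice :worksFor :acme>> is identified only by
# its three elements, so annotations attached to it all refer to the *same*
# statement -- the two employments cannot be told apart without support constructs
# such as reification or intermediate "employment" nodes.
```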
The PGM has become increasingly popular for advanced database and network applications (e.g., to analyse social networks) but its limited ontology support has so far hindered its broader adoption for KGs.
[Limitation of the LPG compared with RDF]
The Amazon Neptune database service allows users to operate PGM and RDF interchangeably. Hartig et al. [48] and Abuoda et al. [49] discuss transformation strategies between RDF and PGM to lower usage boundaries.
[Amazon's 1G (one graph) ... and the multi-layer graph model]
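As a rough illustration of one such transformation strategy (a sketch of standard RDF reification, not the specific mappings of Hartig et al. or Abuoda et al.; the prefixes and names are assumptions), a property-graph edge with properties can be turned into RDF triples as follows:

```python
# Sketch: map one property-graph edge (with edge properties) to RDF reification triples.
def pg_edge_to_rdf(src, label, dst, props, stmt_id):
    ex = "http://example.org/"           # illustrative namespace
    stmt = f"{ex}stmt/{stmt_id}"
    triples = [
        (f"{ex}{src}", f"{ex}{label}", f"{ex}{dst}"),
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", f"{ex}{src}"),
        (stmt, "rdf:predicate", f"{ex}{label}"),
        (stmt, "rdf:object", f"{ex}{dst}"),
    ]
    # Edge properties become triples about the reified statement.
    triples += [(stmt, f"{ex}{k}", str(v)) for k, v in props.items()]
    return triples

for t in pg_edge_to_rdf("alice", "worksFor", "acme", {"role": "engineer", "since": 2015}, "e1"):
    print(t)
```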
2.3. Requirements of KG construction
[Still requires manual intervention, such as ontology development, source selection, ...]
Quality Assurance. Quality assurance is a cross-cutting topic playing an important role throughout the whole KG construction process. Quality problems in the KG can be multi-faceted relating to the ontological consistency, the data quality of entities and relations (comprehensiveness), or domain coverage.
[And incompleteness with respect to Context]
3. Construction Tasks
– Metadata Management: Acquisition and management of different kinds of metadata, e.g., about the provenance of entities, structural metadata, temporal information, quality reports or process logs.
– Ontology Management: Creation and incremental evolution of a KG ontology.
[Establish in the ontology which context dimensions relations should have, and obtain this information from the data sources at extraction time]
3.1. Data Acquisition & Preprocessing
3.1.1. Source Selection & Filtering
Selecting relevant data sources and their subsets are typically manual steps but can be supported by data catalogs providing descriptive metadata about sources and their contents. Common approaches to determine such metadata are to employ techniques for data profiling, topic modelling, keyword tagging, and categorization [55, 56].
[Know the source from which the information will be extracted ... but whoever uses the information would also need to trust it!]
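A toy sketch of the kind of describing metadata a data catalog entry could hold, computed here with pandas over an assumed tabular source (the columns and values are invented):

```python
import pandas as pd  # sketch: simple data profiling to populate a data catalog entry

df = pd.DataFrame({
    "title": ["A", "B", None],
    "year": [2019, 2021, 2021],
})

profile = {
    "columns": list(df.columns),
    "row_count": len(df),
    "null_ratio": df.isna().mean().round(2).to_dict(),
    "distinct_values": df.nunique().to_dict(),
}
print(profile)  # describing metadata that helps judge relevance (and trustworthiness)
```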
3.1.2. Data Acquisition
3.1.3. Transformation & Mapping
3.1.4. Data Cleaning
Data cleaning deals with detecting and removing errors and inconsistencies from data in order to improve data quality. Whenever possible, data quality problems within the input sources should already be handled during the import process to avoid adding wrong or low-quality data to the KG. Data cleaning has received a large amount of interest, especially for structured data, in both industry and research, and there are numerous surveys and books about the topic, e.g. [76–79]. There are many different kinds of data errors and quality problems to handle, such as missing or wrong data values (e.g., due to typos), inconsistent value pairs (e.g., zip code and city), several attribute values mixed in a single free-text attribute (e.g., address or product information), duplicate or redundant information, etc. Typically, data cleaning involves several subtasks to deal with these problems, in particular data profiling to identify quality problems [80], data repair to correct identified problems, data transformation to unify data representations, and data deduplication to eliminate duplicate entities...
[But not everything is an error; it may simply be divergence resulting from multiple points of view and/or sources and/or levels of granularity]
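The subtasks named above can be illustrated with a deliberately tiny sketch (my own example; the reference table and normalization rules are assumptions):

```python
import re

# Toy illustration of profiling -> repair -> transformation -> deduplication.
records = [
    {"name": "ACME Corp. ", "zip": "04109", "city": "Leipzig"},
    {"name": "Acme Corp",   "zip": "04109", "city": "Leipzg"},   # typo in city
]

# Profiling: detect inconsistent zip/city pairs.
zip_to_cities = {}
for r in records:
    zip_to_cities.setdefault(r["zip"], set()).add(r["city"])
conflicts = {z: c for z, c in zip_to_cities.items() if len(c) > 1}

# Repair + transformation: normalize values (here via a small reference table
# and a crude name normalization: drop punctuation, collapse case/whitespace).
reference = {"04109": "Leipzig"}
for r in records:
    r["city"] = reference.get(r["zip"], r["city"])
    r["name"] = re.sub(r"[^\w ]+", "", r["name"]).strip().lower()

# Deduplication: collapse records that became identical after normalization.
deduped = list({tuple(sorted(r.items())): r for r in records}.values())
print(conflicts)
print(deduped)
```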
3.2. Metadata Management
Metadata describes data artifacts and is important for the findability, accessibility, interoperability and (re)usability of these artifacts [12, 83, 84]. There are many kinds of metadata in KGs such as descriptive metadata (content information for discovery), structural metadata (e.g. schemas and ontologies), and administrative metadata concerning technical and process aspects (e.g., provenance information, mapping specifications) [85–87]. It is thus important that KG construction supports the comprehensive representation, management and usability of the different kinds of metadata. From the perspective of KG construction pipelines, this includes metadata for each data source (schema, access specifications), each processing step in the pipeline (inputs including configuration, outputs including log files and reports), about intermediate results and of course the KG and its versions.
Moreover, for each fact (entity, relation, property) in the KG there can be metadata such as about provenance, i.e., information about the origin of data artifacts. Such fact-level provenance is sometimes called deep or statement-level provenance. Examples of deep provenance include information about the creation date, confidence score (of the extraction method) or the original text paragraph the fact was derived from. Such provenance can help to make fact-level changes in the KG without re-computing each step or to identify how and from where wrong values were introduced into the KG [84].
[Provenance and temporal. The spatial dimension rarely appears. In the case of automated KG construction, it is important to have confidence metrics for the generated triples.]
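A sketch of what such statement-level provenance could look like (the field names, source URL and extractor name are my own illustrative assumptions, not a standard vocabulary):

```python
# Sketch: deep (fact-level) provenance attached to one extracted fact.
fact = {
    "triple": ("ent:Leipzig", "locatedIn", "ent:Germany"),
    "provenance": {
        "source": "https://example.org/articles/123",   # hypothetical source URL
        "extractor": "relation-extraction-v2",          # hypothetical tool name
        "confidence": 0.87,
        "created": "2023-02-20",
        "evidence": "original text paragraph the fact was derived from ...",
    },
}

# With such metadata, a wrong value can be traced back to the source and step that
# produced it, and only the affected facts need to be recomputed.
def facts_from(facts, source):
    return [f for f in facts if f["provenance"]["source"] == source]

print(facts_from([fact], "https://example.org/articles/123"))
```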
Making the best use of metadata for KG construction calls for a metadata repository (MDR) to store and organize the different kinds of metadata in a uniform and consistent way [83].
[Kept separate from the KG, but at query time it should be retrieved together with it in order to assemble the Best Possible Answer. In Wikidata, the metadata about who created claims and when is kept separately]
Fact-level metadata (or annotations) in the KG can be stored either together with the data items (embedded metadata) or in parallel to the data and referenced using unique IDs (associated metadata) [87]. For example, fact-level metadata can support the selection of values and sub-graphs [90], or compliance with the licenses used in target applications. Such annotations are also useful for other kinds of metadata. Temporal KGs can be realized by temporal annotations to record the validity time interval (period during which a fact was valid) and transaction time (time when a fact was added or changed) [34, 44]. The possible implementations for fact-level annotations depend on the used graph data model (see Section 2.2).
[In Wikidata, provenance metadata (References) and temporal metadata (Qualifiers) are attached at the level of each claim]
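The two storage options can be sketched as follows (toy structures and values, not a prescribed schema):

```python
# (a) Embedded metadata: annotations live inside the data item itself.
embedded = {
    "id": "fact-42",
    "triple": ("ent:Berlin", "population", 3769000),
    "valid_from": "2019-12-31",        # validity time
    "recorded_at": "2020-03-01",       # transaction time
    "source": "src:city-register",
}

# (b) Associated metadata: the KG stores only the fact plus an ID; annotations
# are kept in a parallel store and joined via that ID when needed.
facts = {"fact-42": ("ent:Berlin", "population", 3769000)}
annotations = {"fact-42": {"valid_from": "2019-12-31",
                           "recorded_at": "2020-03-01",
                           "source": "src:city-register"}}

fact_id = "fact-42"
print(facts[fact_id], annotations[fact_id])
```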
3.3. Ontology Management
3.3.1. Ontology Learning
3.3.2. Ontology/schema matching
3.3.3. Ontology Integration
3.4. Knowledge Extraction
Knowledge extraction is a process to obtain structured, more computer-readable data from unstructured data such as texts or semi-structured data, like web pages and other markup formats. ... The main steps of text-based knowledge extraction are named-entity recognition, entity linking, and relation extraction.
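A deliberately simplistic, rule-based sketch of these three steps (real pipelines use trained models; the gazetteer, KG identifiers and extraction pattern below are invented):

```python
import re

text = "Ada Lovelace worked with Charles Babbage on the Analytical Engine."

# 1) Named-entity recognition via a small gazetteer (illustrative only).
gazetteer = {"Ada Lovelace": "PERSON", "Charles Babbage": "PERSON",
             "Analytical Engine": "WORK"}
mentions = [(m, t) for m, t in gazetteer.items() if m in text]

# 2) Entity linking: map each mention to a KG identifier (hypothetical IDs).
kb_links = {"Ada Lovelace": "ent:AdaLovelace",
            "Charles Babbage": "ent:CharlesBabbage",
            "Analytical Engine": "ent:AnalyticalEngine"}
linked = {m: kb_links[m] for m, _ in mentions}

# 3) Relation extraction: a simple lexical pattern yields a candidate triple.
m = re.search(r"(.+?) worked with (.+?) on", text)
if m and all(x.strip() in linked for x in m.groups()):
    triple = (linked[m.group(1).strip()], "collaboratedWith", linked[m.group(2).strip()])
    print(mentions)
    print(triple)
```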
3.5. Entity Resolution and Fusion
Entity resolution (ER), also called entity matching, deduplication or link discovery, is a key step in data integration and for good data quality. It refers to the task of identifying entities either in one source or different sources that represent the same real-world object, e.g., a certain customer or product.
[This relates to the CKG concept of Identity. Identity should be part of the Best Possible Answer]
3.5.1. Incremental Entity Resolution
Entity resolution is challenging due to the often limited quality and high heterogeneity of different entities. It is also computationally expensive because the number of comparisons between entities typically grows quadratically with the total number of entities. ... For incremental ER the task is to match sets of new entities from one or several sources with the current version of the KG which is typically very large and contains entities of different types. It is thus beneficial to know the type of new entities from previous steps in the KG construction pipeline so that only KG entities of the same or related types need to be considered.
[Each input entity needs to be compared with all entities in the current KG to decide whether a new one should be created or whether it already exists. What are the "comparison" techniques? Group/cluster the entities. Could the attributes and relations used in this comparison be considered the Identity of the entity, or only its ID? Are these "features" explained during the clustering process, or does the algorithm only produce a similarity metric without indicating which elements were used?]
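A sketch of incremental ER with type-based blocking and a simple string-similarity matcher (toy data; the threshold is an assumption, and real systems use learned or rule-based matchers plus clustering):

```python
from difflib import SequenceMatcher

# Block by entity type so that a new entity is only compared with KG entities
# of the same (or related) type, then score pairwise similarity.
kg_entities = [
    {"id": "kg:1", "type": "Person",  "name": "Ada Lovelace"},
    {"id": "kg:2", "type": "Person",  "name": "Alan Turing"},
    {"id": "kg:3", "type": "Company", "name": "Acme Corp"},
]
new_entity = {"type": "Person", "name": "A. Lovelace"}

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = [e for e in kg_entities if e["type"] == new_entity["type"]]  # blocking
scored = sorted(((sim(e["name"], new_entity["name"]), e) for e in candidates),
                reverse=True, key=lambda x: x[0])

best_score, best = scored[0]
THRESHOLD = 0.7   # assumption; real matchers are learned or rule-based
print("match" if best_score >= THRESHOLD else "new entity", best, round(best_score, 2))
```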
3.5.2. Entity Fusion
Merging multiple records of the same real-world entity into a single, consistent, and clean representation is referred to as data fusion [165]. This is a main step in data integration as it combines information from several entities into one enriched entity. Data fusion still entails resolving inconsistencies in the data. First, the records may disagree on the names of matching attributes, so that one preferred name has to be chosen that should be consistent with the attribute names of other entities of the same type to facilitate querying. Furthermore, the matching records can disagree on the values of an attribute. There are three main strategies to handle such attribute-level inconsistencies or conflicts [165]:
– Conflict Ignorance: The conflict is not handled but the different attribute values may be retained or the problem can be delegated to the user application.
– Conflict Avoidance: It applies a single strategy uniformly to all data. For example, it prioritizes data from trusted sources over others.
– Conflict Resolution: It considers all data and metadata before deciding, applying a specified strategy such as taking the most frequent, the most recent or a randomly selected value.
Such techniques were first applied for relational data but also found use for Linked Data fusion [166].
[Can be applied at KG construction time (a priori approach)]
[Could this be incorporated into the Best Possible Answer? Ignore the conflict at construction time but flag it at query-answering time]
[And what if it is not a conflict but rather multiple perspectives, where the consumer is the one who must choose what to consider true and useful? ... a trust layer. The trust layer must be resilient to conflicts, when they exist.]
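The three strategies can be contrasted with a small sketch over one conflicting attribute (toy values; the trust scores and dates are assumptions):

```python
from collections import Counter

# Candidate values for one attribute of a fused entity, with simple metadata.
values = [
    {"value": "Leipzig", "source": "src:A", "trust": 0.9, "date": "2022-01-01"},
    {"value": "Leipzig", "source": "src:B", "trust": 0.6, "date": "2021-06-01"},
    {"value": "Dresden", "source": "src:C", "trust": 0.4, "date": "2023-03-01"},
]

def ignore(vals):                 # Conflict Ignorance: keep everything, let the application decide
    return [v["value"] for v in vals]

def avoid(vals):                  # Conflict Avoidance: one fixed rule, e.g. most trusted source wins
    return max(vals, key=lambda v: v["trust"])["value"]

def resolve_most_frequent(vals):  # Conflict Resolution: inspect data/metadata, e.g. majority vote
    return Counter(v["value"] for v in vals).most_common(1)[0][0]

def resolve_most_recent(vals):    # Conflict Resolution: take the most recent value
    return max(vals, key=lambda v: v["date"])["value"]

print(ignore(values), avoid(values), resolve_most_frequent(values), resolve_most_recent(values))
```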
3.6. Quality Assurance
Quality improvement aims at fixing or mitigating the detected quality issues by refining and repairing the KG. This encompasses inferring and adding missing knowledge to the graph, or identifying and repairing erroneous pieces of information in order to improve data quality. Quality evaluation is not only important for the resulting KG as outcome of the KG construction process but also within the different construction tasks, such as data cleaning for acquired data, knowledge extraction, ontology evolution or entity fusion.
[Complete the KG with the missing context information. Use the query log to prioritize acquiring new sources that complement the missing context information of existing claims.]
3.7. Knowledge Completion
Knowledge Graph completion is the task of adding new entries (nodes, relations, properties) to the graph using existing relations.
3.7.1. Type completion
3.7.2. Link Prediction
3.7.3. Data Enrichment & Polishing
To enhance KG data with additional relevant domain entity information, external knowledge bases can be queried based on extracted (global) persistent identifiers (PIDs). For example, extracted ISBN numbers, DOIs, or ORCIDs allow requesting additional external information from Wikidata; ... Paulheim surveys approaches that exploit links to other KGs in order to not only verify information but also to find additional information to fill existing gaps [15].
[15] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8(3) (2017), 489–508.
[External IDs would not just be claims/triples about entities but also sources of additional information. Used as a reference, an external ID suggests that the claim was obtained from the external source and replicated in Wikidata]
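A sketch of such PID-based enrichment against the public Wikidata SPARQL endpoint (assuming P356 is the DOI property; the helper name and User-Agent string are my own):

```python
import requests  # sketch: enrich a KG entity via an extracted DOI using Wikidata's SPARQL endpoint

def wikidata_items_for_doi(doi):
    # Assumption: P356 holds DOIs in Wikidata, stored upper-cased.
    query = (
        'SELECT ?item ?itemLabel WHERE { '
        f'?item wdt:P356 "{doi.upper()}" . '
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }'
    )
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"},
                     headers={"User-Agent": "kg-enrichment-sketch/0.1"})  # illustrative UA string
    r.raise_for_status()
    return [(b["item"]["value"], b["itemLabel"]["value"])
            for b in r.json()["results"]["bindings"]]

# Usage (requires network access); the DOI below is the SAGA paper's DOI cited further down.
# print(wikidata_items_for_doi("10.1145/3514221.3526049"))
```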
4. Overview of Knowledge Graph Construction Pipelines and Toolsets
In our selection we try to cover popular KGs such as DBpedia and Yago as well as more current approaches for either a single domain or several domains (cross domain). Most importantly, we focus on KG projects described in peer-reviewed articles and discuss closed KGs only briefly as their data is not publicly accessible and the used techniques are not verifiable. Such closed KGs are typically developed and used in companies such as company-specific Enterprise KGs [213] and the KGs of big web and IT companies such as Google [195], Amazon [196], Facebook, Microsoft [214], Tencent, or IBM.
[Open or closed, the issue of multiple perspectives remains. How do they resolve conflicts? And how do they handle missing context?]
Table 4 - Comparison of KG construction approaches ... Metadata: Provenance, Temporal and Others
Wikidata allows the annotation of entities by key-value pairs with a validity time, provenance, and other meta information such as references [216]. As a Wikimedia project, full data dump snapshots are released twice a month.
The ORKG [205] focuses on publications where manually uploaded papers are automatically enriched with metadata. The platform provides tools to extract information such as tables and figures from publications and to help find and compare similar publications of interest.
[Cases of open KGs that raise the same challenges]
Collected Metadata. We consider whether deep or fact-level provenance, temporal information (e.g., validity time) and additional metadata such as aggregated statistics, process reports, or descriptive and administrative information are collected and added to the KG or a separate repository. The acquisition of provenance data is the most common kind of metadata support and ranges from simple source identifiers and confidence scores up to the inclusion of the original values. Several systems maintain temporal metadata while further metadata is hardly supported or at least not described.
[Would Context always be metadata, or could it also be data? How to separate them automatically if everything is part of the KG? In Wikidata, for example, it is possible to distinguish qualifiers and references using prefixes.]
Construction Tasks
Entity Resolution - this task is supported by only a few approaches, and the pipelines that do employ ER tend to use sophisticated methods like blocking to address scalability issues (ArtistKG, SAGA) and machine-learning-based matchers (SAGA).
[ML-based matchers do not "usually" come with an explanation of which attributes were used!!!]
Entity Fusion - this is the least supported task in the considered solutions. None of the dataset specific KGs performs classical (sophisticated) entity fusion in the manner of consolidating possible value candidates and selecting final entity ids or values. Instead, the final KG often contains a union of all extracted values, either with or without provenance, leaving final consolidation/selection to the targeted applications.
[Most use the Conflict Avoidance approach, leaving it to the application to decide on the attributes and relations of the entity.]
4.2. KG Specific Solutions
4.3. KG Frameworks & Strategies
SAGA [47]. This closed-source toolset supports multi-source data integration for both batch-like incremental KG construction and continuous KG updates. The internal data model extends standard RDF to capture one-hop relationships among entities, provenance (source), and trustworthiness of values. The system supports source change detection and delta computation using their last snapshots. Based on detected changes, SAGA executes parallel batch jobs to integrate an updated or new source into its target graph. SAGA’s ingestion component requires mappings from new data to the internal KG ontology.
This step only requires predicate mappings, as the subject and object fields can remain in their original namespace and are linked later in the process. The required mappings are mostly manually defined and stored as supplementary configuration files.
Additionally, data can be reprocessed with the HoloClean tool [81] for data repair. SAGA is able to detect and disambiguate entities from text and (semi-)structured sources. To make the deduplication step scalable it groups entities by type and performs simple blocking to further partition the data into smaller buckets. A matching model computes similarity scores and machine-learning- or rule-based methods are applicable to determine likely matches. Correlation clustering [241] is then utilized to determine matching entities.
The system tracks same-as links to original source entities to support debugging. For entity fusion, (conflicting) entity attribute values are harmonized based on truth discovery methods and source reliability to create consistent entities.
In addition to the stable KG (updated in batches), the system can maintain a Live Graph, which continuously integrates streaming data and whose entities reference the stable entities of the batch-based KG. For scalability and near-real-time query performance, the live graph uses an inverted index and a key-value store. SAGA supports live graph curation by using a human-in-the-loop approach. The authors mention that SAGA powers question answering, entity summarization, and text annotation (NER) services.
[SAGA - Ihab Ilyas]
[47] Ihab F. Ilyas, Theodoros Rekatsinas, Vishnu Konda, Jeffrey Pound, Xiaoguang Qi, and Mohamed Soliman. 2022. Saga: A Platform for Continuous Construction and Serving of Knowledge at Scale. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22). Association for Computing Machinery, New York, NY, USA, 2259–2272. https://doi.org/10.1145/3514221.3526049
During fusion, we use standard methods of truth discovery and source reliability methods [24, 25, 39, 67] to estimate the probability of correctness for each consolidated fact. These algorithms reason about the agreement and disagreement across sources and also take into account ontological constraints. The associated probability of correctness is stored as metadata in the KG and used by downstream tasks such as targeted fact curation (see Section 6).
[SAGA uses Conflict Resolution to discover the truth, using the trust level of the sources]
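A much-simplified sketch in the spirit of such truth discovery (not SAGA's actual algorithm): value confidence and source reliability are estimated jointly in a fixed-point iteration over toy claims.

```python
# Toy truth-discovery sketch: iterate between value confidence and source reliability.
claims = {               # fact -> {candidate value: [sources asserting it]}
    ("Berlin", "population"): {"3.7M": ["src:A", "src:B"], "3.4M": ["src:C"]},
    ("Leipzig", "country"):   {"Germany": ["src:A", "src:C"], "Austria": ["src:B"]},
}
sources = {"src:A", "src:B", "src:C"}
reliability = {s: 0.8 for s in sources}          # uniform prior (assumption)

for _ in range(10):
    # 1) Value confidence = normalized sum of the reliabilities of its supporters.
    confidence = {}
    for fact, options in claims.items():
        scores = {v: sum(reliability[s] for s in srcs) for v, srcs in options.items()}
        total = sum(scores.values())
        confidence[fact] = {v: sc / total for v, sc in scores.items()}
    # 2) Source reliability = average confidence of the values it asserts.
    for s in sources:
        supported = [confidence[f][v] for f, opts in claims.items()
                     for v, srcs in opts.items() if s in srcs]
        reliability[s] = sum(supported) / len(supported)

best = {f: max(c, key=c.get) for f, c in confidence.items()}
print(best)
print({s: round(r, 2) for s, r in reliability.items()})
```

Real methods additionally account for ontological constraints and store the resulting probability of correctness as fact-level metadata, as the quoted passage describes.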
5. Discussion & Open Challenges
Data and metadata management. Good data and metadata management is vital in an open and incremental KG process. Only a few solutions even mention an underlying management architecture supporting the construction processes. Having uniform access or interfaces to data and relevant metadata can drastically improve the quality of the former [83] and increases the workflow’s replicability and possibilities for debugging. A dedicated metadata repository can store used mappings, schemata, and quality reports, improving the transparency of the entire pipeline process.
[The mapping from the KG to the CKG could be kept in the MDR together with the KG schema - a mapping at the schema level]
[But what if the mapping is at the instance level, in order to optimize querying?]
[Is there a separation between schema and instances in the KG? Logically no, but they can be separated physically and associated via URIs]
Metadata support is limited in current solutions and only some pipeline approaches acknowledge the importance of provenance tracking and debugging possibilities. We found that the term provenance is rather vaguely used, mostly in the meaning of tracking the source of facts and the processes involved in their generation. Only a few approaches such as SAGA [47] also try to maintain the trustworthiness of facts. Metadata such as fact-level provenance should be used more to support construction tasks, such as for data fusion to determine final entity values.
In general there is a need for maintaining more metadata, especially temporal information, that is also essential for studying the evolution of KG information. Support for developing temporal KGs maintaining historical and current data, compared to the common sequences of static KG snapshot versions, is also a promising direction.
[50] Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. 2016. A survey on truth discovery. ACM Sigkdd Explorations Newsletter 17, 2 (2016), 1–16.