
Integration of Scholarly Communication Metadata Using Knowledge Graphs - Article Reading Notes

Sadeghi A., Lange C., Vidal ME., Auer S. (2017) Integration of Scholarly Communication Metadata Using Knowledge Graphs. In: Kamps J., Tsakonas G., Manolopoulos Y., Iliadis L., Karydis I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol 10450. Springer, Cham. https://doi.org/10.1007/978-3-319-67008-9_26

Abstract

Important questions about the scientific community, e.g., what authors are the experts in a certain field, or are actively engaged in international collaborations, can be answered using publicly available datasets. However, data required to answer such questions is often scattered over multiple isolated datasets. 

Recently, the Knowledge Graph (KG) concept has been identified as a means for interweaving heterogeneous datasets and enhancing answer completeness and soundness. 

We present a pipeline for creating high quality knowledge graphs that comprise data collected from multiple isolated structured datasets. 

We conducted an experimental study on a scholarly communication metadata knowledge graph (SCM-KG) that merges scientific research metadata from the DBLP bibliographic source and the Microsoft Academic Graph (MAG).

The observed results provide evidence that queries are processed more effectively on top of the SCM-KG than over the isolated datasets, while execution time is not negatively affected.

1 Introduction

A data source can be rich in one aspect and sparse in others.
 
For example, the DBLP computer science bibliography database gathers ample information about publications in specific conferences but has sparse data about their keywords and no data about citations. Furthermore, it lacks metadata on publications in other fields of research. The Microsoft Academic Graph fills these gaps but is less complete in every scientific field. We claim that collecting research communication metadata from heterogeneous sources and integrating them in a queryable environment not only leads to a more robust knowledge base but also, thanks to increased completeness, enables more effective data analysis.
 
In this work, we created an integrated graph of scientific knowledge from DBLP and the Microsoft Academic Graph and describe the challenges in matching, linking, and integrating the datasets, as well as our approach to addressing these challenges as a methodology that can be reused to build similar knowledge graphs. We present the application of semantic-structure-based similarity measures in instance matching and show that traditional linking frameworks such as Silk can link with high relative precision and recall when they consider data semantics during the linking process.
 
2 Motivating Example
 
In the latest DBLP version of April 2017, there are four authors named “Christoph Lange”, indexed 0001 to 0004. When one of these four persons publishes a new article, the maintainers of DBLP face the challenge of linking the article to the right person using his affiliation, but DBLP keeps only the current affiliation. By matching authors’ publications and recent affiliations, we can link DBLP authors to MAG authors. Now, an old, unindexed publication by a researcher named “Christoph Lange” can be matched against the author and affiliation information in the unified knowledge graph and linked to the correct person entity – at least when no two different persons published at the same institution at different times. This example shows how combining multiple available data sources can solve an ambiguity problem.
 
** The affiliation relation depends on the temporal context. **
 
3 SCM Knowledge Graph Concept
 
In this section, we first define basic principles of knowledge graphs and then our notion of a scholarly communication metadata knowledge graph (SCM-KG).
 
Identification. A key prerequisite for a knowledge graph is to uniquely identify things. All entities of interest should be uniquely identified by Uniform/Internationalized Resource Identifiers (URI/IRI).
 
Representation. We need to ensure that information about these things can be easily understood by different parties. The W3C Resource Description Framework (RDF) has meanwhile evolved into the lingua franca of data integration.
 
Integration. For data exchange in a digitized domain to scale, organizations and involved people need to develop a common understanding of the data. Vocabularies define common concepts (classes) and their attributes (properties) and assign unique identifiers to them.
 
Coherence. ... transformation techniques for the RDF data model have been standardized by the W3C.
** Triplification: the RDB2RDF and CSV2RDF standards **
Access.
Coverage. ... incompleteness justifies the need for an integrated knowledge graph. 
 
 
 
 
Identification is provided by a scholarly schema such as ORCID for authors, DOI for articles and books, or ISBN for books.
 
Common RDF-based vocabularies for knowledge integration include those from the SPAR family of ontologies.
 
** LOD pattern with reuse of well-known ontologies ... The Semantic Publishing and Referencing Ontologies, a.k.a. SPAR Ontologies, http://www.sparontologies.net/ **
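A minimal sketch of these principles in code: entities get IRIs from established scholarly identifier schemes and are described in RDF with reused vocabularies. The ORCID and DOI values below are hypothetical, and FOAF/Dublin Core terms stand in for the SPAR-style vocabularies mentioned above.

```python
# Minimal sketch of the identification and representation principles above.
# The ORCID and DOI values are hypothetical; FOAF and Dublin Core terms stand
# in for the SPAR-style vocabularies the paper reuses.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF, XSD

g = Graph()
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)

# Unique identification: IRIs minted from established scholarly schemes.
author = URIRef("https://orcid.org/0000-0000-0000-0000")     # hypothetical ORCID
article = URIRef("https://doi.org/10.0000/example.2017.42")  # hypothetical DOI

# RDF representation reusing common vocabularies.
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Christoph Lange")))
g.add((article, RDF.type, FOAF.Document))
g.add((article, DCTERMS.title, Literal("Integration of Scholarly Communication Metadata Using Knowledge Graphs")))
g.add((article, DCTERMS.creator, author))
g.add((article, DCTERMS.issued, Literal("2017", datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```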
 
4 Building a Knowledge Graph
 
As input to SCM-KG-PIP, heterogeneous data arrives in different formats, such as CSV, RDF, web pages, or data returned by calling Web APIs. Our approach results in a high-quality, queryable semantic knowledge graph using a unified schema.
 
 
 
4.1 Data Acquisition
 
Data available in heterogeneous sources can be obtained in different ways. When they are available as structured dumps, e.g., as CSV, SQL or RDF, their structure may not match the target ontology. ...
Data from Web APIs, another source of structured data, can be collected by gradual querying. Usually, the number of API calls in a specific time window is limited; therefore, throttling has to be applied to requests. ...
When structured data is not provided through open interfaces, one may be forced to resort to scraping data from web pages. Currently, Google Scholar and ResearchGate, two highly relevant sources of data about authors’ current affiliations and recent publications, do not provide ways to access metadata other than by web scraping. 
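A small sketch of the throttled, paginated harvesting described above; the endpoint, its parameters, the response shape, and the rate limit are all hypothetical.

```python
# Sketch of throttled, paginated harvesting from a Web API, as described above.
# The endpoint, parameters, response shape, and rate limit are hypothetical.
import time
import requests

API_URL = "https://api.example.org/publications"   # hypothetical endpoint
MAX_CALLS_PER_MINUTE = 30                           # assumed rate limit

def harvest(query: str):
    """Yield publication records page by page, respecting the rate limit."""
    delay = 60.0 / MAX_CALLS_PER_MINUTE
    page = 1
    while True:
        resp = requests.get(API_URL, params={"q": query, "page": page}, timeout=30)
        resp.raise_for_status()
        records = resp.json().get("results", [])   # assumed JSON layout
        if not records:
            break
        yield from records
        page += 1
        time.sleep(delay)  # throttle so we stay within the API's call window

if __name__ == "__main__":
    for record in harvest("knowledge graphs"):
        print(record.get("title"))
```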
 
** Diversity of sources and difficulties of access **
 
4.2 Ontology Engineering
 
Different structured data sources may use different schemas, e.g., DBLP and MAG model the same concepts (e.g., affiliation) differently.
Creating an integrated knowledge graph requires a mapping step to accommodate these differences, e.g., one that can model both an author’s current affiliation and earlier ones.
 
Challenges in integration occur with structured datasets whose schemas model the same concept differently from the ontology of the knowledge graph built so far. ... As Nguyen describes, a conflict on the concept level occurs when classes with the same name have different structures in two merged ontologies. We encountered this issue when mapping the affiliation property. We addressed it by keeping the more descriptive vocabulary in our ontology model and pruning the other, conflicting vocabulary from the model.
The notion of an author’s affiliation has a temporal dimension that swrc:affiliation used by DBLP does not cover, as it merely models the current affiliation, not the affiliation at the time a certain article was published. We simplified a temporal modeling approach proposed by Nuzzolese et al. [8] by following the reification pattern of MAG’s paperAuthorAffiliations table, i.e., turning each ternary relation of a publication, its authors and their affiliations at the time of publication into a resource.
 
** Different ways of modeling the same concept **
** Reification to model the relation between Author and Affiliation with the temporal context provided in MAG **
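A sketch of this reification pattern, assuming an illustrative integration namespace: the ternary relation between a publication, one of its authors, and that author's affiliation at publication time is turned into a resource of its own, mirroring MAG's paperAuthorAffiliations rows. The class and property names are illustrative, not the paper's exact schema.

```python
# Sketch of the reification pattern described above: each ternary relation of
# (paper, author, affiliation at publication time) becomes its own resource.
# Namespace, class, and property names are illustrative.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SCM = Namespace("http://example.org/scm/")  # hypothetical integration namespace
g = Graph()
g.bind("scm", SCM)

paper = SCM["paper/tpdl2017-26"]
author = SCM["author/christoph-lange"]
affiliation = SCM["org/example-university"]

# One resource per (paper, author, affiliation) combination.
paa = SCM["paperAuthorAffiliation/tpdl2017-26_lange"]
g.add((paa, RDF.type, SCM.PaperAuthorAffiliation))
g.add((paa, SCM.paper, paper))
g.add((paa, SCM.author, author))
g.add((paa, SCM.affiliation, affiliation))  # affiliation at publication time

print(g.serialize(format="turtle"))
```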
 
4.3 Mapping Data to an Ontology
 
Data acquired from different sources can follow a variety of data models (e.g., graph, relational, tree) or even be unstructured. Thus, having acquired the data, and having modeled a common integration ontology, the next step of constructing a knowledge graph is to convert all data into a common model. 
 
For CSV sources such as MAG, the Sparqlify-CSV tool [5] maps the source ontology to the integration ontology. To use Sparqlify-CSV we expressed mapping rules in its intuitive Sparqlification Mapping Language [14].
 
** See https://github.com/SmartDataAnalytics/Sparqlify, which does not use RDB2RDF **
 
5. Ermilov, I., Auer, S., Stadler, C.: User-driven semantic mapping of tabular data. In: 9th International Conference on Semantic Systems, ISEM, pp. 105–112 (2013)
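The actual mapping is expressed declaratively in SML and executed by Sparqlify-CSV; the following sketch only illustrates the idea of lifting a hypothetical MAG-like CSV file into the integration ontology.

```python
# Illustrative CSV-to-RDF mapping. The actual pipeline uses Sparqlify-CSV with
# declarative SML rules; this sketch only shows the idea on a hypothetical
# MAG-like 'Papers' file with paper_id, title, and year columns.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, XSD

SCM = Namespace("http://example.org/scm/mag/")   # hypothetical target namespace

def map_papers_csv(path: str) -> Graph:
    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("scm", SCM)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            paper = SCM["paper/" + row["paper_id"]]
            g.add((paper, RDF.type, SCM.Paper))   # class from the assumed integration ontology
            g.add((paper, DCTERMS.title, Literal(row["title"])))
            g.add((paper, DCTERMS.issued, Literal(row["year"], datatype=XSD.gYear)))
    return g
```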
 
4.4 Calculating Similarity and Instance Matching
 
... over the level of mapping instances where multiple instances refer to the same real-world thing. We therefore added a data linking step to our pipeline. First of all, we keep data integrated from different sources in separate URI namespaces to avoid clashes in case different sources use the same identifiers. We then created “same as” links between different URIs referring to the same thing by instance matching. Articles can be matched by common title, publication year and, if provided, the name of the conference or journal. To increase linking coverage, we accounted for the variations in punctuation and letter case of title strings across datasets and compared the titles using the Jaccard similarity measure. We implemented these conversions and comparisons and the linking of the articles using the Silk workbench [18].
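A sketch of the article-matching idea: normalize punctuation and letter case, then compare title token sets with the Jaccard measure. In the pipeline this is configured in the Silk workbench; the threshold below is an assumption.

```python
# Sketch of the article matching above: normalize punctuation and case, then
# compare title token sets with Jaccard similarity. The threshold is assumed.
import re

def tokens(title: str) -> set[str]:
    """Lowercase the title, strip punctuation, and split into a token set."""
    return set(re.sub(r"[^\w\s]", " ", title.lower()).split())

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_article(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Match on similar title plus identical publication year."""
    return a["year"] == b["year"] and jaccard(a["title"], b["title"]) >= threshold
```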
 
** But it did not work well for linking persons / authors **
 
We tackled this problem using the semantic relations of the persons with their articles. In our data sources, persons only occur in the role of authors of publications; additionally, we can rely on links between papers as identified in the previous step. We leverage this semantics by embedding it into the author molecules (here, a “molecule” refers to a set of one node in the knowledge graph and the immediate links to its neighbors). First we create a hash for each article. Provided that instance matching of articles is performed in the last step ...
 
By applying a substring similarity metric defined by Stoilos et al. [15] on the concatenated list of unique IDs of articles, we can discover if two instances of Person have common publications.
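A simplified sketch of the author-molecule matching: instead of the Stoilos et al. substring metric over concatenated article IDs, plain set overlap over the already-linked article IDs stands in, and the per-author indexes are assumed to be precomputed.

```python
# Simplified sketch of the author "molecule" matching described above. The
# pipeline concatenates hashed IDs of already-linked articles and applies the
# Stoilos et al. substring metric; here, plain set overlap over those article
# IDs stands in for that metric. The indexes {author_uri: {article_id, ...}}
# are assumed to be built in the previous (article linking) step.
def molecule(author_uri: str, articles_by_author: dict[str, set[str]]) -> set[str]:
    """Canonical IDs of the articles directly linked to an author node."""
    return articles_by_author.get(author_uri, set())

def likely_same_person(dblp_author: str, mag_author: str,
                       dblp_index: dict[str, set[str]],
                       mag_index: dict[str, set[str]],
                       min_shared: int = 1) -> bool:
    """Two author instances are linked when their molecules share publications."""
    shared = molecule(dblp_author, dblp_index) & molecule(mag_author, mag_index)
    return len(shared) >= min_shared
```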
 
4.5 Producing and Querying a KG
 
Our objective in the final pipeline step is to store all the data in a form that is accessible via SPARQL queries. We employed the high-performance Apache Jena TDB as our RDF store. After importing our data into TDB we configured Apache Jena Fuseki 2 to make the data queryable using SPARQL 1.1, both from the command line and, via HTTP, from a SPARQL endpoint. 
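A sketch of querying the published endpoint over HTTP; the Fuseki dataset name and the property names (reused from the illustrative examples above) are assumptions, not the actual SCM-KG schema.

```python
# Sketch of querying the resulting store over HTTP, assuming a local Fuseki
# instance serving a dataset named 'scm-kg'; dataset name and property names
# are assumptions.
import requests

ENDPOINT = "http://localhost:3030/scm-kg/sparql"

QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
SELECT ?title WHERE {
  ?paper dcterms:creator ?author ;
         dcterms:title   ?title .
  ?author foaf:name "Christoph Lange" .
}
LIMIT 10
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["title"]["value"])
```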
 
5 Related Work
 
Recent approaches toward constructing knowledge graphs, e.g., NOUS [2], Knowledge Vault [3] or NELL [1] focus on materializing a knowledge graph by inferring relations in the existing data.
 
** Here the focus is on integrating different sources **
 
This step distinguishes our pipeline from the research of Szekely et al. They integrated data by building a new ontology model, while we modified the existing ontology model of the manually maintained DBLP and aggregated other vocabularies to it. The vocabulary used in DBLP already combines the common vocabularies for describing scientific metadata. Therefore, we added other terms and vocabularies, or modified the current model, where the DBLP vocabulary did not sufficiently describe the data being integrated.
 
Another difference between the two works is the ETL component. From a technical perspective, Szekely et al. used the Karma framework [6] for data mapping.
 
** Another ETL tool **
 
Traverso et al. [17] suggested applying semantics in relation discovery in existing knowledge graphs. Similarly, we apply the concept of semantic molecular similarity, but we use the semantic relations in the network toward the linking of instances during the creation of a knowledge graph.
 
6 Evaluation and Results
 
We conducted an empirical evaluation to study the effectiveness of the proposed pipeline in creating a knowledge graph from different data sources in the domain of scholarly communication metadata (SCM-KG). We assessed the following research questions:
RQ1) Can relative answer completeness be enhanced when queries are executed against an SCM-KG instead of the original sources? Is the query execution time affected when queries are executed against an SCM-KG?
RQ2) How accurate is the linking of the integrated dataset in terms of precision and relative recall?
RQ3) How much data can be processed per second in the mapping and linking steps of the pipeline?
 
We executed each query 15 times, each time instantiated with a different author. We selected these 15 authors from among the most prolific authors at the WWW conference, as found by another SPARQL query over the SCM-KG.
 
Queries: In the next two experiments, we defined queries and compared their results over the integrated knowledge graph with their evaluation on the isolated source datasets.
 
Metrics: We evaluated how much the integration enhanced the accuracy and completeness of the query results.
 
** Where are the dataset and the queries? **
 
6.1 Experiment One: Relative Completeness
 
Publications and the number of hits in the different datasets were collected.
Queries were executed for each of the 15 selected authors over the three datasets, and the result sets were compared in terms of relative completeness. Comparing the number of WWW publications in MAG, DBLP, and SCM-KG, we observed that although DBLP contains more articles for the selected authors, there exist articles that are only included in MAG. The mapping and linking process allows for identifying common articles in both datasets; thus, the resulting dataset includes more articles for these authors.
 
6.2 Experiment Two: Linking Accuracy and Relative Coverage
 
The comparison of indirectly integrated duplicate author entries in MAG, due to instance matching between MAG and DBLP, indicates correct linking (TP) with a precision of 1 in all cases and an average recall of 0.986. Secondly, we tested whether, per author, the linked articles are matched to the correct equivalent items across datasets. The linking performed in this experiment had a precision of 1 and an average recall of 0.982; these results show the positive effect of using semantic molecular relations in linking.
 
** true positives (TP) are articles whose metadata exist in both DBLP and MAG and whose instances were correctly linked **
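A sketch of the evaluation arithmetic: precision and relative recall computed from the generated "same as" links and a manually verified set of correct pairs.

```python
# Sketch of the evaluation arithmetic: precision and (relative) recall of the
# generated "same as" links against a manually verified set of correct pairs.
def precision_recall(found_links: set, gold_links: set) -> tuple[float, float]:
    tp = len(found_links & gold_links)              # correctly linked pairs (TP)
    precision = tp / len(found_links) if found_links else 0.0
    recall = tp / len(gold_links) if gold_links else 0.0
    return precision, recall
```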
 
6.3 Performance Evaluation of the Mapping Process Scheduler and Linking
 
7 Conclusions and Future Work
 
We showed the capability of parallelization in rule-based data mappings, and we presented how semantic similarity measures are applied to determine the relatedness of the concepts in two resources in terms of their RDF interlinking structure.
 
Results of the empirical evaluation suggest that the integration approach pursued by the SCM-KG pipeline is able to effectively integrate pieces of information spread across different data sources. The experiments suggest that the rule-based mapping, together with the semantic-structure-based instance matching technique implemented in the SCM-KG pipeline, integrates data into a knowledge graph with high accuracy.
  
