
Article: The Microsoft Academic Knowledge Graph (MAKG): A Linked Data Source with 8 Billion Triples of Scholarly Data @ ISWC'19

The MAKG is the MAG provisioned as an RDF knowledge graph, both in the form of RDF files (N-Triples format) and as a data source on the Web, through a SPARQL endpoint, with HTTP-resolvable URIs. It was enriched by reusing common vocabularies, and its resources are linked to other data sources on the Web, such as DBpedia, Wikidata, OpenCitations, and the Global Research Identifier Database (GRID). It is rated 5-star according to Tim Berners-Lee's deployment scheme for Open Data.

The data set is licensed under the Open Data Commons Attribution License (ODC-By).

Input: all relevant data to be modeled in RDF takes about 350 GB of disk space.

Output (MAG dump of November 2018):
8,272,187,245 RDF triples
1.2 TB of disk space for the uncompressed RDF files

Virtuoso: indexing the data requires about 514 GB of disk space and 256 GB of RAM, and takes about 10 hours.
On the schema level, the MAKG contains 47 properties and 13 entity types (with 8 entity types being in the namespace http://ma-graph.org).
Links to external sources:
6,706 institute representations linked to the corresponding DBpedia concepts,
15,530 conference instances linked to the corresponding Wikipedia articles, and
18,673 affiliations linked to the corresponding GRID URIs.
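
A quick way to inspect those 13 entity types directly on the SPARQL endpoint is to list every rdf:type with its instance count. This is only a sketch: on roughly 8 billion triples such an aggregation may time out, and depending on the MAKG version the class URIs are served under http://ma-graph.org/class/ or https://makg.org/class/.

# List entity types and how many instances each has (may be slow on the full graph)
SELECT ?type (COUNT(?s) AS ?instances)
WHERE {
  ?s a ?type .
}
GROUP BY ?type
ORDER BY DESC(?instances)
LIMIT 20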

Data Citation

Färber, Michael. (2021). The Microsoft Academic Graph in RDF: A Linked Data Source with 8 Billion Triples of Scholarly Data (Version 2020-06-19) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4617285 

Site  -> https://makg.org/ 

Embeddings for the MAKG entities are available too -> https://makg.org/entity-embeddings/

Entity embeddings have proven to be useful as implicit knowledge representations in a variety of scenarios. Because the MAKG is available in RDF, we applied RDF2Vec [16] to the MAKG using the skip-gram model, a window size of 5, 128 dimensions, and 10 epochs of training. The training was performed on a machine with 500 GB of RAM and 64 cores. The resulting embedding vectors for all 210 million papers in the MAKG (310 GB uncompressed, 93 GB compressed) are linked on our website.
[16] Ristoski, P., Paulheim, H.: RDF2Vec: RDF Graph Embeddings for Data Mining. In: Proceedings of the 15th International Semantic Web Conference. ISWC'16 (2016) 498–514 -> http://rdf2vec.org/

Older version:

    Trained RDF2Vec entity embeddings (93 GB) for all publications of the MAKG (version: 2018-11-09; model: skip-gram, dim: 128, min count: 5, window: 5, epochs: 5)

Current version:

    Trained ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences of the MAKG (MAKG version: 2020-06-19, dim: 100, batch size: 1000, neg. sampling: 1000)

Example query for retrieving the Fields of Study (FoS) topics

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?topic ?level ?relatedtopic ?WikiPedia ?DBPedia
WHERE {
  ?field rdf:type <https://makg.org/class/FieldOfStudy> .
  ?field foaf:name ?topic .
  ?field <https://makg.org/property/level> ?level .
  ?field <https://makg.org/property/isRelatedTo> ?related .
  ?related foaf:name ?relatedtopic .
  ?field rdfs:seeAlso ?WikiPedia .
  ?field owl:sameAs ?DBPedia .
}
LIMIT 100

URI Resolution (note: currently returning an error!)

The resources of the Microsoft Academic Knowledge Graph are resolvable via HTTP and content negotiation. In this way, the knowledge graph is part of the Linked Open Data cloud.

Examples for URI resolution with curl:

curl -H "Accept:text/nt" https://makg.org/entity/2826592117
curl -H "Accept:text/n3" https://makg.org/entity/2826592117
curl -H "Accept:text/ttl" https://makg.org/entity/2826592117

Data analytics tasks:

  1. entity-centric exploration of papers, researchers, affiliations, etc.;
  2. easier data integration through the use of RDF as a common data model and by linking resources to other data sources; and
  3. data analysis and knowledge discovery of scholarly data (e.g., measuring the popularity of papers and authors; recommending papers, researchers, and venues; and analyzing the evolution of topics over time); see the query sketch after this list.
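
As an illustration of item 3, here is a minimal sketch of a popularity-style query that counts papers per author. The use of dcterms:creator as the paper-to-author property and foaf:name for author names is an assumption based on the vocabulary list further below, and an aggregation of this size may need to be narrowed (for example, to one field of study) to finish on the public endpoint.

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Top authors by number of papers (assumed properties: dcterms:creator, foaf:name)
SELECT ?author ?name (COUNT(?paper) AS ?papers)
WHERE {
  ?paper dcterms:creator ?author .
  ?author foaf:name ?name .
}
GROUP BY ?author ?name
ORDER BY DESC(?papers)
LIMIT 10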

GitHub repository of the RDF converter (in Java) -> https://github.com/michaelfaerber/MAG2RDF

Ontologies/vocabularies used:

mag        http://ma-graph.org/
foaf       http://xmlns.com/foaf/0.1/
sioc       http://rdfs.org/sioc/ns#
dcterms    http://purl.org/dc/terms/
tl         http://purl.org/NET/c4dm/timeline.owl#
dbo        http://dbpedia.org/ontology/
frbr       http://purl.org/vocab/frbr/core#
fabio      http://purl.org/spar/fabio/
cito       http://purl.org/spar/cito/
datacite   http://purl.org/spar/datacite/
prism      http://prismstandard.org/namespaces/1.2/basic/
c4o        http://purl.org/spar/c4o/

The MAKG reuses the Semantic Publishing and Referencing (SPAR) ontologies [10], such as FaBiO, CiTO, PRISM, and C4O.
[10] Peroni, S., Shotton, D.M.: The SPAR Ontologies. In: Proceedings of the 17th International Semantic Web Conference. ISWC'18 (2018) 119–136

MAKG RDF schema diagram -> https://makg.org/wp-content/uploads/2021/03/mag-rdf-schema-2021-03-2.png

The document type of each publication was modeled according to the document types covered in the FaBiO ontology.

The URL at which each paper is available online is an attribute of each paper; however, the URLs provided in the MAG dump often do not link to the papers directly, but to landing pages provided by the papers' publishers.

The affiliation's name as a literal, a link to the institution's GRID identifier, a link to the institution's official homepage, and a link to the English Wikipedia article describing the institution were included. In particular, the links between the affiliations and the Global Research Identifier Database (GRID) identifiers are noteworthy. Because GRID is part of the Linked Open Data cloud and because GRID URIs of the form http://www.grid.ac/institutes/grid.446382.f are resolvable via HTTP, we transformed the pure GRID identifiers into URIs by adding the URI prefix.
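
The prefixing step itself is trivial; as an illustration only (not the converter's actual code), here is the same idea expressed as a SPARQL BIND, using the sample GRID identifier quoted in the sentence above:

# Turn a bare GRID identifier into a resolvable GRID URI (illustrative only)
SELECT ?gridUri
WHERE {
  VALUES ?gridId { "grid.446382.f" }
  BIND(IRI(CONCAT("http://www.grid.ac/institutes/", ?gridId)) AS ?gridUri)
}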

For better integration of the MAKG as a data source into the Linked Open Data cloud, we transform the strings with the conference locations (typically city names with their country, such as "Oslo, Norway") into DBpedia URIs. To ensure well-performing word-sense disambiguation, we use the state-of-the-art text annotation tool x-LiSA [11]. Because DBpedia is very rich in terms of cities, we obtained URIs for almost all locations (namely 15,530).

Creating owl:sameAs Statements
In addition to the MAKG core data set outlined so far, we linked instances of the MAKG to instances of OpenCitations and Wikidata. The mappings were created by matching the papers' digital object identifiers (DOIs).
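
A sketch for retrieving those owl:sameAs links together with the DOI they were matched on. The Paper class URI follows the pattern of the FieldOfStudy query above, and the use of prism:doi for DOIs is an assumption based on the vocabulary list above:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX prism: <http://prismstandard.org/namespaces/1.2/basic/>

# Papers with their DOI and their owl:sameAs links (e.g., to Wikidata or OpenCitations)
SELECT ?paper ?doi ?sameAs
WHERE {
  ?paper a <https://makg.org/class/Paper> ;
         prism:doi ?doi ;
         owl:sameAs ?sameAs .
}
LIMIT 20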

Aside from the MAG RDF documents, we provide the following linked dataset descriptions (all available at http://ma-graph.org/):
OWL: We provide our ontology as an OWL file describing the used classes, object properties, and data type properties.
VOAF: We enrich our ontology with Vocabulary-of-a-Friend (VOAF) descriptors. VOAF is an extension of VoID for linking the ontology to other vocabularies and for introducing the vocabulary to the Linked Open Data community.
VoID: We provide a VoID file to describe our linked data set with an RDF schema vocabulary.

VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data, with applications ranging from data discovery to cataloging and archiving of datasets. The VoID specification describes how VoID can be used to express general metadata based on Dublin Core, access metadata, structural metadata, and links between datasets; it also provides deployment advice and discusses the discovery of VoID descriptions.

VOAF (Vocabulary of a Friend) is a vocabulary specification providing elements for describing vocabularies (RDFS vocabularies or OWL ontologies) used in the Linked Data Cloud. In particular, it provides properties expressing the different ways such vocabularies can rely on, extend, specify, annotate, or otherwise link to each other. It itself relies on Dublin Core and VoID. The name makes an explicit reference to FOAF, because VOAF can be used to define networks of vocabularies in a way similar to how FOAF is used to define networks of people.

Tim Berners-Lee’s 5-star deployment scheme for Open Data:
Our MAKG RDF data set is a 5-star data set according to this scheme, because we provide our data set in RDF (leading to 4 stars) and link (1) entity URIs to DBpedia, Wikidata, OpenCitations, and GRID, and (2) our vocabulary URIs to other vocabularies (leading to 5 stars). A separate rating is intended to rate the use of vocabulary within Linked (Open) Data: by providing an OWL file, by linking our vocabulary to other vocabularies (see the SPAR ontologies), and by creating a VOAF file, we are able to provide the vocabulary with 4 stars.

5-star steps by example

The MAKG can be considered a central data hub for credibility in the linked data context, because it contains metadata about papers (and their authors) that state claims. Claims and crucial concepts mentioned in text documents (e.g., papers’ full texts) can be linked to papers and authors in the MAKG to substantiate them [18].

Using the MAKG for Natural Language Processing Tasks.
Citation-based tasks, such as citation recommendation, often depend on natural language processing and require implicit or explicit representations of papers, researchers, and institutions. In the case of the MAKG, embeddings for papers and other entities can easily be generated using existing methods for RDF graph embeddings.
Entity linking describes the task of linking phrases in a text to knowledge graph entities. It has shown several advantages compared to traditional text mining and information retrieval approaches. Consequently, MAKG entities, such as the fields of study and the authors, can be used as the basis for annotating texts (e.g., annotating scientific texts with scientific concepts [19]).
Furthermore, using the MAKG, semantic search systems can be developed [20] that are superior to bag-of-words models.
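
As a sketch of the lookup step behind such annotation, the following query resolves a surface form to FieldOfStudy entities by label. The example string and the exact literal form (casing, possible language tags) are assumptions:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Find fields of study whose label matches a given surface form (example form assumed: "machine learning")
SELECT ?field ?name ?level
WHERE {
  ?field a <https://makg.org/class/FieldOfStudy> ;
         foaf:name ?name ;
         <https://makg.org/property/level> ?level .
  FILTER(LCASE(STR(?name)) = "machine learning")
}
LIMIT 10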

Using the MAKG for Digital Library Tasks.
So far, the MAG has been used, among other ways, for citation analysis [21] and for impact analysis of papers and researchers [22,23]. The original MAG data has also been combined with AMiner data to form the Open Academic Graph. In the future, Linked Open Data-based recommender systems that recommend papers or citations can use the MAKG as an underlying database.
Furthermore, one can envision that the working style of researchers will change considerably in the next few decades [24,25]. For instance, publications might no longer be published in PDF format, but either in an annotated version of it (with information about the claims, the methods used, the data sets, the evaluation results, and so on) or in a flexible publication format in which authors can change the content and, in particular, the citations, over time. The MAKG can easily be combined with such new structured data sets due to its RDF data format.

Using the MAKG for Benchmarking.
Because the MAKG is large in size (over 1 TB in N-Triples format), contains various kinds of information (e.g., papers, authors, institutions, and venues as well as various data types), has uncertainty in the data, and is updated periodically, the MAKG data fulfills the "4 V's" of big data very well. Thus, the MAKG may also be suitable for evaluating methods and benchmarking systems.


Comments

  1. A presentation on MAS, MAG, and MAKG was given to the BioBD group.

  2. MAKG has mappings to Wikidata based on the MAG ID. Wikidata has an identifier-type property that allows the MAG identifier of a graph entity to be recorded (see the query sketch after these comments).

  3. Downloading the embeddings generated with ComplEx; I will post a question about reification.

  4. Research page of the MAKG author -> https://sites.google.com/view/michaelfaerber/research

    Interesting project on a recommender system for scholarly data -> https://sites.google.com/view/michaelfaerber/projects#h.ldbvbohudst5

    Zenodo record for the FAIRnets data set -> https://zenodo.org/record/3885249#.YXct1Rxv-00
    Making Neural Networks FAIR: the FAIRnets Knowledge Graph is a large RDF data set with information about neural networks. The data set is based on neural networks published on GitHub using Keras as the framework.

    The Data Set Knowledge Graph (DSKG) is an RDF data set about data sets, which are linked to the publications that mention them. The metadata of the data sets is modeled in the standard DCAT vocabulary and is based on data sets registered in OpenAIRE and Wikidata.
    http://dskg.org/

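Regarding comment 2, here is a sketch against the Wikidata SPARQL endpoint (https://query.wikidata.org/sparql) that lists entities carrying a Microsoft Academic ID; to the best of my knowledge the property meant in that comment is P6366:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Wikidata entities that record a Microsoft Academic (MAG) identifier (assumed property: P6366)
SELECT ?item ?magId
WHERE {
  ?item wdt:P6366 ?magId .
}
LIMIT 10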

