Paper: The Microsoft Academic Knowledge Graph (MAKG): A Linked Data Source with 8 Billion Triples of Scholarly Data @ ISWC'19

The MAKG provides the Microsoft Academic Graph (MAG) as an RDF knowledge graph, both as RDF files (N-Triples format) and as a data source on the Web, through a SPARQL endpoint with HTTP-resolvable URIs. It was enriched by reusing common vocabularies, and its resources are linked to other data sources on the Web, such as DBpedia, Wikidata, OpenCitations, and the Global Research Identifier Database (GRID). It is classified as a 5-star data set according to Tim Berners-Lee’s deployment scheme for Open Data.

The data set is licensed under the Open Data Commons Attribution License (ODC-By).

All relevant data to be modeled in RDF takes about 350 GB of disk space (input). 

MAG dump of November 2018
8,272,187,245 RDF triples (output)
1.2 TB of disk space for the uncompressed RDF files (output)
Virtuoso: Indexing the data requires about 514 GB of disk space and takes about 10 hours on a machine with 256 GB of RAM.
On the schema level, the MAKG contains 47 properties and 13 entity types (with 8 entity types being in the namespace http://ma-graph.org).
6,706 institute representations linked to the corresponding DBpedia concepts,
15,530 conference instances to the corresponding Wikipedia articles, and
18,673 affiliations to the corresponding GRID URIs. 
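Because the dump is distributed as N-Triples files (about 1.2 TB uncompressed), it is easiest to process line by line rather than load it into memory. The following is a minimal sketch of that idea, assuming a simplified line pattern rather than the full N-Triples grammar; the sample triples and the entity URIs in them are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Simplified N-Triples pattern: IRI or blank-node subject, IRI predicate,
# and the rest of the line (IRI, blank node, or literal) as the object.
# This is a sketch, not a full parser for the N-Triples grammar.
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def parse_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object), or None for blank/comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    match = TRIPLE_RE.match(stripped)
    return match.groups() if match else None


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count how often each predicate occurs, reading one line at a time."""
    counts: Counter = Counter()
    for line in lines:
        triple = parse_line(line)
        if triple is not None:
            counts[triple[1]] += 1
    return counts


# Hypothetical sample lines (not taken from the real dump).
SAMPLE = [
    '<https://makg.org/entity/1> <http://xmlns.com/foaf/0.1/name> "Example A" .',
    '<https://makg.org/entity/2> <http://xmlns.com/foaf/0.1/name> "Example B" .',
    '# a comment line',
    '<https://makg.org/entity/1> <http://www.w3.org/2002/07/owl#sameAs> '
    '<http://dbpedia.org/resource/Example> .',
]

print(count_predicates(SAMPLE).most_common())
```

For a real file, `count_predicates(open(path, encoding="utf-8"))` would stream it in the same way, since a file object yields one line at a time.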

Data Citation

Färber, Michael. (2021). The Microsoft Academic Graph in RDF: A Linked Data Source with 8 Billion Triples of Scholarly Data (Version 2020-06-19) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4617285 

Site  -> https://makg.org/ 

Embeddings for the MAKG entities are available too -> https://makg.org/entity-embeddings/

Entity embeddings have proven to be useful as implicit knowledge representations in a variety of scenarios. Because the MAKG is available in RDF, we applied RDF2Vec [16] to the MAKG using the skip-gram model, a window size of 5, 128 dimensions, and 10 epochs of training. The training was performed on a machine with 500 GB of RAM and 64 cores. The resulting embedding vectors for all 210 million papers in the MAKG (310 GB uncompressed, 93 GB compressed) are linked on our website.
[16] Ristoski, P., Paulheim, H.: RDF2Vec: RDF Graph Embeddings for Data Mining. In: Proceedings of the 15th International Semantic Web Conference. ISWC’16 (2016) 498–514 -> http://rdf2vec.org/
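To make the RDF2Vec setup more concrete: the approach turns a graph into sequences by walking it, then trains a word2vec-style model on those sequences. The sketch below illustrates only the data preparation (random walks and the skip-gram pairs a window of 5 would produce) on a tiny made-up graph. The node names, predicate names, walk depth, and number of walks are all assumptions for illustration, not the settings used for the MAKG.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy graph: subject -> list of (predicate, object) edges.
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "paper1": [("cites", "paper2"), ("creator", "author1")],
    "paper2": [("cites", "paper3")],
    "paper3": [],
    "author1": [("memberOf", "affiliation1")],
    "affiliation1": [],
}


def random_walk(graph, start: str, depth: int, rng: random.Random) -> List[str]:
    """Follow up to `depth` outgoing edges, recording predicates and nodes."""
    walk = [start]
    node = start
    for _ in range(depth):
        edges = graph.get(node, [])
        if not edges:
            break  # dead end: stop the walk early
        predicate, node = rng.choice(edges)
        walk.extend([predicate, node])
    return walk


def skipgram_pairs(walk: List[str], window: int = 5) -> List[Tuple[str, str]]:
    """(center, context) pairs that a skip-gram model would be trained on."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


rng = random.Random(0)
# Two walks of depth 2 from every node (both values chosen arbitrarily).
walks = [random_walk(GRAPH, node, depth=2, rng=rng) for node in GRAPH for _ in range(2)]
pairs = [p for w in walks for p in skipgram_pairs(w, window=5)]
print(len(walks), "walks,", len(pairs), "training pairs")
```

In a full pipeline the walks would be passed to a word2vec implementation configured with the skip-gram model and the desired vector size; that training step is omitted here.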

Older version:

    Trained RDF2Vec entity embeddings (93GB) for all publications of the MAKG (version: 2018-11-09; model: skip-gram, dim: 128, min count: 5, window: 5, epochs: 5)

Current version:

    Trained ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences of the MAKG (MAKG version: 2020-06-19, dim: 100, batch size: 1000, neg. sampling: 1000)
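As a reminder of what a ComplEx embedding encodes: the model scores a triple (head, relation, tail) as the real part of the trilinear product of the head vector, the relation vector, and the complex conjugate of the tail vector. The sketch below implements only that scoring function on made-up two-dimensional vectors. How the released files store the vectors (for example, whether "dim: 100" counts complex or real components) is not stated above, so loading them is left out.

```python
from typing import Sequence


def complex_score(
    head: Sequence[complex],
    relation: Sequence[complex],
    tail: Sequence[complex],
) -> float:
    """ComplEx score: Re( sum_k head_k * relation_k * conj(tail_k) )."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("all embeddings must have the same dimension")
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# Made-up vectors, only to show that the score is not symmetric in head/tail.
paper = [1 + 1j, 2 + 0j]
cites = [0 + 1j, 1 + 0j]
other = [1 - 1j, 0.5 + 0j]

print("score(paper, cites, other) =", complex_score(paper, cites, other))
print("score(other, cites, paper) =", complex_score(other, cites, paper))
```

The conjugate on the tail is what lets the model give different scores to (a, r, b) and (b, r, a), which matters for directed relations such as citations.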

Example query to retrieve the Fields of Study (FoS) topics

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Fields of study with their level, related topics, and external links
SELECT ?topic ?level ?relatedtopic ?WikiPedia ?DBPedia
WHERE {
  ?field rdf:type <https://makg.org/class/FieldOfStudy> .
  ?field foaf:name ?topic .
  ?field <https://makg.org/property/level> ?level .
  # related field of study and its name
  ?field <https://makg.org/property/isRelatedTo> ?related .
  ?related foaf:name ?relatedtopic .
  # links to Wikipedia (rdfs:seeAlso) and DBpedia (owl:sameAs)
  ?field <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?WikiPedia .
  ?field <http://www.w3.org/2002/07/owl#sameAs> ?DBPedia .
}
LIMIT 100
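A query like this can also be sent from a script using the standard SPARQL protocol (an HTTP GET with a `query` parameter). The following is a minimal sketch using only the Python standard library. The endpoint URL `https://makg.org/sparql` is an assumption, since it is not given in these notes, and the query is a shortened variant of the one above. The network call is left commented out so that the sketch runs offline.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://makg.org/sparql"  # assumption: endpoint URL not stated in the notes

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?topic ?level
WHERE {
  ?field a <https://makg.org/class/FieldOfStudy> ;
         foaf:name ?topic ;
         <https://makg.org/property/level> ?level .
}
LIMIT 10
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str, timeout: float = 60.0) -> list:
    """Send the query and flatten the JSON bindings into plain dicts."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        payload = json.load(resp)
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]


request = build_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url[:40] + "...")

# Network call, commented out on purpose:
# for row in run_query(ENDPOINT, QUERY):
#     print(row)
```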

URI Resolution (currently returning an error!)

The resources of the Microsoft Academic Knowledge Graph are resolvable via HTTP and content negotiation. In this way, the knowledge graph is part of the Linked Open Data cloud.

Examples for URI resolution with curl:

curl -H "Accept:text/nt" https://makg.org/entity/2826592117
curl -H "Accept:text/n3" https://makg.org/entity/2826592117
curl -H "Accept:text/ttl" https://makg.org/entity/2826592117
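The same content negotiation can be scripted. The sketch below builds one request per serialization using the registered media types (`application/n-triples`, `text/n3`, `text/turtle`) instead of the short forms in the curl commands. Whether the server honors these types, or whether this has anything to do with the error noted above, has not been verified here. The actual fetch is left as an uncalled helper so that the sketch runs offline.

```python
import urllib.request

ENTITY_URI = "https://makg.org/entity/2826592117"  # taken from the curl examples

# Registered media types for the three serializations requested above.
MEDIA_TYPES = {
    "N-Triples": "application/n-triples",
    "N3": "text/n3",
    "Turtle": "text/turtle",
}


def negotiation_request(uri: str, media_type: str) -> urllib.request.Request:
    """Build a GET request that asks for one specific RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch(uri: str, media_type: str, timeout: float = 30.0) -> str:
    """Resolve the URI and return the response body (performs a network call)."""
    with urllib.request.urlopen(negotiation_request(uri, media_type), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


for label, media_type in MEDIA_TYPES.items():
    req = negotiation_request(ENTITY_URI, media_type)
    print(f"{label}: Accept={req.get_header('Accept')} -> {req.full_url}")

# Example network call, commented out on purpose:
# print(fetch(ENTITY_URI, MEDIA_TYPES["Turtle"])[:500])
```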

Data analytics tasks:

  1. entity-centric exploration of papers, researchers, affiliations, etc.;
  2. easier data integration through the use of RDF as a common data model and by linking resources to other data sources;
  3. data analysis and knowledge discovery of scholarly data (e.g., measuring the popularity of papers and authors; recommending papers, researchers, and venues; and analyzing the evolution of topics over time).

GitHub repository of the RDF converter (in Java) -> https://github.com/michaelfaerber/MAG2RDF

Ontologies used:

mag        http://ma-graph.org/
foaf       http://xmlns.com/foaf/0.1/
sioc       http://rdfs.org/sioc/ns#
dcterms    http://purl.org/dc/terms/
tl         http://purl.org/NET/c4dm/timeline.owl#
dbo        http://dbpedia.org/ontology/
frbr       http://purl.org/vocab/frbr/core#
fabio      http://purl.org/spar/fabio/
cito       http://purl.org/spar/cito/
datacite   http://purl.org/spar/datacite/
prism      http://prismstandard.org/namespaces/1.2/basic/
c4o        http://purl.org/spar/c4o/
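The prefix table above can be kept as a small lookup when working with the data, for example to expand prefixed names into full IRIs or to generate the PREFIX header of a SPARQL query. This is a minimal sketch. Note that it uses `http://purl.org/spar/cito/` for CiTo, which is the namespace under which the SPAR ontologies are published. The local names passed to `expand` below are only examples.

```python
# Prefix -> namespace IRI, following the table above.
PREFIXES = {
    "mag": "http://ma-graph.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "tl": "http://purl.org/NET/c4dm/timeline.owl#",
    "dbo": "http://dbpedia.org/ontology/",
    "frbr": "http://purl.org/vocab/frbr/core#",
    "fabio": "http://purl.org/spar/fabio/",
    "cito": "http://purl.org/spar/cito/",
    "datacite": "http://purl.org/spar/datacite/",
    "prism": "http://prismstandard.org/namespaces/1.2/basic/",
    "c4o": "http://purl.org/spar/c4o/",
}


def expand(prefixed_name: str) -> str:
    """Expand e.g. 'foaf:name' into its full IRI using PREFIXES."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local


def sparql_prefix_block() -> str:
    """Render all prefixes as SPARQL PREFIX declarations."""
    return "\n".join(f"PREFIX {p}: <{ns}>" for p, ns in PREFIXES.items())


print(expand("foaf:name"))
print(expand("cito:cites"))
print(sparql_prefix_block().splitlines()[0])
```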

Semantic Publishing and Referencing (SPAR) ontologies [10], such as FaBiO, CiTo, PRISM, and C4O.
[10] Peroni, S., Shotton, D.M.: The SPAR Ontologies. In: Proceedings of the 17th International Semantic Web Conference. ISWC’18 (2018) 119–136

https://makg.org/wp-content/uploads/2021/03/mag-rdf-schema-2021-03-2.png

Document type of each publication was modelled according to the document types covered in the FaBiO ontology.

The URL at which each paper is available online is an attribute of each paper but the URLs provided in the MAG dump often do not link to the papers directly but to the landing pages provided by the papers’ publishers.

For each affiliation, the name as a literal, a link to the institution’s GRID identifier, a link to the institution’s official homepage, and a link to the English Wikipedia article describing the institution were included. In particular, the links between the affiliations and the Global Research Identifier Database (GRID) identifiers are noteworthy. Because the GRID is part of the Linked Open Data cloud and because GRID URIs of the form http://www.grid.ac/institutes/grid.446382.f are resolvable via HTTP, we transformed the pure GRID identifiers into URIs by adding the URI prefix.
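The identifier-to-URI step described above amounts to a string concatenation. A minimal sketch follows, assuming the prefix shown in the example URI. The validation rule and the predicate IRI in the second helper are placeholders for illustration, not the property actually used in the MAKG.

```python
GRID_URI_PREFIX = "http://www.grid.ac/institutes/"


def grid_uri(grid_id: str) -> str:
    """Turn a bare GRID identifier such as 'grid.446382.f' into a URI."""
    grid_id = grid_id.strip()
    if not grid_id.startswith("grid."):
        # Assumed sanity check; the notes do not describe any validation.
        raise ValueError(f"not a GRID identifier: {grid_id!r}")
    return GRID_URI_PREFIX + grid_id


def grid_link_triple(affiliation_uri: str, grid_id: str,
                     predicate: str = "http://example.org/hasGrid") -> str:
    """Render the link as one N-Triples line (predicate IRI is hypothetical)."""
    return f"<{affiliation_uri}> <{predicate}> <{grid_uri(grid_id)}> ."


print(grid_uri("grid.446382.f"))
print(grid_link_triple("https://makg.org/entity/123", "grid.446382.f"))
```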

For a better integration of the MAKG as a data source into the Linked Open Data cloud, we transform the strings with the conference location (typically city names with their country, such as “Oslo, Norway”) into DBpedia URIs. In order to ensure a well performing word-sense-disambiguation, we use the state-of-the-art text annotation tool x-LiSA [11]. Because DBpedia is very rich in terms of cities, we obtained URIs for almost all locations (namely 15,530).

Creating owl:sameAs Statements
In addition to the MAKG core data set outlined so far, we linked instances of the MAKG to instances of OpenCitations and Wikidata. The mappings were created by matching the papers’ digital object identifiers (DOIs).
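DOI-based matching of this kind can be sketched as a join on a normalized DOI string. The example below is one possible way to do it, not the authors' procedure. The normalization rules (lowercasing, stripping resolver prefixes), the input dictionaries, and all URIs in them are assumptions made for illustration.

```python
from typing import Dict, Iterator

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Common ways a DOI is written with a resolver or scheme prefix.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: str) -> str:
    """Lowercase the DOI and strip a leading resolver/scheme prefix, if any."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def same_as_triples(makg_dois: Dict[str, str], other_dois: Dict[str, str]) -> Iterator[str]:
    """Yield one owl:sameAs N-Triples line per pair of URIs sharing a DOI."""
    index = {normalize_doi(doi): uri for uri, doi in other_dois.items()}
    for uri, doi in makg_dois.items():
        match = index.get(normalize_doi(doi))
        if match is not None:
            yield f"<{uri}> <{OWL_SAME_AS}> <{match}> ."


# Hypothetical inputs: entity URI -> DOI as found in each source.
makg = {
    "https://makg.org/entity/1": "10.1000/ABC",
    "https://makg.org/entity/2": "10.1000/xyz",
}
external = {
    "http://example.org/external/A": "https://doi.org/10.1000/abc",
}

for line in same_as_triples(makg, external):
    print(line)
```

If two external records shared the same DOI, this sketch would keep only the last one in the index; a production mapping would need an explicit policy for such duplicates.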

Aside from the MAG RDF documents, we provide the following linked dataset descriptions (all available at http://ma-graph.org/):
OWL: We provide our ontology as an OWL file describing the used classes, object properties, and data type properties.
VOAF: We enrich our ontology with Vocabulary-of-a-Friend (VOAF) descriptors. VOAF is an extension of VoID for linking the ontology to other vocabularies and for introducing the vocabulary to the Linked Open Data community.
VoID: We provide a VoID file to describe our linked data set with an RDF schema vocabulary.

VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data, with applications ranging from data discovery to cataloging and archiving of datasets. This document is a detailed guide to the VoID vocabulary. It describes how VoID can be used to express general metadata based on Dublin Core, access metadata, structural metadata, and links between datasets. It also provides deployment advice and discusses the discovery of VoID descriptions.

VOAF (Vocabulary of a Friend) is a vocabulary specification providing elements allowing the description of vocabularies (RDFS vocabularies or OWL ontologies) used in the Linked Data Cloud. In particular it provides properties expressing the different ways such vocabularies can rely on, extend, specify, annotate or otherwise link to each other. It relies itself on Dublin Core and voiD. The name of the vocabulary makes an explicit reference to FOAF because VOAF can be used to define networks of vocabularies in a way similar to the one FOAF is used to define networks of people.

Tim Berners-Lee’s 5-star deployment scheme for Open Data:
Our MAKG RDF data set is a 5-star data set according to this scheme, because we provide our data set in RDF (leading to 4 stars) and link (1) entity URIs to DBpedia, Wikidata, OpenCitations, and GRID, and (2) our vocabulary URIs to other vocabularies (leading to 5 stars). A separate rating is intended to rate the use of vocabulary within Linked (Open) Data. By providing an OWL file, by linking our vocabulary to other vocabularies (see the SPAR ontologies), and by creating a VOAF file, we are able to provide the vocabulary with 4 stars.

5-star steps by example

The MAKG can be considered a central data hub for credibility in the linked data context, because it contains metadata about papers (and their authors) that state claims. Claims and crucial concepts mentioned in text documents (e.g., papers’ full texts) can be linked to papers and authors in the MAKG to substantiate them [18].

Using the MAKG for Natural Language Processing Tasks.
Citation-based tasks, such as citation recommendation, often depend on natural language processing and require implicit or explicit representations of papers, researchers, and institutions. In the case of the MAKG, embeddings for papers and other entities can easily be generated using existing methods for RDF graph embeddings.
Entity linking describes the task of linking phrases in a text to knowledge graph entities. It has shown several advantages compared to traditional text mining and information retrieval approaches. Consequently, MAKG entities, such as the fields of study and the authors, can be used as the basis for annotating texts (e.g., annotating scientific texts with scientific concepts [19]).
Furthermore, using the MAKG, semantic search systems can be developed [20] that are superior to bag-of-words models.

Using the MAKG for Digital Library Tasks.
So far, the MAG has been used, among other ways, for citation analysis [21] and for impact analysis of papers and researchers [22,23]. The original MAG data has also been combined with AMiner data to form the Open Citation Graph. In the future, Linked Open Data-based recommender systems that recommend papers or citations can use the MAKG as an underlying database.
Furthermore, one can envision that the working style of researchers will considerably change in the next few decades [24,25]. For instance, publications might not be published in PDF format any more, but in either an annotated version of it (with information about the claims, the used methods, the data sets, the evaluation results, and so on) or in the form of a flexible publication form, in which authors can change the content and, in particular, citations, over time. The MAKG can be combined with such new structured data sets easily due to its RDF data format.

Using the MAKG for Benchmarking.
Because the MAKG is large in size (over 1 TB in N-Triples format), contains various kinds of information (e.g., papers, authors, institutions, and venues as well as various data types), has uncertainty in the data, and is updated periodically, the MAKG data fulfills the “4 V’s” of big data very well. Thus, the MAKG may also be suitable for evaluating methods and benchmarking systems.


Comments

  1. A presentation on MAS, MAG, and MAKE was given to the BioBD group.

  2. The MAKG has ID-based mappings to Wikidata. Wikidata has an identifier-type property that records the MAG identifier of an entity in the graph.

  3. I am downloading the embeddings generated with ComplEx and will post a question about reification.

  4. Research page of the MAKG author -> https://sites.google.com/view/michaelfaerber/research

    Interesting recommender-system project on scholarly data -> https://sites.google.com/view/michaelfaerber/projects#h.ldbvbohudst5

    Zenodo record for the FAIRnets data set -> https://zenodo.org/record/3885249#.YXct1Rxv-00
    Making Neural Networks FAIR .... FAIRnets Knowledge Graph, a large RDF data set with information about neural networks. The data set is based on neural networks published on GitHub using Keras as framework.

    Data Set Knowledge Graph (DSKG), an RDF data set about data sets which are linked to publications that mention the data sets. The metadata of the data sets are modeled in the standard vocabulary DCAT and are based on data sets registered in OpenAIRE and Wikidata.
    http://dskg.org/
