
AceKG: A Large-scale Knowledge Graph for Academic Data Mining - Paper Reading (State of the Art)

Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., Zhang, W., & Wang, X. (2018). AceKG: A Large-scale Knowledge Graph for Academic Data Mining. Proceedings of the 27th ACM International Conference on Information and Knowledge Management.

Abstract
Most existing knowledge graphs (KGs) in academic domains suffer from problems of insufficient multi-relational information, name ambiguity and improper data format for large-scale machine processing.

In this paper, we present AceKG, a new large-scale KG in academic domain. AceKG not only provides clean academic information, but also offers a large-scale benchmark dataset for researchers to conduct challenging data mining projects including link prediction, community detection and scholar classification. Specifically, AceKG describes 3.13 billion triples of academic facts based on a consistent ontology, including necessary properties of papers, authors, fields of study, venues and institutes, as well as the relations among them. To enrich the proposed knowledge graph, we also perform entity alignment with existing databases and rule-based inference.

Based on AceKG, we conduct experiments of three typical academic data mining tasks and evaluate several state-of-the-art knowledge embedding and network representation learning approaches on the benchmark datasets built from AceKG. Finally, we discuss promising research directions that benefit from AceKG.

[The contributions are the KG itself, its ontology with disambiguated and aligned instances, and the empirical results of using it in ML tasks (prediction, classification and clustering)]

1 INTRODUCTION

... A knowledge graph, which describes and stores facts as triples, is a multi-relational graph consisting of entities as nodes and relations as different types of edges. ...

[Multi-relational here means a multigraph, not a hyper-relational graph; facts are only triples]
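To make the multigraph reading concrete, here is a minimal sketch of facts stored as (head, relation, tail) triples and indexed by relation type. The entity identifiers and the relation names `paper_is_in_field` and `paper_cites` are hypothetical, chosen for illustration rather than taken from the AceKG ontology.

```python
from collections import defaultdict

# Hypothetical facts; identifiers and relation names are illustrative only.
triples = [
    ("paper_1", "paper_is_written_by", "author_1"),
    ("paper_1", "paper_is_in_field", "field_ml"),
    ("paper_2", "paper_is_written_by", "author_1"),
    ("paper_2", "paper_cites", "paper_1"),
    ("paper_2", "paper_is_in_field", "field_ml"),
]

# Index: head -> relation -> set of tails.
# Differently typed edges may connect the same pair of nodes (multigraph),
# but every fact has exactly two participants (no hyper-relational facts).
graph = defaultdict(lambda: defaultdict(set))
for head, relation, tail in triples:
    graph[head][relation].add(tail)


def objects(head, relation):
    """Return the tails linked to `head` by `relation` (empty set if none)."""
    return graph.get(head, {}).get(relation, set())
```

For example, `objects("paper_2", "paper_cites")` returns the set containing `"paper_1"`, while a fact with more than two participants would need reification or a different model.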

In this paper, we propose Academic Knowledge Graph (AceKG), an academic semantic network, which describes 3.13 billion triples of academic facts based on a consistent ontology, including commonly used properties of papers, authors, fields of study, venues, institutes and relations among them. Apart from the knowledge graph itself, we also perform entity alignment with the existing KGs or datasets and some rule-based inferences to further extend it and make it linked with other KGs in the linked open data cloud.
...

[Contributions]

AceKG is fully organized in structured triples, which is machine readable and easy to process.

[Check whether it uses reification]

2 THE KNOWLEDGE GRAPH

Link to the site where the dataset and the ontology can be downloaded -> https://archive.acemap.info/app/AceKG

2.1 Ontology

All objects (e.g., papers, institutes, authors) are represented as entities in the AceKG. Two entities can stand in a relation. Commonly used attributes of each entity including numbers, dates, strings and other literals are represented as well. Similar entities are grouped into classes. In total, AceKG defines 5 classes of academic entities: Papers, Authors, Fields of study, Venues and Institutes.

[Model similar to MAG in terms of entities; relations are binary (it is not a hypergraph)]

2.2 Entity alignment

In order to make AceKG more connected and comprehensive, we map a large part of papers in computer science of AceKG to the papers stored in IEEE, ACM and DBLP databases.

[Used unique identifiers to disambiguate author names and to link to the unique identifiers of external databases]

2.3 Inference

Rule-based inference on knowledge graph is a typical but critical way to enrich the knowledge graph.

[Indirect relations, spanning more than one hop, derived from rules]
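As an illustration of what a multi-hop rule could look like, the sketch below applies a single two-hop rule to a set of triples. The rule itself and the relation names `paper_is_in_field` and `author_is_in_field` are assumptions made for this example; the excerpt above does not list the rules AceKG actually uses.

```python
def infer_author_fields(triples):
    """Apply one hypothetical two-hop rule:

    (p, paper_is_written_by, a) and (p, paper_is_in_field, f)
        => (a, author_is_in_field, f)

    Returns only the triples that are new with respect to the input.
    """
    authors_of = {}  # paper -> set of authors
    fields_of = {}   # paper -> set of fields
    for head, relation, tail in triples:
        if relation == "paper_is_written_by":
            authors_of.setdefault(head, set()).add(tail)
        elif relation == "paper_is_in_field":
            fields_of.setdefault(head, set()).add(tail)

    inferred = set()
    for paper, authors in authors_of.items():
        for field in fields_of.get(paper, ()):
            for author in authors:
                inferred.add((author, "author_is_in_field", field))
    return inferred - set(triples)


# Illustrative input, not real AceKG data.
facts = [
    ("paper_1", "paper_is_written_by", "author_1"),
    ("paper_1", "paper_is_in_field", "field_ml"),
    ("paper_2", "paper_is_written_by", "author_2"),
]
new_facts = infer_author_fields(facts)
```

Here `new_facts` contains one triple linking `author_1` to `field_ml`; `author_2` gets nothing because `paper_2` has no field.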

3 KNOWLEDGE EMBEDDING

3.1 Task Definition

Given a set S of triples (h, r, t) composed of two entities h, t ∈ E (the set of entities) and a relation r ∈ R (the set of relationships), knowledge embedding maps each entity to a k-dimensional vector in the embedding space, and defines a scoring function to evaluate the plausibility of the triple (h, r, t) in the knowledge graph. ... We construct a new benchmark dataset (denoted as AK18K in the rest of this section) extracted from AceKG for knowledge embedding. We will show how it differs from FB15K and WN18 in Section 3.2. We compare the following algorithms in our experiments: TransE [1], TransH [18], DistMult [19], ComplEx [17], HolE [10].

[A new dataset for evaluating embedding-based link prediction algorithms in the academic data domain]
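To show what a scoring function looks like in practice, here is a minimal sketch of the TransE score, which treats a relation as a translation and rates a triple by how close h + r lands to t. The L2 norm is used here (TransE can also use L1), and the 2-dimensional vectors are hand-picked toy values, not trained embeddings.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t.

    Scores are at most zero; values closer to zero mean more plausible.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy embeddings chosen by hand for illustration (not learned).
paper = [0.0, 1.0]
written_by = [1.0, 0.0]
author_a = [1.0, 1.0]   # exactly paper + written_by
author_b = [-1.0, 0.0]  # far from paper + written_by

score_a = transe_score(paper, written_by, author_a)
score_b = transe_score(paper, written_by, author_b)
```

With these values `score_a` is higher than `score_b`, so a link predictor ranking candidate tails would place `author_a` first.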

3.2 Experimental Setup

To extract AK18K from AceKG, we first select 68 critical international venues (conferences and journals) and influential papers published on them. Then we add the triples of authors, fields and institutes. Finally, the train/valid/test datasets are divided randomly.

[How to build a dataset and reduce bias]
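The random division into train/valid/test sets can be sketched as follows. The 10% validation and test fractions and the fixed seed are assumptions for the example; the excerpt does not state the proportions used for AK18K.

```python
import random


def split_triples(triples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle triples and cut them into disjoint train/valid/test lists.

    The fractions are illustrative defaults, not the paper's values.
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Synthetic triples, only to exercise the function.
data = [(f"paper_{i}", "paper_is_written_by", f"author_{i % 7}") for i in range(100)]
train, valid, test = split_triples(data)
```

A purely random split can leave entities in the validation or test set that never appear in training, so real benchmarks often add a filtering step on top of this.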

3.3 Evaluation Results

Although 94.4% of relations in our knowledge graph are many-to-many, which works for TransH, TransE shows its advantages on modeling sparse and simple knowledge graph, while TransH fails to achieve better results. The reason may be the number of relationship types is only 7, which is small. On the other hand, HolE and ComplEx achieve the most significant performances on the other metrics, especially on hit@1 (83.8% and 75.4%) and on filtered MRR (0.482 and 0.440), which confirms their advantages on modeling antisymmetric relations because all of our relations are antisymmetric, such as field is part of and paper is written by.

[TransE scored higher than TransH because the number of relation types was small. With HolE and ComplEx, what prevailed was their ability to model asymmetric relations]
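The metrics quoted above can be computed from the rank of the correct entity among all candidates. The sketch below shows a filtered rank (other tails already known to be correct are skipped), MRR and Hits@k; the function names and the toy scores are assumptions for illustration.

```python
def filtered_rank(scores, true_tail, known_tails):
    """Rank (1 = best) of `true_tail` by score, skipping candidates that are
    other known-correct tails for the same (head, relation) pair."""
    target = scores[true_tail]
    better = sum(
        1
        for candidate, score in scores.items()
        if score > target and candidate != true_tail and candidate not in known_tails
    )
    return better + 1


def mrr(ranks):
    """Mean reciprocal rank over a list of ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy candidate scores for one query (higher = more plausible).
scores = {"author_1": 0.9, "author_2": 0.8, "author_3": 0.5}
raw = filtered_rank(scores, "author_2", known_tails=set())
filtered = filtered_rank(scores, "author_2", known_tails={"author_1", "author_2"})

ranks = [1, 2, 4, 1]  # illustrative ranks for four test triples
```

In the toy query, `author_1` is also a correct answer, so the raw rank of `author_2` is 2 but its filtered rank is 1.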

4 NETWORK REPRESENTATION LEARNING

4.1 Task Definition

Given a network G = (V, E, A), where V denotes the vertex set, E denotes the network topology structure and A preserves node attributions, the task of NRL is to learn a mapping function f : v → r_v ∈ R^d, where r_v is the learned representation of vertex v and d is the dimension of r_v. We study and evaluate related methods including DeepWalk [11], PTE [14], LINE [15] and metapath2vec [3] on two tasks: scholar classification and scholar clustering.

[Classification and clustering tasks carried out with ML algorithms that learn functions]

4.2 Experimental Setup

Based on AceKG, we first select 5 fields of study (FOS) and 5 main subfields of each. Then we extract all scholars, papers and venues in those fields of study respectively to construct 5 heterogeneous collaboration networks. We also construct 2 larger academic knowledge graphs: (i) we integrate 5 networks above into one graph which contains all the information of 5 fields of study; (ii) we match the eight categories of venues in Google Scholar to those in AceKG. 151 of 160 venues (8 categories x 20 per category) are successfully matched. Then we select all the related papers and scholars to construct one large heterogeneous collaboration network.

[Biology, Computer Science, Economics, Medicine, Physics: 5 main fields, each with 5 subtopics]

[5 subgraphs, one per field, with data on researchers, papers and venues. One subgraph integrating those 5. One subgraph built from the 8 Google Scholar categories mapped onto AceKG categories]

4.3 Evaluation Results

4.3.1 Classification. We adopt logistic regression to conduct scholar classification tasks. Note that in this task 5-fold cross validation is adopted.

It should be noted that there is a significant performance gap between FOS-labeled datasets and the Google-labeled dataset, which is because of the different distribution of papers and scholars. Papers collected in the Google-labeled dataset are published in top venues and consequently few scholars could be active in multiple categories, while there are more cross-field papers and scholars in FOS-labeled datasets.
Moreover, the performance indicates the level of interdiscipline in these fields.

[The classification function was learned better with the Google labels because of intrinsic properties of the data]
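The 5-fold cross validation mentioned above partitions the labeled scholars into five folds and uses each fold once as the test set. The sketch below only generates the index splits, without shuffling or stratification, which is an assumption of this example; the classifier itself is left out.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once.

    Folds are contiguous blocks; the first (n_samples % k) folds get one
    extra sample. No shuffling is done in this sketch.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
```

The reported score would then be the average of the five per-fold scores, each obtained by training on `train_idx` and evaluating on `test_idx`.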

4.3.2 Clustering. Based on the same node representation in scholar classification task, we further conduct scholar clustering experiment with k-means algorithm to evaluate the models’ performance. All clustering experiments are conducted 10 times and the average performance is reported.

[Same as for the classification function]
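The protocol of running k-means 10 times and averaging can be sketched as below. This is a plain Lloyd's k-means with random initialization; purity is used as the score purely as an assumption, since the excerpt does not name the clustering metric, and the 2-D points stand in for learned scholar embeddings.

```python
import random
from collections import Counter


def kmeans_labels(points, k, seed, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of points that belong to the majority class of their cluster."""
    total = 0
    for c in set(labels):
        classes = [t for label, t in zip(labels, truth) if label == c]
        total += Counter(classes).most_common(1)[0][1]
    return total / len(truth)


# Toy 2-D "embeddings" for two well-separated groups (illustrative only).
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
truth = ["cs", "cs", "cs", "physics", "physics", "physics"]

# Ten runs with different seeds, then the average, as in the protocol above.
scores = [purity(kmeans_labels(points, k=2, seed=s), truth) for s in range(10)]
average_purity = sum(scores) / len(scores)
```

Averaging over several seeds matters because k-means depends on its random initialization, so a single run can over- or under-state the quality of the embeddings.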

5 FUTURE DIRECTIONS

Cooperation prediction. To predict a researcher’s future cooperation behavior is an interesting topic in academic mining, and many current works have contributed to it by considering previous cooperators, neighborhood, citation relations and other side information. ... Given this situation, one may perform cooperation prediction based on the NRL results, which can represent the features of a researcher better and may provide some help to cooperation prediction task.

Author disambiguation. ... the help of AceKG, author disambiguation can be conducted conveniently. The network structure and node attributes in AceKG can enhance the author disambiguation performance.

Finding rising star. Researchers have raised various algorithms for this based on publication increasing rate, mentoring relations and some other factors.

[Other problems in the academic domain that could be addressed with ML using AceKG data]
