Wang, R., Yan, Y., Wang, J., Jia, Y., Zhang, Y., Zhang, W., & Wang, X. (2018). AceKG: A Large-scale Knowledge Graph for Academic Data Mining. Proceedings of the 27th ACM International Conference on Information and Knowledge Management.
Abstract
Most existing knowledge graphs (KGs) in academic domains suffer from problems of insufficient multi-relational information, name ambiguity and improper data format for large-scale machine processing.
In this paper, we present AceKG, a new large-scale KG in academic domain. AceKG not only provides clean academic information, but also offers a large-scale benchmark dataset for researchers to conduct challenging data mining projects including link prediction, community detection and scholar classification. Specifically, AceKG describes 3.13 billion triples of academic facts based on a consistent ontology, including necessary properties of papers, authors, fields of study, venues and institutes, as well as the relations among them. To enrich the proposed knowledge graph, we also perform entity alignment with existing databases and rule-based inference.
Based on AceKG, we conduct experiments of three typical academic data mining tasks and evaluate several state-of-the-art knowledge embedding and network representation learning approaches on the benchmark datasets built from AceKG. Finally, we discuss promising research directions that benefit from AceKG.
[The contributions are the KG itself, its ontology and disambiguated, aligned instances, and the empirical results of using it in ML tasks (prediction, classification and clustering)]
1 INTRODUCTION
... A knowledge graph, which describes and stores facts as triples, is a multi-relational graph consisting of entities as nodes and relations as different types of edges. ...
[Multi-relational here means a multigraph, not a hyper-relational model; facts are only triples]
In this paper, we propose Academic Knowledge Graph (AceKG), an academic semantic network, which describes 3.13 billion triples of academic facts based on a consistent ontology, including commonly used properties of papers, authors, fields of study, venues, institutes and relations among them. Apart from the knowledge graph itself, we also perform entity alignment with the existing KGs or datasets and some rule-based inferences to further extend it and make it linked with other KGs in the linked open data cloud.
...
[Contributions]
AceKG is fully organized in structured triples, which are machine readable and easy to process.
[Check whether reification is used]
2 THE KNOWLEDGE GRAPH
Link to the site where the dataset and the ontology can be downloaded: https://archive.acemap.info/app/AceKG
2.1 Ontology
All objects (e.g., papers, institutes, authors) are represented as entities in AceKG. Two entities can stand in a relation. Commonly used attributes of each entity, including numbers, dates, strings and other literals, are represented as well. Similar entities are grouped into classes. In total, AceKG defines 5 classes of academic entities: Papers, Authors, Fields of study, Venues and Institutes.
[Model similar to MAG in terms of entities; binary relations (not a hypergraph)]
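To make the data model concrete, here is a minimal Python sketch of how AceKG-style facts can be held as (head, relation, tail) triples over the five entity classes. The identifiers and most relation names are hypothetical; only "is written by" and "is part of" are relation names mentioned in the paper.

# Hypothetical AceKG-style facts as (head, relation, tail) triples.
triples = [
    ("paper:1001", "is_written_by",      "author:42"),        # Paper  -> Author
    ("paper:1001", "is_published_in",    "venue:CIKM"),        # Paper  -> Venue
    ("paper:1001", "has_field",          "fos:knowledge_graphs"),  # Paper -> Field of study
    ("fos:knowledge_graphs", "is_part_of", "fos:artificial_intelligence"),  # Field -> parent field
    ("author:42",  "is_affiliated_with", "institute:SJTU"),    # Author -> Institute
]

# Literal attributes (numbers, dates, strings) hang off entities the same way.
attributes = [
    ("paper:1001", "publish_year", 2018),
    ("venue:CIKM", "venue_name", "CIKM"),
]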
2.2 Entity alignment
In order to make AceKG more connected and comprehensive, we map a large part of the computer science papers in AceKG to the papers stored in the IEEE, ACM and DBLP databases.
[Unique identifiers are used to disambiguate author names and to link to unique identifiers in external databases]
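The excerpt does not spell out the matching procedure. The sketch below shows one plausible heuristic, normalized-title matching, for linking AceKG paper records to external identifiers such as DBLP keys; the function names and the exact matching rule are assumptions, not the authors' method.

import re

def normalize_title(title):
    # Lower-case and collapse punctuation/whitespace so near-identical titles compare equal.
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def align_papers(acekg_papers, external_papers):
    # acekg_papers / external_papers: iterables of (id, title) pairs.
    # Returns {acekg_id: external_id} for exact normalized-title matches.
    index = {normalize_title(title): ext_id for ext_id, title in external_papers}
    return {
        ace_id: index[normalize_title(title)]
        for ace_id, title in acekg_papers
        if normalize_title(title) in index
    }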
2.3 Inference
Rule-based inference on a knowledge graph is a typical but critical way to enrich it.
[Indirect relations, more than one hop, derived from rules]
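A minimal sketch of the kind of one-hop rule composition this section describes: two adjacent triples are joined to materialize a new, indirect relation. The relation names (is_written_by, is_affiliated_with, is_produced_at) are hypothetical examples, not taken from the paper.

def apply_rule(triples, rel_ab, rel_bc, new_rel):
    # Rule: (a, rel_ab, b) and (b, rel_bc, c)  =>  (a, new_rel, c)
    tails_by_head = {}
    for h, r, t in triples:
        if r == rel_bc:
            tails_by_head.setdefault(h, []).append(t)
    inferred = set()
    for h, r, t in triples:
        if r == rel_ab:
            for c in tails_by_head.get(t, ()):
                inferred.add((h, new_rel, c))
    return inferred

# Example: infer which institute a paper "comes from" via its authors.
facts = [
    ("paper:1001", "is_written_by", "author:42"),
    ("author:42", "is_affiliated_with", "institute:SJTU"),
]
print(apply_rule(facts, "is_written_by", "is_affiliated_with", "is_produced_at"))
# -> {('paper:1001', 'is_produced_at', 'institute:SJTU')}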
3 KNOWLEDGE EMBEDDING
3.1 Task Definition
Given a set S of triples (h, r, t) composed of two entities h, t ∈ E (the set of entities) and a relation r ∈ R (the set of relationships), knowledge embedding maps each entity to a k-dimensional vector in the embedding space, and defines a scoring function to evaluate the plausibility of the triple (h, r, t) in the knowledge graph. ... We construct a new benchmark dataset (denoted as AK18K in the rest of this section) extracted from AceKG for knowledge embedding. We will show how it differs from FB15K and WN18 in Section 3.2. We compare the following algorithms in our experiments: TransE [1], TransH [18], DistMult [19], ComplEx [17], HolE [10].
[New benchmark dataset for evaluating embedding-based link prediction algorithms in the academic data domain]
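As a concrete instance of the scoring function in the task definition, here is a toy TransE-style scorer: each entity and relation is mapped to a k-dimensional vector, and a triple (h, r, t) is scored by how close h + r is to t. The embeddings below are random placeholders standing in for trained ones; the entity and relation names are the hypothetical ones used earlier.

import numpy as np

k = 50  # embedding dimension

# Toy embedding tables; in practice these are learned, e.g. with a margin-based ranking loss.
rng = np.random.default_rng(0)
entity_emb = {e: rng.normal(size=k) for e in ["paper:1001", "author:42", "venue:CIKM"]}
relation_emb = {r: rng.normal(size=k) for r in ["is_written_by", "is_published_in"]}

def transe_score(h, r, t):
    # TransE plausibility: the smaller ||h + r - t||, the more plausible the triple,
    # so the negated norm is returned (higher = better).
    return -np.linalg.norm(entity_emb[h] + relation_emb[r] - entity_emb[t])

print(transe_score("paper:1001", "is_written_by", "author:42"))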
3.2 Experimental Setup
To extract AK18K from AceKG, we first select 68 critical international venues (conferences and journals) and influential papers published in them. Then we add the triples of authors, fields and institutes. Finally, the train/valid/test datasets are split randomly.
[How to build a dataset and reduce bias]
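The paper only says the train/valid/test sets are split randomly. The sketch below adds a common extra precaution, keeping in train every entity and relation that appears in valid/test, which is an assumption about good practice rather than the authors' exact procedure.

import random

def split_triples(triples, valid_frac=0.05, test_frac=0.05, seed=0):
    # Randomly split triples; then drop valid/test triples whose entities or relations
    # never occur in train (an assumption; the paper only states the split is random).
    triples = list(triples)
    random.Random(seed).shuffle(triples)
    n_valid = int(len(triples) * valid_frac)
    n_test = int(len(triples) * test_frac)
    train = triples[n_valid + n_test:]
    seen_entities = {x for h, r, t in train for x in (h, t)}
    seen_relations = {r for h, r, t in train}

    def in_train_vocab(triple):
        h, r, t = triple
        return h in seen_entities and t in seen_entities and r in seen_relations

    valid = [tr for tr in triples[:n_valid] if in_train_vocab(tr)]
    test = [tr for tr in triples[n_valid:n_valid + n_test] if in_train_vocab(tr)]
    return train, valid, test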
3.3 Evaluation Results
Although 94.4% of relations in our knowledge graph are many-to-many, which should favor TransH, TransE shows its advantage in modeling sparse and simple knowledge graphs, while TransH fails to achieve better results. The reason may be that there are only 7 relationship types, which is a small number. On the other hand, HolE and ComplEx achieve the most significant performance on the other metrics, especially on hit@1 (83.8% and 75.4%) and on filtered MRR (0.482 and 0.440), which confirms their advantage in modeling antisymmetric relations, since all of our relations are antisymmetric, such as "field is part of" and "paper is written by".
[TransE outperformed TransH since the number of relation types was small. With HolE and ComplEx, the antisymmetric nature of the relations prevailed]
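For reference, a sketch of how the filtered MRR and hit@1 figures quoted above are typically computed for tail prediction (the head-prediction direction is analogous and usually averaged in). The generic score function stands for whichever embedding model is being evaluated.

import numpy as np

def filtered_metrics(test_triples, all_true, entities, score):
    # Filtered tail-prediction ranking: corrupt the tail with every entity, but skip
    # corruptions that are themselves known true triples. score(h, r, t): higher = better.
    ranks = []
    for h, r, t in test_triples:
        candidates = [e for e in entities if e == t or (h, r, e) not in all_true]
        true_score = score(h, r, t)
        rank = 1 + sum(1 for e in candidates if e != t and score(h, r, e) > true_score)
        ranks.append(rank)
    ranks = np.array(ranks, dtype=float)
    return {"filtered MRR": float((1.0 / ranks).mean()), "hit@1": float((ranks == 1).mean())}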
4 NETWORK REPRESENTATION LEARNING
4.1 Task Definition
Given a network G = (V, E, A), where V denotes the vertex set, E denotes the network topology structure and A preserves node attributes, the task of NRL is to learn a mapping function f : v → r_v ∈ R^d, where r_v is the learned representation of vertex v and d is the dimension of r_v. We study and evaluate related methods including DeepWalk [11], PTE [14], LINE [15] and metapath2vec [3] on two tasks: scholar classification and scholar clustering.
[Classification and clustering tasks performed with ML algorithms that learn mapping functions]
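A minimal sketch of the walk-generation step behind DeepWalk-style NRL: truncated random walks over the graph are collected and then fed as "sentences" to a skip-gram model to obtain the representations r_v. metapath2vec differs by constraining walks to follow a meta-path over node types; that constraint is not shown here.

import random

def deepwalk_walks(adj, num_walks=10, walk_length=40, seed=0):
    # adj: adjacency dict {node: [neighbor, ...]} of the collaboration network.
    # The walks are later treated as sentences for a skip-gram model
    # (e.g. gensim's Word2Vec) to learn the d-dimensional representations r_v.
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks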
4.2 Experimental Setup
Based on AceKG, we first select 5 fields of study (FOS) and 5 main subfields of each. Then we extract all scholars, papers and venues in those fields of study to construct 5 heterogeneous collaboration networks. We also construct 2 larger academic knowledge graphs: (i) we integrate the 5 networks above into one graph which contains all the information of the 5 fields of study; (ii) we match the eight categories of venues in Google Scholar to those in AceKG. 151 of 160 venues (8 categories × 20 per category) are successfully matched. Then we select all the related papers and scholars to construct one large heterogeneous collaboration network.
[Biology, Computer Science, Economics, Medicine, Physics: 5 main fields, each with 5 subfields]
[5 subgraphs, one per field, with data on scholars, papers and venues. One subgraph integrating these 5. One subgraph built from the 8 Google Scholar venue categories mapped to AceKG categories]
4.3 Evaluation Results
4.3.1 Classification. We adopt logistic regression to conduct the scholar classification task. Note that 5-fold cross validation is adopted in this task.
It should be noted that there is a significant performance gap between the FOS-labeled datasets and the Google-labeled dataset, which is due to the different distribution of papers and scholars. Papers collected in the Google-labeled dataset are published in top venues, and consequently few scholars are active in multiple categories, while there are more cross-field papers and scholars in the FOS-labeled datasets.
Moreover, the performance reflects the level of interdisciplinarity in these fields.
[The classification function was learned better with the Google labels, due to intrinsic characteristics of the data]
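A sketch of the classification protocol described above, using scikit-learn's logistic regression with 5-fold cross validation. The embedding matrix X and labels y are random placeholders standing in for the learned scholar representations and their field (or Google Scholar category) labels; the F1 scoring choice is an assumption.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: X would be the learned scholar representations (one row per scholar),
# y the field-of-study / Google Scholar category labels.
X = np.random.default_rng(0).normal(size=(1000, 128))
y = np.random.default_rng(1).integers(0, 5, size=1000)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")  # 5-fold cross validation
print(scores.mean())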
4.3.2 Clustering. Based on the same node representations as in the scholar classification task, we further conduct a scholar clustering experiment with the k-means algorithm to evaluate the models' performance. All clustering experiments are conducted 10 times and the average performance is reported.
[Same as for the classification function]
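And the corresponding clustering protocol: k-means on the same scholar representations, repeated 10 times with the average reported. NMI is used here as a typical clustering metric; the excerpt does not state which metric the authors report, and X and y are placeholders as before.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

X = np.random.default_rng(0).normal(size=(1000, 128))   # placeholder scholar embeddings
y = np.random.default_rng(1).integers(0, 5, size=1000)  # placeholder ground-truth fields

nmi_per_run = []
for seed in range(10):  # 10 runs, averaged, as in the paper
    labels = KMeans(n_clusters=5, n_init=10, random_state=seed).fit_predict(X)
    nmi_per_run.append(normalized_mutual_info_score(y, labels))
print(np.mean(nmi_per_run))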
5 FUTURE DIRECTIONS
Cooperation prediction. Predicting a researcher's future cooperation behavior is an interesting topic in academic mining, and many existing works have contributed to it by considering previous collaborators, neighborhood, citation relations and other side information. ... Given this situation, one may perform cooperation prediction based on the NRL results, which can represent a researcher's features better and may help the cooperation prediction task.
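As a very rough baseline for this direction (not a method from the paper), one could rank candidate collaborators by the cosine similarity of their NRL embeddings to a given scholar's embedding:

import numpy as np

def top_candidate_collaborators(emb, scholar, known_coauthors, k=5):
    # emb: {scholar_id: embedding vector}; rank all other scholars by cosine similarity,
    # excluding existing coauthors, and return the k most similar ones.
    q = emb[scholar] / np.linalg.norm(emb[scholar])
    scored = []
    for other, v in emb.items():
        if other == scholar or other in known_coauthors:
            continue
        scored.append((float(q @ (v / np.linalg.norm(v))), other))
    return [s for _, s in sorted(scored, reverse=True)[:k]]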
Author disambiguation. ... With the help of AceKG, author disambiguation can be conducted conveniently. The network structure and node attributes in AceKG can enhance author disambiguation performance.
Finding rising star. Researchers have proposed various algorithms for this based on publication growth rate, mentoring relations and other factors.
[Other problems in the academic domain that could be addressed with ML using AceKG data]