Brack A., Hoppe A., Ewerth R. (2021) Citation Recommendation for Research Papers via Knowledge Graphs. In: Berget G., Hall M.M., Brenn D., Kumpulainen S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_20
Abstract
Citation recommendation for research papers is a valuable task that can help researchers improve the quality of their work by suggesting relevant related work. Current approaches for this task rely primarily on the text of the papers and the citation network.
** SPECTER would be the state of the art, in the sense of being the approach with the best results **
In this paper, we propose to exploit an additional source of information, namely research knowledge graphs (KGs) that interlink research papers based on mentioned scientific concepts.
** The KG is composed of concepts and documents **
Our experimental results demonstrate that the combination of information from research KGs with existing state-of-the-art approaches is beneficial. Experimental results are presented for the STM-KG (STM: Science, Technology, Medicine), which is an automatically populated knowledge graph based on the scientific concepts extracted from papers of ten domains.
** STM-KG **
1 Introduction
... recommendation of suitable references for a piece of scientific writing is an important task to (a) improve the quality of future publications, (b) help authors and reviewers to point out additional relevant related work, and (c) discover interesting links to other areas of research.
... (1) local citation recommendation, which aims to provide citations for a short passage of text, and (2) global citation recommendation, which uses the documents’ full text or abstract as the input. Here, we focus on the task of global citation recommendation.
** Global citation recommendation would be geared more toward related work **
... In this paper, we explore another source of information, that is the set of scientific concepts which are mentioned in the article. The assumptions are (1) that additionally to the article’s text, these provide condensed evidence to the described problem statement, used methodology or evaluation metrics, and (2) that research papers which should be citing each other usually share a similar set of concepts.
** The link between documents would be the concepts they have in common. With the one-hot vector it is possible to identify both the matches and the differences **
2 Related Work
2.1 Research Knowledge Graphs
Various KGs interlink research papers through metadata (e.g. authors, venues) and citations, or through research artefacts (e.g. datasets).
Other initiatives organise scientific knowledge in a structured manner with community effort, such as Gene Ontology, WikiData with encyclopaedic knowledge, or Papers With Code and Open Research Knowledge Graph (ORKG) for research contributions.
Furthermore, various KGs have been populated automatically from research articles. Computer Science Ontology (CSO) is a taxonomy for computer science research areas. Kannan et al. create a multimodal KG for deep learning papers from text and images and the corresponding source code. The AI-KG has been generated from 333,000 research papers in the field of artificial intelligence (AI). It contains five concept types (tasks, methods, metrics, materials, others) linked by 27 relation types. The COVID-19 KG has been populated from the Covid-19 Open Research Dataset (CORD-19) and contains various biological concept entities. Brack et al. generate a KG (STM-KG) for ten science domains with the concept types material, method, process, and data.
** Several examples of KGs built manually or automatically from the data and/or metadata of scientific publications **
** AI-KG includes about 14M RDF triples and 1.2M reified statements extracted from 333K research publications in the field of AI, and describes 5 types of entities (tasks, methods, metrics, materials, others) linked by 27 relations. ... AI-KG is available under CC BY 4.0 and can be downloaded as a dump or queried via a SPARQL endpoint. **
2.2 Citation Recommendation
Bhagavatula et al. propose a neural network-based document embedding model to retrieve candidate documents for a query document via similarity search and a ranking model to rerank the top-k candidates. Cohan et al. propose a document embedding model named SPECTER (Scientific Paper Embeddings using Citation-informed TransformERs).
** ML approaches over a document corpus to solve the problem ** ** Text embeddings and the citation graph -> SPECTER **
Graph-based representation learning approaches learn document embeddings via graph convolution networks on the citation graph. However, they require the citation network also at inference time. Other approaches frame citation recommendation as a binary classification task: given a query and a candidate paper, the model learns to predict whether the query paper should cite the candidate paper. The models learn rich relationships between the contents of the two documents via various cross-document attention mechanisms. However, in contrast to the document embedding models, such binary classification models cannot be used for retrieval but only for reranking the top k results, since a query paper has to be compared with all other documents.
** ML approach to estimate the similarity between the paper and the one that should be cited **
To the best of our knowledge, approaches for citation recommendation that exploit knowledge graphs with scientific concepts have not been proposed yet.
3 Citation Recommendation via a Research Knowledge Graph
... we propose an approach to combine document embeddings learned from textual content and the citation graph together with scientific concepts mentioned in the document.
Let $KG = (D, E, V)$ be a KG, $D$ the set of documents, $E$ the set of concepts, $V \subseteq D \times E$ the set of links between papers and concepts, and $E_d \subseteq E$ the set of concepts mentioned in paper $d \in D$. Let $\mathrm{one\_hot}(e_i) \in \mathbb{R}^{|E|}$ be the one-hot vector for concept $e_i$ in which the $i$-th component equals 1 and all remaining components are 0.
** The KG is based on the concepts that the papers contain ** ** One-hot means generating a vector of 0s and 1s, with a 1 at the position of the concept the paper is related to **
Furthermore, let $s_d$ be a document embedding of paper $d$ obtained via an existing document embedding model (e.g. SPECTER [9]). The vector representation $\vec{d}$ of paper $d$ is the concatenation of the concept vector $c_d$ and the document embedding $s_d$: $\vec{d} = (c_d ; s_d)$
** Generate the document's embedding vector with some technique and concatenate it with the one-hot concept vector; the result becomes the vector representation of the entity **
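To make the concatenation concrete, here is a minimal Python sketch. It assumes that the concept vector c_d is the sum of the one-hot vectors of the concepts in E_d (the excerpt above does not show that definition) and that no weighting or normalisation is applied before concatenating. The vocabulary, the concept names, and the 2-dimensional embedding are hypothetical toy values, not data from the paper.

```python
from typing import Dict, List, Sequence, Set


def concept_vector(paper_concepts: Set[str], concept_index: Dict[str, int]) -> List[float]:
    """Sum of the one-hot vectors of the concepts mentioned in a paper (assumed definition of c_d)."""
    vec = [0.0] * len(concept_index)
    for concept in paper_concepts:
        # one_hot(e_i) has a 1 at position i and 0 everywhere else
        vec[concept_index[concept]] += 1.0
    return vec


def paper_vector(
    paper_concepts: Set[str],
    concept_index: Dict[str, int],
    doc_embedding: Sequence[float],
) -> List[float]:
    """Concatenate the concept vector c_d with the document embedding s_d."""
    return concept_vector(paper_concepts, concept_index) + list(doc_embedding)


# Toy concept vocabulary E with three concepts (hypothetical names).
concept_index = {"neural network": 0, "precision": 1, "x-ray": 2}

# Hypothetical 2-dimensional embedding standing in for s_d (SPECTER vectors are much larger).
d = paper_vector({"neural network", "x-ray"}, concept_index, [0.5, -0.25])
print(d)  # [1.0, 0.0, 1.0, 0.5, -0.25]
```

In practice |E| is large and each paper mentions few concepts, so a sparse representation would probably be preferable to the dense lists used here.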
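For a query paper $q \in D$ the task is to retrieve the top $k$ results such that papers to be cited appear at the top of the list. We use cosine similarity for retrieval and ranking: $\mathrm{sim}(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$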
** The similarity is used to retrieve and rank, but it has to be computed against every entity in the graph **
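The retrieval step can be sketched as a brute-force scan: score every other paper against the query with cosine similarity and keep the k best. This is only an illustration, not the authors' implementation. The function names, the paper ids, and the toy vectors are hypothetical, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)


def top_k(query_id: str, vectors: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Score every other paper against the query and return the k most similar."""
    query = vectors[query_id]
    scored = [(pid, cosine(query, vec)) for pid, vec in vectors.items() if pid != query_id]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical paper vectors (in the approach above these would be the concatenated vectors).
vectors = {
    "q": [1.0, 0.0, 1.0],
    "a": [1.0, 0.0, 1.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 0.0],
}
ranking = top_k("q", vectors, k=2)
print([pid for pid, _ in ranking])  # ['a', 'c']
```

The scan is linear in the number of papers per query, which matches the observation in the note above. An approximate nearest-neighbour index would be the usual way to avoid it on a large corpus.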
4 Experimental Setup
Benchmark Dataset: ... we use the STM-KG [6] as our benchmark dataset ... ten different scientific, technical, and medical domains and comes in two variants: (1) in-domain KG that shares scientific concepts only between papers of the same domain to avoid ambiguity of scientific terms (e.g. neural network in medicine vs. computer science), and (2) cross-domain KG that shares scientific concepts also between domains.
** If it is a concept, why would the spelling of the word get in the way? **
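One way to picture the difference between the two variants is to key each concept either by its surface form alone (cross-domain) or by the pair of domain and surface form (in-domain). The sketch below only illustrates that sharing rule. The actual STM-KG construction is more involved, and the `Mention` type, the `domain::term` key format, and the toy mentions are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# (domain, surface form) pairs for concept mentions; hypothetical toy data.
Mention = Tuple[str, str]

paper_med: List[Mention] = [("medicine", "neural network"), ("medicine", "x-ray")]
paper_cs: List[Mention] = [("computer science", "neural network"), ("computer science", "precision")]


def concept_ids(mentions: List[Mention], cross_domain: bool) -> Set[str]:
    """Map mentions to concept identifiers under one of the two sharing rules."""
    if cross_domain:
        # Cross-domain: the same term is one concept regardless of domain.
        return {term for _, term in mentions}
    # In-domain: the same term in two domains yields two distinct concepts.
    return {f"{domain}::{term}" for domain, term in mentions}


shared_cross = concept_ids(paper_med, True) & concept_ids(paper_cs, True)
shared_in = concept_ids(paper_med, False) & concept_ids(paper_cs, False)
print(shared_cross)  # {'neural network'}
print(shared_in)     # set()
```

Under the in-domain rule the two papers share no concept, so an ambiguous term such as "neural network" cannot link a medical paper to a computer science paper.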
SciBERT: Document embedding is also the average of the contextual word embeddings obtained from the abstract of the paper via SciBERT that is based on BERT and has been pre-trained on scientific text. It has demonstrated superior performance in various downstream tasks on research papers.
** Model pre-trained on a corpus of scientific text in English **
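The averaging step described for SciBERT amounts to mean pooling over the token vectors of the abstract. The sketch below shows only that pooling, on hypothetical 2-dimensional token vectors. Obtaining the real contextual embeddings from SciBERT is outside its scope, and the function name `mean_pool` is an assumption of this example.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a sequence of token vectors component by component."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three hypothetical token embeddings standing in for the contextual vectors of an abstract.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(mean_pool(token_vectors))  # [3.0, 2.0]
```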
Evaluation: To evaluate the quality of the ranking results for the top k citation recommendations, we use Mean Average Precision (MAP@k) [2,21] as in related work. MAP@k is the mean of the Average Precision at k (AP@k) scores over the query documents. The metric AP@k assumes that a user is interested in finding many relevant documents and is thus an appropriate evaluation metric for citation recommendation.
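As a rough illustration of the metric, the sketch below computes AP@k and MAP@k for toy rankings. Several normalisations of AP@k are in use. This version divides by min(k, number of relevant papers), which is an assumption and may differ from the exact variant used in the paper. The query ids and paper ids are hypothetical.

```python
from typing import Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """AP@k: precision at each rank where a relevant paper appears, summed and normalised."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalisation: best achievable number of hits within the top k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int
) -> float:
    """MAP@k: mean of AP@k over all query documents."""
    scores = [average_precision_at_k(rankings[q], relevant[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical rankings and ground-truth citations for two query papers.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"a", "c"}, "q2": {"y"}}
map_at_3 = mean_average_precision_at_k(rankings, relevant, k=3)
print(round(map_at_3, 3))
```

For q1 the relevant papers appear at ranks 1 and 3, giving (1/1 + 2/3) / 2. For q2 the single relevant paper appears at rank 2, giving 1/2. MAP@3 is the mean of those two values.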
Our results indicate that the exploitation of a research KG as an additional source of information improves the task of citation recommendation.
** Concepts may be related to one another through parent/child concepts or other relation types such as whole/part, which could be represented in the KG but end up being lost in the conversion to one-hot **
GitHub -> https://github.com/arthurbra/citation-recommendation-kg
STM-KG GitHub -> https://github.com/arthurbra/stm-coref
STM-KG paper
Brack A., Müller D.U., Hoppe A., Ewerth R. (2021) Coreference Resolution in Research Papers from Multiple Domains. In: Hiemstra D., Moens MF., Mothe J., Perego R., Potthast M., Sebastiani F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_6