Citation Recommendation for Research Papers via Knowledge Graphs - Paper Reading

Brack A., Hoppe A., Ewerth R. (2021) Citation Recommendation for Research Papers via Knowledge Graphs. In: Berget G., Hall M.M., Brenn D., Kumpulainen S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_20

Abstract

Citation recommendation for research papers is a valuable task that can help researchers improve the quality of their work by suggesting relevant related work. Current approaches for this task rely primarily on the text of the papers and the citation network. 

** SPECTER would be the state of the art, in the sense of being the proposal with the best results **

In this paper, we propose to exploit an additional source of information, namely research knowledge graphs (KGs) that interlink research papers based on mentioned scientific concepts. 

** The KG is composed of concepts and documents **

Our experimental results demonstrate that the combination of information from research KGs with existing state-of-the-art approaches is beneficial. Experimental results are presented for the STM-KG (STM: Science, Technology, Medicine), which is an automatically populated knowledge graph based on the scientific concepts extracted from papers of ten domains. 

** STM-KG **

1 Introduction

... recommendation of suitable references for a piece of scientific writing is an important task to (a) improve the quality of future publications, (b) help authors and reviewers to point out additional relevant related work, and (c) discover interesting links to other areas of research.

... (1) local citation recommendation which aims to provide citations for a short passage of text, and (2) global citation recommendation which uses the documents’ full text or abstract as the input. Here, we focus on the task of global citation recommendation.

** Global citation recommendation would be more geared toward related work **

... In this paper, we explore another source of information, that is the set of scientific concepts which are mentioned in the article. The assumptions are (1) that additionally to the article’s text, these provide condensed evidence to the described problem statement, used methodology or evaluation metrics, and (2) that research papers which should be citing each other usually share a similar set of concepts. 

** The link between documents would be the concepts they have in common. With the one-hot vector it is possible to identify both the matches and the differences **

2 Related Work

2.1 Research Knowledge Graphs

Various KGs interlink research papers through metadata (e.g. authors, venues) and citations, or through research artefacts (e.g. datasets).

** MAG, Research Graph, S2ORC, OpenAIRE **

Other initiatives organise scientific knowledge in a structured manner with community effort, such as Gene Ontology, WikiData with encyclopaedic knowledge, or Papers With Code and Open Research Knowledge Graph (ORKG) for research contributions.

Furthermore, various KGs have been populated automatically from research articles. Computer Science Ontology (CSO) is a taxonomy for computer science research areas. Kannan et al. create a multimodal KG for deep learning papers from text and images and the corresponding source code. The AI-KG has been generated from 333,000 research papers in the field of artificial intelligence (AI). It contains five concept types (tasks, methods, metrics, materials, others) linked by 27 relation types. The COVID-19 KG has been populated from the Covid-19 Open Research Dataset (CORD-19) and contains various biological concept entities. Brack et al. generate a KG (STM-KG) for ten science domains with the concept types material, method, process, and data.

** Several examples of KGs built manually or automatically using data and/or metadata from scientific publications **

** AI-KG includes about 14M RDF triples and 1.2M reified statements extracted from 333K research publications in the field of AI, and describes 5 types of entities (tasks, methods, metrics, materials, others) linked by 27 relations. ... AI-KG is available under CC BY 4.0 and can be downloaded as a dump or queried via a SPARQL endpoint. ** 

2.2 Citation Recommendation

Bhagavatula et al. propose a neural network-based document embedding model to retrieve candidate documents for a query document via similarity search and a ranking model to rerank the top-k candidates. Cohan et al. propose a document embedding model named SPECTER (Scientific Paper Embeddings using Citation-informed TransformERs).

** ML approaches over document corpora to solve the problem **
** Text embeddings and the citation graph -> SPECTER **

Graph-based representation learning approaches learn document embeddings via graph convolution networks on the citation graph. However, they require the citation network also at inference time. Other approaches frame citation recommendation as a binary classification task: given a query and a candidate paper, the model learns to predict whether the query paper should cite the candidate paper. The models learn rich relationships between the contents of the two documents via various cross-document attention mechanisms. However, in contrast to the document embedding models, such binary classification models cannot be used for retrieval but only for reranking the top k results, since a query paper has to be compared with all other documents.

** ML approach to estimate the similarity between the paper and what it should cite **

To the best of our knowledge, approaches for citation recommendation that exploit knowledge graphs with scientific concepts have not been proposed yet.

3 Citation Recommendation via a Research Knowledge Graph

... we propose an approach to combine document embeddings learned from textual content and the citation graph together with scientific concepts mentioned in the document.

Let $KG = (D, E, V)$ be a KG, $D$ the set of documents, $E$ the set of concepts, $V \subseteq D \times E$ the set of links between papers and concepts, and $E_d \subseteq E$ the set of concepts mentioned in paper $d \in D$. Let $\text{one\_hot}(e_i) \in \mathbb{R}^{|E|}$ be the one-hot vector for concept $e_i$ in which the $i$-th component equals 1 and all remaining components are 0.

** The KG is based on the concepts that the papers contain **
** One-hot means generating a vector of 0s and 1s, with 1 at the position corresponding to the concept the paper is related to **

Furthermore, let $s_d$ be a document embedding of paper $d$ obtained via an existing document embedding model (e.g. SPECTER [9]). The vector representation $\vec{d}$ of paper $d$ is the concatenation of the concept vector $c_d$ and the document embedding $s_d$:

$$\vec{d} = \begin{pmatrix} c_d \\ s_d \end{pmatrix}$$

** Generate the document embedding vector with some technique and concatenate it with the one-hot vector of the concepts; the result becomes the representation vector of the entity **
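To make the construction concrete, here is a minimal sketch in Python with NumPy. It assumes that the concept vector is the sum of the one-hot vectors of the concepts mentioned in the paper, which the excerpt above does not spell out, and it uses a made-up 3-dimensional embedding in place of a real SPECTER vector. The function names (`one_hot`, `concept_vector`, `paper_vector`) are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def one_hot(index: int, size: int) -> np.ndarray:
    """One-hot vector of length `size` with a 1 at position `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


def concept_vector(concept_ids, num_concepts: int) -> np.ndarray:
    # Assumption: c_d is the sum of the one-hot vectors of the concepts in E_d.
    # E_d is a set, so each concept contributes at most once.
    c = np.zeros(num_concepts)
    for i in set(concept_ids):
        c += one_hot(i, num_concepts)
    return c


def paper_vector(concept_ids, num_concepts: int, doc_embedding) -> np.ndarray:
    """Concatenate the concept vector c_d with the document embedding s_d."""
    c_d = concept_vector(concept_ids, num_concepts)
    s_d = np.asarray(doc_embedding, dtype=float)
    return np.concatenate([c_d, s_d])


# Toy example: 5 concepts in the KG, the paper mentions concepts 0 and 3.
# The 3-dimensional embedding is invented for illustration only.
vec = paper_vector([0, 3], num_concepts=5, doc_embedding=[0.2, -0.1, 0.7])
print(vec)
```

Because the two parts live on different scales, their relative magnitudes influence any similarity computed on the concatenated vector. The excerpt does not say whether the parts are weighted or normalised, so this sketch leaves them as they are.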

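For a query paper $q \in D$ the task is to retrieve the top $k$ results such that papers to be cited appear at the top of the list. We use cosine similarity for retrieval and ranking:

$$\text{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$$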

** The similarity is used to retrieve and rank, but it has to be computed for every entity in the graph **
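The retrieval step can be sketched as a brute-force scan: compute the cosine similarity between the query vector and every paper vector, then keep the k highest scores. This is only an illustration with toy vectors. The function name `cosine_top_k` is hypothetical, and a real system would probably exclude the query paper itself and might use an approximate nearest-neighbour index instead of a full scan.

```python
import numpy as np


def cosine_top_k(query_vec, paper_matrix, k: int):
    """Return the k (row index, similarity) pairs with the highest cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    P = np.asarray(paper_matrix, dtype=float)
    # Product of norms for every (query, paper) pair.
    norms = np.linalg.norm(P, axis=1) * np.linalg.norm(q)
    # Avoid division by zero: an all-zero vector simply gets similarity 0.
    norms = np.where(norms == 0.0, 1.0, norms)
    sims = (P @ q) / norms
    # Sort by descending similarity and keep the first k rows.
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy paper vectors (one row per paper); values are invented for illustration.
papers = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])
ranking = cosine_top_k([1.0, 0.0, 1.0], papers, k=2)
print(ranking)
```

The cost grows linearly with the number of papers, which is the point raised in the note above: every candidate has to be scored against the query.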

4 Experimental Setup

Benchmark Dataset: ... we use the STM-KG [6] as our benchmark dataset ... ten different scientific, technical, and medical domains and comes in two variants: (1) in-domain KG that shares scientific concepts only between papers of the same domain to avoid ambiguity of scientific terms (e.g. neural network in medicine vs. computer science), and (2) cross-domain KG that shares scientific concepts also between domains. 

** If it is a concept, why would the spelling of the word get in the way? **
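One possible reading of the two variants, sketched below as a toy: if concepts are identified by their surface form, the in-domain KG would key a concept by (domain, label), while the cross-domain KG would key it by label alone. This is a simplification made for illustration. The actual STM-KG is built with concept extraction and coreference resolution, and the function `build_links` and the sample mentions are hypothetical.

```python
from collections import defaultdict


def build_links(mentions, cross_domain: bool):
    """Map each concept key to the set of papers that mention it.

    mentions: iterable of (paper_id, domain, concept_label) tuples.
    """
    concept_to_papers = defaultdict(set)
    for paper_id, domain, label in mentions:
        normalised = label.lower()
        # Cross-domain: the label alone identifies the concept.
        # In-domain: the same label in two domains yields two distinct concepts.
        key = normalised if cross_domain else (domain, normalised)
        concept_to_papers[key].add(paper_id)
    return dict(concept_to_papers)


# Invented sample mentions, for illustration only.
mentions = [
    ("p1", "Medicine", "neural network"),
    ("p2", "Computer Science", "neural network"),
    ("p3", "Computer Science", "Neural Network"),
]
in_domain = build_links(mentions, cross_domain=False)
cross = build_links(mentions, cross_domain=True)
print(in_domain)
print(cross)
```

Under this reading, the medical paper p1 shares the concept with p2 and p3 only in the cross-domain variant, which is where the ambiguity of a term such as "neural network" could introduce misleading links.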

SciBERT: Document embedding is also the average of the contextual word embeddings obtained from the abstract of the paper via SciBERT that is based on BERT and has been pre-trained on scientific text. It has demonstrated superior performance in various downstream tasks on research papers.

** Model pre-trained on a corpus of scientific text in English **

Evaluation: To evaluate the quality of the ranking results for the top k citation recommendations, we use Mean Average Precision (MAP@k) [2,21] as in related work. MAP@k is the mean of the Average Precision at k (AP@k) scores over the query documents. The metric AP@k assumes that a user is interested in finding many relevant documents and is thus an appropriate evaluation metric for citation recommendation.
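The metric can be sketched in a few lines of Python. AP@k has more than one normalisation in the literature. This sketch assumes the variant that divides by min(k, number of relevant documents), which may differ from the one used in the paper. The function names and the toy rankings are hypothetical, and each ranked list is assumed to contain no duplicates.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """AP@k for one query: average of precision values at the ranks of relevant hits."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalise by min(k, |relevant|); other variants exist.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings, relevants, k: int) -> float:
    """MAP@k: mean of AP@k over all query documents."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Two toy queries with invented document ids.
rankings = [["a", "b", "c", "d"], ["x", "y"]]
relevants = [{"a", "c"}, {"y"}]
ap_first = average_precision_at_k(rankings[0], relevants[0], k=3)
map_score = mean_average_precision_at_k(rankings, relevants, k=3)
print(ap_first, map_score)
```

For the first query the relevant papers appear at ranks 1 and 3, so the precision values 1/1 and 2/3 are averaged. A ranking that places cited papers higher scores better, which matches the intuition that a user wants many relevant documents near the top.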

Our results indicate that the exploitation of a research KG as an additional source of information improves the task of citation recommendation.

** Concepts may be related through parent and child concepts, or through other kinds of relation such as whole/part, which could be represented in the KG but end up being lost when converting to one-hot **

GitHub -> https://github.com/arthurbra/citation-recommendation-kg 

STM-KG GitHub -> https://github.com/arthurbra/stm-coref

STM-KG paper

Brack A., Müller D.U., Hoppe A., Ewerth R. (2021) Coreference Resolution in Research Papers from Multiple Domains. In: Hiemstra D., Moens MF., Mothe J., Perego R., Potthast M., Sebastiani F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_6 
