Zhihong Shen, Hao Ma, Kuansan Wang:
A Web-scale system for scientific knowledge exploration.
Association for Computational Linguistics (ACL) 2018: 87-92
- identify hundreds of thousands of scientific concepts,
- tag these identified concepts to hundreds of millions of scientific publications by leveraging both text and graph structure, and
- build a six-level concept hierarchy with a subsumption-based model.
Contribution: the largest cross-domain scientific concept (fields-of-study, FoS) ontology published to date, with more than 200 thousand concepts and over one million relationships.
Scalability: Traditionally, academic discipline and concept taxonomies have been curated manually on a scale of hundreds or thousands, which is insufficient to model the richness of academic concepts across all domains. Consequently, this low concept coverage also limits the exploration experience over hundreds of millions of scientific publications.
Step 1
We formulate concept discovery as a knowledge base type prediction problem and use graph link analysis to guide the process.
Wikipedia articles are the source of concept discovery, and each article corresponds to an entity in a general in-house knowledge base (KB). The authors previously defined 19 top-level ("L0") disciplines (such as physics, medicine) and 294 second-level ("L1") sub-domains (examples are machine learning, algebra), manually curated by referencing an existing classification (see https://science-metrix.com/en/classification *), and obtained their corresponding Wikipedia entities in the KB.
* "A number of approaches have been used to design journal-level taxonomies or ontologies, and the scholarly research and practical application of these systems have revealed their various benefits and limitations. To date, however, no single classification scheme has been widely adopted by the international bibliometric community."
= Input data: 2,000 initial entities in the FoS set
Graph link analysis: To drive the process of exploring new FoS candidates, we apply the intuition that if the majority of an entity’s nearest neighbours are FoS, then it is highly likely an FoS as well. To calculate nearest neighbours, a distance measure between two Wikipedia entities is required. We use an effective and low-cost approach based on Wikipedia link analysis to compute semantic closeness (Milne and Witten, 2008). We label a Wikipedia entity as an FoS candidate if more than K (K in [35, 45]) of its top N (N = 100) nearest neighbours are in the current FoS set.
= For each Wikipedia entity e(W) that does not yet belong to the FoS set, retrieve its 100 nearest neighbours and check whether, among those neighbouring entities {ve(W)}, between 35 and 45 are FoS entities e(FoS); if so, e(W) becomes a candidate FoS entity. If the candidate entity's type, according to an in-house type KB (Freebase), is of interest, then e(W) is added to the FoS set.
= The neighbourhood-exploration process is iterative and repeats until no new candidate entities are found (!)
= Result: 228 thousand FoS entities
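A minimal sketch of this iterative candidate discovery, not the authors' code: the toy `links` graph, the scaled-down N/K values, and the `discover_fos` / `relatedness` helpers are illustrative assumptions; the relatedness function spells out the Milne & Witten (2008) link-based measure referenced above.

```python
import math

# Hypothetical toy input: Wikipedia entity -> set of entities it links to.
# The real system uses the full Wikipedia link graph.
links = {
    "machine_learning": {"algebra", "statistics", "deep_learning", "physics"},
    "deep_learning":    {"machine_learning", "statistics", "neural_network"},
    "neural_network":   {"machine_learning", "deep_learning", "statistics"},
    "statistics":       {"algebra", "machine_learning", "probability"},
    "algebra":          {"statistics", "probability"},
    "probability":      {"statistics", "algebra"},
    "physics":          {"algebra", "statistics"},
}
N_WIKI = len(links)  # stand-in for the total number of Wikipedia articles

def relatedness(a: str, b: str) -> float:
    """Milne & Witten (2008) link-based semantic relatedness (higher = closer)."""
    A, B = links[a], links[b]
    inter = len(A & B)
    if inter == 0:
        return 0.0
    num = math.log(max(len(A), len(B))) - math.log(inter)
    den = math.log(N_WIKI) - math.log(min(len(A), len(B)))
    return 1.0 - num / den if den > 0 else 0.0

def discover_fos(seed_fos, top_n=3, k=2, type_of_interest=lambda e: True):
    """Iteratively grow the FoS set: an entity becomes a candidate when at
    least k of its top_n nearest neighbours are already FoS (paper: N = 100,
    K in [35, 45]); candidates are kept only if their KB type is of interest."""
    fos = set(seed_fos)
    while True:
        new = set()
        for e in links.keys() - fos:
            neighbours = sorted(links.keys() - {e},
                                key=lambda x: relatedness(e, x), reverse=True)[:top_n]
            if sum(n in fos for n in neighbours) >= k and type_of_interest(e):
                new.add(e)
        if not new:  # stop when no new candidates are found
            break
        fos |= new
    return fos

print(discover_fos({"machine_learning", "algebra"}))
```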
Step 2
We formulate the concept tagging as a multi-label classification problem; i.e. each publication could be tagged with multiple FoS as appropriate.
We first define simple representing text (or SRT) and extended representing text (or ERT) as the concept’s and publication’s textual representations.
SRT is the text used to describe the academic entity itself.
- a publishing venue’s full name (i.e. the journal name or the conference name)
- the first paragraph of a concept’s Wikipedia article
- textual meta data, such as title, keywords, and abstract of publications
ERT is the extension of SRT and leverages the graph structural information to include textual information from its neighbouring nodes in MAG.
- a venue's ERT is the concatenation of its own SRT and the SRTs of a sampled subset of publications from that venue
- a concept's ERT is obtained by manually curating concept-venue pairs and aggregating the ERTs of the venues associated with that concept
- a publication's ERT includes the SRTs of its citations and references and the ERT of its linked publishing venue.
= a venue's ERT depends on its own SRT and on the SRTs of a sample of its publications
= a concept's ERT depends on the manually defined venue-concept relationships and on the ERTs of the venues associated with that concept
= a publication's ERT depends on its own SRT, the SRTs of its citations and references, and the ERT of its publishing venue.
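A small sketch of how these textual representations compose, following the three notes above; the record layout and field names are illustrative assumptions, not MAG's schema.

```python
import random

# Hypothetical toy records standing in for MAG venues and publications.
venues = {"V1": {"srt": "Journal of Machine Learning Research",
                 "papers": ["P1", "P2", "P3"]}}
papers = {"P1": {"srt": "title keywords abstract of P1", "venue": "V1",
                 "citations": ["P2"], "references": ["P3"]},
          "P2": {"srt": "title keywords abstract of P2", "venue": "V1",
                 "citations": [], "references": []},
          "P3": {"srt": "title keywords abstract of P3", "venue": "V1",
                 "citations": [], "references": []}}
concept_venues = {"machine learning": ["V1"]}  # manually curated concept-venue pairs

def venue_ert(v, sample_size=2):
    """Venue ERT = venue SRT + SRTs of a sampled subset of its publications."""
    pool = venues[v]["papers"]
    sample = random.sample(pool, min(sample_size, len(pool)))
    return " ".join([venues[v]["srt"]] + [papers[p]["srt"] for p in sample])

def concept_ert(c):
    """Concept ERT = aggregated ERTs of the venues curated for that concept."""
    return " ".join(venue_ert(v) for v in concept_venues[c])

def paper_ert(p):
    """Publication ERT = own SRT + citation/reference SRTs + the venue's ERT."""
    rec = papers[p]
    neighbour_srts = [papers[q]["srt"] for q in rec["citations"] + rec["references"]]
    return " ".join([rec["srt"]] + neighbour_srts + [venue_ert(rec["venue"])])

print(paper_ert("P1"))
```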
Four types of features are extracted from the text representations: bag-of-words (BoW), bag-of-entities (BoE), embedding-of-words (EoW), and embedding-of-entities (EoE).
BoW is a simplifying representation of a text (such as a sentence or a document) as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
EoW is the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.
= The features represent each publication, venue, and concept in a multi-dimensional vector space (the number of dimensions depends on the number of words or entities involved)
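A minimal sketch of building two of the four feature types (BoW and EoW) and concatenating them; the toy corpus, the random 4-dimensional embedding table, and the use of scikit-learn are assumptions for illustration, whereas the paper uses pre-trained embeddings (e.g. fastText, dimension 300) and also adds entity-based features (BoE, EoE).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for SRT/ERT texts.
texts = ["machine learning on large graphs", "algebra and probability theory"]

# Bag-of-words features (BoW): word counts, grammar and order ignored.
bow = CountVectorizer().fit_transform(texts).toarray()

# Embedding-of-words features (EoW): a toy lookup table of word vectors.
rng = np.random.default_rng(0)
vocab = {w for t in texts for w in t.split()}
emb = {w: rng.normal(size=4) for w in vocab}

def eow(text):
    """Average the word vectors of a text into one fixed-size embedding feature."""
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0)

# Concatenate the feature types into one vector h per text (BoE/EoE would be
# appended in the same way if entity annotations were available).
h = np.hstack([bow, np.vstack([eow(t) for t in texts])])
print(h.shape)
```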
These features are concatenated into the vector representation h used in Equations 1 and 2.
Weight w is used to discount different neighbours’ impact as appropriate. We initialize h_{p,s} and h_{v,s} from the textual feature vectors and adopt empirical weight values to directly compute h_{p,e} and h_{v,e}, which keeps the computation scalable.
The similarity of about 50 billion concept-publication pairs is then computed as the cosine similarity between these vector representations. Close to 1 billion pairs are finally kept, based on a confidence-score threshold.
= The vector h_{p,e} is A and the vector h_{c,e} is B in the cosine similarity cos(A, B)
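A tiny sketch of the tagging decision for one pair, assuming the publication and concept vectors h_{p,e} and h_{c,e} are already built; the vectors and the threshold value are illustrative, not the paper's.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a publication vector (A = h_{p,e})
    and a concept vector (B = h_{c,e})."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors; the real system scores ~50 billion concept-publication pairs
# and keeps roughly 1 billion above a confidence threshold.
h_pe = np.array([0.2, 0.9, 0.1, 0.4])
h_ce = np.array([0.1, 0.8, 0.0, 0.5])
threshold = 0.7  # illustrative value

score = cosine(h_pe, h_ce)
if score >= threshold:
    print(f"tag publication with concept (confidence {score:.2f})")
```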
Additional resources used in the processing
Wiki word vectors
We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.
Word vectors for 157 languages
We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish. *** This one can be used from Python ***
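Since the notes point out that these vectors can be used from Python, here is a minimal usage sketch with the official fasttext package; the package name, the download helper, and the cc.en.300.bin file name follow fastText's published distribution, but verify against the fastText documentation for your version.

```python
# Requires: pip install fasttext
import fasttext
import fasttext.util

# Download and load the pre-trained Common Crawl + Wikipedia vectors for English.
fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')

vec = model.get_word_vector('algebra')   # 300-dimensional numpy array
print(vec.shape)
print(model.get_nearest_neighbors('algebra')[:3])
```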
Step 3
The concept hierarchy (levels L2 to L5) was built automatically based on the concept-document tagging results. The authors extended the notion of subsumption (a form of co-occurrence) to associate related terms. Term x subsumes y (i.e., x is the parent of y) if y occurs only in a subset of the documents that x occurs in. The subsumption condition can be relaxed to 80%.
The concept co-occurrence calculation is extended by weighting with the concept-document pair's (k) confidence score from the previous step. More formally, a weighted relative coverage score is defined between two concepts i and j. This approach does not enforce a single parent for any FoS, so it results in a directed acyclic graph (DAG) hierarchy.
= Sum the confidence of each concept(i)-publication pair over the intersection of concepts i and j, and divide by the sum of the confidence of every concept-publication pair associated with concept i -> A
= Do the same for concept j (intersection pairs divided by all pairs associated with j) -> B
= Compute the difference A - B. If it is greater than a threshold in [0.2, 0.5], then i is subordinate to j in the hierarchy.
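A small sketch of this weighted relative coverage test as reconstructed from the notes above; the confidence scores and threshold are toy values, and this is not the authors' implementation.

```python
# w[(concept, paper)] = confidence score of that concept-publication pair
# (hypothetical toy values standing in for the Step 2 output).
w = {("i", "d1"): 0.9, ("i", "d2"): 0.8, ("i", "d3"): 0.7,
     ("j", "d1"): 0.9, ("j", "d2"): 0.6, ("j", "d4"): 0.8, ("j", "d5"): 0.7}

def docs(concept):
    """Documents tagged with the given concept."""
    return {d for (c, d) in w if c == concept}

def relative_coverage(i, j):
    """A - B: weighted share of i's documents shared with j,
    minus the weighted share of j's documents shared with i."""
    inter = docs(i) & docs(j)
    a = sum(w[(i, d)] for d in inter) / sum(w[(i, d)] for d in docs(i))
    b = sum(w[(j, d)] for d in inter) / sum(w[(j, d)] for d in docs(j))
    return a - b

threshold = 0.3  # the paper reports a value in [0.2, 0.5]
if relative_coverage("i", "j") > threshold:
    print("i is a child of j in the hierarchy")
```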
Evaluation
= 500 concepts and 500 concept-publication pairs were divided into 5 groups of 100 items each, and each group was assigned to a human judge
= 500 concept-subconcept pairs were divided into 5 groups of 100 pairs, and each group was assigned to 3 human judges; majority voting was used to break ties. The hierarchies showed conflicts between relation types (hyponymy vs. meronymy)
Additional video about the paper -> https://youtu.be/ZnP9afs6-WU
Presentation on this paper, awaiting a date to be delivered to the BioBD group.
It has not been presented yet!