
Article: A Web-scale system for scientific knowledge exploration ... MAG Generation (first version)

Zhihong Shen, Hao Ma, Kuansan Wang:
A Web-scale system for scientific knowledge exploration.  
Association for Computational Linguistics (ACL) 2018: 87-92


  1. identify hundreds of thousands of scientific concepts,
  2. tag these identified concepts to hundreds of millions of scientific publications by leveraging both text and graph structure, and
  3. build a six-level concept hierarchy with a subsumption-based model.

Contribution: the largest cross-domain scientific concept (field-of-study, or FoS) ontology published to date, with more than 200 thousand concepts and over one million relationships.


Scalability: Traditionally, academic discipline and concept taxonomies have been curated manually on a scale of hundreds or thousands, which is insufficient for modeling the richness of academic concepts across all domains. Consequently, the low concept coverage also limits the exploration experience over hundreds of millions of scientific publications.

Step 1

We formulate concept discovery as a knowledge base type prediction problem and use graph link analysis to guide the process. 

Wikipedia articles are the source of concept discovery, and each article is an entity in a general in-house knowledge base (KB). Nineteen top-level ("L0") disciplines (such as physics, medicine) and 294 second-level ("L1") sub-domains (e.g. machine learning, algebra) were previously defined and manually curated by referencing an existing classification (see https://science-metrix.com/en/classification *), and their corresponding Wikipedia entities were retrieved from the KB.

* "A number of approaches have been used to design journal-level taxonomies or ontologies, and the scholarly research and practical application of these systems have revealed their various benefits and limitations. To date, however, no single classification scheme has been widely adopted by the international bibliometric community.

= Input data: 2,000 initial entities in the FoS set

Graph link analysis: To drive the process of exploring new FoS candidates, we apply the intuition that if the majority of an entity's nearest neighbours are FoS, then it is highly likely an FoS as well. To calculate nearest neighbours, a distance measure between two Wikipedia entities is required. We use an effective and low-cost approach based on Wikipedia link analysis to compute the semantic closeness (Milne and Witten, 2008). We label a Wikipedia entity as an FoS candidate if more than K (K in [35, 45]) of its top N (N = 100) nearest neighbours are in the current FoS set.

= For each Wikipedia entity e(W) that does not yet belong to FoS, retrieve its 100 nearest neighbours and check whether, among these neighbouring entities {ve(W)}, between 35 and 45 are already FieldOfStudy entities e(FoS); if so, e(W) becomes a candidate FoS entity. If the candidate's type, according to an in-house type KB (Freebase), is of interest, then e(W) is added to the FoS set.

= Iterative neighbourhood-exploration process, repeated until no new candidate entities are found (!)

= Result: 228 thousand FoS entities
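As a rough illustration of Step 1, here is a minimal sketch of the iterative candidate-discovery loop. The helpers `nearest_neighbours` (top-N semantically closest Wikipedia entities, e.g. via the Milne-Witten link-based measure) and `has_interesting_type` (the type check against the in-house KB / Freebase types) are assumptions standing in for components described in the paper, not its actual code.

```python
# Minimal sketch of the iterative FoS discovery loop (Step 1).
# `nearest_neighbours` and `has_interesting_type` are assumed helpers standing in
# for the Wikipedia link-based similarity and the KB type check described above.

N = 100   # nearest neighbours inspected per entity
K = 40    # acceptance threshold, picked from the [35, 45] range

def discover_fos(seed_fos, wikipedia_entities, nearest_neighbours, has_interesting_type):
    """Grow the FoS set iteratively until no new candidate entities are found."""
    fos = set(seed_fos)          # ~2,000 manually curated L0/L1 seed entities
    while True:
        new_candidates = set()
        for entity in wikipedia_entities - fos:
            neighbours = nearest_neighbours(entity, N)
            overlap = sum(1 for n in neighbours if n in fos)
            # Candidate if more than K of its top-N neighbours are already FoS
            if overlap > K and has_interesting_type(entity):
                new_candidates.add(entity)
        if not new_candidates:
            break
        fos |= new_candidates
    return fos
```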

Step 2

We formulate the concept tagging as a multi-label classification problem; i.e. each publication could be tagged with multiple FoS as appropriate. 

We first define simple representing text (or SRT) and extended representing text (or ERT) as the concept’s and publication’s textual representations.

SRT is the text used to describe the academic entity itself:
- a publishing venue's full name (i.e. the journal name or the conference name)
- the first paragraph of a concept's Wikipedia article
- a publication's textual metadata, such as title, keywords, and abstract

ERT is the extension of SRT that leverages graph structural information to include textual information from neighbouring nodes in MAG.
- a venue's ERT is the concatenation of its own SRT and the SRTs of a sampled subset of its publications
- a concept's ERT is obtained from manually curated concept-venue pairs by aggregating the ERTs of the venues associated with the concept
- a publication's ERT includes the SRTs of its citations and references and the ERT of its linked publishing venue

= a venue's ERT depends on its own SRT and on the SRTs of a sample of its publications

= a concept's ERT depends on the manually defined venue-concept relation and on the ERTs of the venues associated with that concept

= a publication's ERT depends on its own SRT, on the SRTs of its citations and references, and on the ERT of its publishing venue.
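To make the three notes above concrete, here is a hedged sketch of how the ERTs could be assembled from SRTs; the sample size, dictionary layout and helper names are illustrative assumptions, not the paper's implementation.

```python
import random

# Hedged sketch of ERT construction; data structures and sample sizes are
# illustrative assumptions, not the paper's actual pipeline.

def venue_ert(venue, publications_of, srt, sample_size=100):
    """Venue ERT = venue SRT + SRTs of a sampled subset of its publications."""
    pubs = publications_of[venue]
    sample = random.sample(pubs, min(sample_size, len(pubs)))
    return " ".join([srt[venue]] + [srt[p] for p in sample])

def concept_ert(concept, venues_of_concept, venue_erts):
    """Concept ERT = aggregated ERTs of the manually curated venues for the concept."""
    return " ".join(venue_erts[v] for v in venues_of_concept[concept])

def publication_ert(pub, srt, citations, references, venue_of, venue_erts):
    """Publication ERT = own SRT + SRTs of citations/references + venue ERT."""
    parts = [srt[pub]]
    parts += [srt[c] for c in citations.get(pub, [])]
    parts += [srt[r] for r in references.get(pub, [])]
    parts.append(venue_erts[venue_of[pub]])
    return " ".join(parts)
```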

Four types of features are extracted from the text representations: bag-of-words (BoW), bag-of-entities (BoE), embedding-of-words (EoW), and embedding-of-entities (EoE).

BoW is a simplifying representation of a text (such as a sentence or a document) as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

EoW is the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.

= The features place each publication, venue and concept in a multi-dimensional vector space (the number of dimensions depends on the number of words or entities involved)


These features are concatenated to form the vector representation h used in Equations 1 and 2 of the paper.

Weight w is used to discount different neighbours' impact as appropriate. The SRT-based vectors h_p^s (publications) and h_v^s (venues) are initialized from the textual feature vectors, and empirical weight values are adopted to directly compute the ERT-based vectors h_p^e and h_v^e, which keeps the approach scalable.
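Equations 1 and 2 themselves are not reproduced in this post. The sketch below only illustrates the general idea, under the assumption that an ERT-level vector combines the entity's own SRT vector with weighted averages of its neighbours' vectors; the exact formulation is in the paper.

```python
import numpy as np

def concat_features(bow, boe, eow, eoe):
    """Concatenate the four feature types into one vector representation h."""
    return np.concatenate([bow, boe, eow, eoe])

def ert_vector(own_srt_vec, neighbour_groups, weights):
    """Hedged sketch (not the paper's Equations 1-2): combine the entity's own
    SRT vector with weighted averages of its neighbour groups' vectors, e.g.
    sampled publications for a venue, or citations/references plus the venue
    for a publication."""
    h = np.asarray(own_srt_vec, dtype=float).copy()
    for vectors, w in zip(neighbour_groups, weights):
        if vectors:  # skip empty neighbour groups
            h += w * np.mean(np.asarray(vectors, dtype=float), axis=0)
    return h
```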

Cosine similarity between these vector representations is then calculated for about 50 billion concept-publication pairs; close to 1 billion pairs are finally kept, based on a threshold on the resulting confidence score.

= Vector h_p^e plays the role of A and vector h_c^e the role of B in the cosine-similarity computation
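The confidence score for a concept-publication pair is the cosine similarity between the two vectors. Below is a minimal sketch of the scoring and thresholding step; the 0.5 threshold is a placeholder, since the post only states that roughly 1 billion of the ~50 billion pairs survive the cut-off.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vector representations (A and B above)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def tag_publication(h_p_e, concept_vectors, threshold=0.5):
    """Return the concepts whose confidence score for this publication passes
    the threshold (placeholder value; the real cut-off is not given here)."""
    scores = {c: cosine(h_p_e, h_c_e) for c, h_c_e in concept_vectors.items()}
    return {c: s for c, s in scores.items() if s > threshold}
```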


Additional resources used in the processing

Wiki word vectors

We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

Word vectors for 157 languages

We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish. *** This one can be used from Python ***
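As noted, this second resource can be used directly from Python via the `fasttext` package; a small usage sketch (the download writes `cc.en.300.bin` for English):

```python
import fasttext
import fasttext.util

# Download and load the pre-trained English vectors (Common Crawl + Wikipedia).
fasttext.util.download_model('en', if_exists='ignore')   # saves cc.en.300.bin
model = fasttext.load_model('cc.en.300.bin')

vector = model.get_word_vector('ontology')                 # 300-dimensional vector
neighbours = model.get_nearest_neighbors('ontology', k=5)  # closest words
print(vector.shape, neighbours)
```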

Step 3

The concept hierarchy (levels L2 to L5) was built automatically from the concept-document tagging results. The authors extend the notion of subsumption (a form of co-occurrence) to associate related terms: term x subsumes y (i.e. x is the parent of y) if y occurs only in a subset of the documents in which x occurs. In practice the strict subsumption condition is relaxed (e.g. to 80% of y's occurrences).

The concept co-occurrence calculation is extended by weighting each concept-document pair k with its confidence score from the previous step. More formally, a weighted relative coverage score is defined between two concepts i and j. This approach does not enforce a single parent for any FoS, so the result is a directed acyclic graph (DAG) hierarchy.

 

= Sum the similarity of each concept(i)-publication pair over the intersection of concepts i and j, and divide by the sum of the similarities of all concept-publication pairs associated with concept i -> A

= Sum the similarity of each concept(j)-publication pair over the intersection of concepts i and j, and divide by the sum of the similarities of all concept-publication pairs associated with concept j -> B

= Compute the difference between A and B. If it is greater than a threshold in [0.2, 0.5], then i is subordinate to j in the hierarchy.
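Combining the three notes above, here is a minimal sketch of the weighted relative coverage test. The inputs are assumptions about data layout: `docs[c]` is the set of documents tagged with concept c, and `w[c][d]` is the confidence score of the (c, d) pair from the previous step.

```python
def relative_coverage(i, j, docs, w):
    """A minus B: the weighted share of i's documents that also carry j,
    minus the weighted share of j's documents that also carry i."""
    common = docs[i] & docs[j]
    a = sum(w[i][d] for d in common) / sum(w[i][d] for d in docs[i])
    b = sum(w[j][d] for d in common) / sum(w[j][d] for d in docs[j])
    return a - b

def is_child_of(i, j, docs, w, threshold=0.3):
    """Place i under j in the DAG hierarchy if the coverage difference exceeds
    the threshold (picked here from the [0.2, 0.5] range quoted above)."""
    return relative_coverage(i, j, docs, w) > threshold
```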

Evaluation

= 500 concepts and 500 concept-publication pairs were divided into 5 groups of 100 items each, and each group was assigned to a human judge

= 500 concept-subconcept pairs were divided into 5 groups of 100 pairs, and each group was assigned to 3 human judges; majority voting was used to break ties. The hierarchies showed conflicts between relation types (hyponymy, meronymy)

 

 

Additional video about the paper -> https://youtu.be/ZnP9afs6-WU

A presentation about this paper is awaiting a date to be given to the BioBD group

