
Article: A Web-scale system for scientific knowledge exploration ... MAG Generation (first version)

Zhihong Shen, Hao Ma, Kuansan Wang:
A Web-scale system for scientific knowledge exploration.  
Association for Computational Linguistics (ACL) 2018: 87-92


  1. identify hundreds of thousands of scientific concepts,
  2. tag these identified concepts to hundreds of millions of scientific publications by leveraging both text and graph structure, and
  3. build a six-level concept hierarchy with a subsumption-based model.

Contribution: the largest cross-domain scientific concept (field-of-study, FoS) ontology published to date, with more than 200 thousand concepts and over one million relationships.


Scalability: Traditionally, academic discipline and concept taxonomies have been curated manually on a scale of hundreds or thousands, which is insufficient for modeling the richness of academic concepts across all domains. This low concept coverage, in turn, limits the exploration experience over hundreds of millions of scientific publications.

Step 1

We formulate concept discovery as a knowledge base type prediction problem and use graph link analysis to guide the process. 

Wikipedia articles are the source of concept discovery, and each article is an entity in a general in-house knowledge base (KB). 19 top-level ("L0") disciplines (such as physics, medicine) and 294 second-level ("L1") sub-domains (such as machine learning, algebra) were previously defined, manually curated by referencing an existing classification (see https://science-metrix.com/en/classification *), and their corresponding Wikipedia entities were obtained in the KB.

* "A number of approaches have been used to design journal-level taxonomies or ontologies, and the scholarly research and practical application of these systems have revealed their various benefits and limitations. To date, however, no single classification scheme has been widely adopted by the international bibliometric community.

= Input data: 2,000 initial FoS entities

Graph link analysis: To drive the process of exploring new FoS candidates, we apply the intuition that if the majority of an entity's nearest neighbours are FoS, then it is highly likely an FoS as well. To calculate nearest neighbours, a distance measure between two Wikipedia entities is required. We use an effective and low-cost approach based on Wikipedia link analysis to compute semantic closeness (Milne and Witten, 2008). We label a Wikipedia entity as an FoS candidate if more than K (K in [35, 45]) of its top N (N = 100) nearest neighbours are already in the current FoS set.

= For each Wikipedia entity e(W) that does not yet belong to the FoS set, retrieve its 100 nearest neighbours and check whether, among these neighbouring entities {ve(W)}, between 35 and 45 are already FieldOfStudy entities e(FoS); if so, e(W) becomes a candidate FoS entity. If the candidate entity's type, according to an in-house type KB (Freebase), is of interest, then e(W) is added to the FoS set.

= Iterative neighbourhood-exploration process that repeats until no new candidate entities are found (!); see the sketch below.

= Result: 228 thousand FoS entities
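
A minimal sketch of this iterative neighbour-voting loop, assuming a precomputed semantic-relatedness neighbour lookup (Milne-Witten style) and a type lookup against the in-house KB; all function names and parameter values are illustrative, not the authors' code:

# Sketch of the iterative FoS candidate discovery (Step 1).
# nearest_neighbours(entity, n): the n semantically closest Wikipedia entities
#   (e.g. via Milne-Witten link-based relatedness) -- assumed precomputed.
# entity_type(entity): the entity's type in an in-house type KB (illustrative).

def discover_fos(seed_fos, all_entities, nearest_neighbours, entity_type,
                 k=40, n=100, allowed_types=frozenset({"field_of_study"})):
    fos = set(seed_fos)
    while True:
        new_candidates = set()
        for e in set(all_entities) - fos:
            neighbours = nearest_neighbours(e, n)
            votes = sum(1 for v in neighbours if v in fos)
            # candidate if more than K of its top-N neighbours are already FoS
            if votes > k and entity_type(e) in allowed_types:
                new_candidates.add(e)
        if not new_candidates:   # stop when no new candidates are found
            break
        fos |= new_candidates
    return fos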

Step 2

We formulate the concept tagging as a multi-label classification problem; i.e. each publication could be tagged with multiple FoS as appropriate. 

We first define simple representing text (or SRT) and extended representing text (or ERT) as the concept’s and publication’s textual representations.

SRT is the text used to describe the academic entity itself:
- a publishing venue's full name (i.e. the journal name or the conference name)
- the first paragraph of a concept's Wikipedia article
- textual metadata of a publication, such as its title, keywords, and abstract

ERT is the extension of SRT; it leverages the graph structure to include textual information from the entity's neighbouring nodes in MAG.
- a venue's ERT is the concatenation of the venue SRT with the SRTs of a sampled subset of publications from that venue
- concept-venue pairs are manually curated, and the ERTs of the venues associated with a given concept are aggregated to obtain that concept's ERT
- a publication's ERT includes the SRTs of its citations and references and the ERT of its linked publishing venue

= a publishing venue's ERT depends on its own SRT and on the SRTs of a sample of its publications

= a concept's ERT depends on the manually defined venue-concept relation and on the ERTs of the venues associated with that concept

= a publication's ERT depends on its own SRT, on the SRTs of its citations and references, and on the ERT of its publishing venue (see the sketch below).
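
A minimal sketch of how these ERTs could be assembled by concatenating SRTs along the graph, following the three notes above (sample sizes and helper names are illustrative assumptions):

import random

# Sketch of ERT construction (Step 2). SRT strings are assumed available per
# entity; links (venue -> publications, concept -> venues, publication ->
# citations/references/venue) are assumed given as plain Python collections.

def venue_ert(venue_srt, publication_srts, sample_size=100):
    # venue ERT = venue SRT + SRTs of a sampled subset of its publications
    sample = random.sample(publication_srts, min(sample_size, len(publication_srts)))
    return " ".join([venue_srt] + sample)

def concept_ert(venue_erts_of_concept):
    # concept ERT = aggregation of the ERTs of its manually curated venues
    return " ".join(venue_erts_of_concept)

def publication_ert(pub_srt, citation_srts, reference_srts, venue_ert_text):
    # publication ERT = own SRT + citation/reference SRTs + venue ERT
    return " ".join([pub_srt] + citation_srts + reference_srts + [venue_ert_text])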

Four types of features are extracted from the text representations: bag-of-words (BoW), bag-of-entities (BoE), embedding-of-words (EoW), and embedding-of-entities (EoE).

BoW is a simplifying representation of a text (such as a sentence or a document) as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

EoW is the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.

= The features represent each publication, venue, and concept in a multidimensional vector space (the number of dimensions depends on the number of words or entities involved)


These features are concatenated into the vector representation h used in Equations 1 and 2.

Weight w is used to discount different neighbours' impact as appropriate. We initialize h_p^s and h_v^s (the SRT vectors of publications and venues) from the textual feature vectors and adopt empirical weight values to directly compute h_p^e and h_v^e (the corresponding ERT vectors), which keeps the computation scalable.
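
The notes do not reproduce Equations 1 and 2; as a hedged sketch of their general shape, an entity's ERT vector can be seen as its SRT vector plus a weighted combination of its neighbours' SRT vectors, with an empirical weight w (the value below is illustrative):

import numpy as np

# General shape only (not the paper's exact Equations 1 and 2):
# ERT vector = SRT vector + w * (average of the neighbours' SRT vectors).

def ert_vector(h_srt, neighbour_srt_vectors, w=0.5):
    # w is an empirical weight discounting the neighbours' impact (illustrative)
    if len(neighbour_srt_vectors) == 0:
        return h_srt
    return h_srt + w * np.mean(neighbour_srt_vectors, axis=0)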

Cosine similarity between these vector representations is computed for about 50 billion concept-publication pairs; close to 1 billion pairs are finally kept, based on a threshold on the confidence score.

= The vector h_p^e (publication) is A and the vector h_c^e (concept) is B in the cosine similarity (sketched below).
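
A minimal sketch of the tagging step under these definitions: concatenate the four feature blocks into one vector per concept and per publication, score each pair by cosine similarity, and keep pairs above a confidence threshold (the threshold value and the precomputed feature blocks are assumptions):

import numpy as np

# Sketch of concept-publication tagging (Step 2): cosine similarity between
# concatenated feature vectors, thresholded into a final tag set.

def representation(feature_blocks):
    # h = concatenation of the BoW, BoE, EoW and EoE blocks of one entity
    return np.concatenate(feature_blocks)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def tag_publication(h_publication, concept_vectors, threshold=0.5):
    # threshold is illustrative; in the paper close to 1 billion of the
    # roughly 50 billion concept-publication pairs survive the cut-off
    tags = []
    for concept, h_concept in concept_vectors.items():
        score = cosine(h_publication, h_concept)
        if score >= threshold:
            tags.append((concept, score))
    return sorted(tags, key=lambda t: -t[1])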


Additional resources used in the processing

Wiki word vectors

We are publishing pre-trained word vectors for 294 languages, trained on Wikipedia using fastText. These vectors in dimension 300 were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

Word vectors for 157 languages

We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish. *** This one can be used from Python ***
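
A small usage sketch of these pre-trained vectors from Python, assuming the fasttext package and a locally downloaded model file (the file name is illustrative):

import fasttext

# Load a pre-trained 300-dimensional fastText model (e.g. the Common Crawl +
# Wikipedia vectors); the path is illustrative and must point to a local file.
model = fasttext.load_model("cc.en.300.bin")

word_vec = model.get_word_vector("ontology")                      # 300-d vector
text_vec = model.get_sentence_vector("scientific knowledge exploration")
print(word_vec.shape, text_vec.shape)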

Step 3

The concept hierarchy (levels L2 to L5) was built automatically based on the concept-document tagging results. The authors extend the notion of subsumption (a form of co-occurrence) to associate related terms: term x subsumes y (i.e. x is the parent of y) if y occurs only in a subset of the documents in which x occurs. The subsumption condition can be relaxed to 80%.

We extend the concept co-occurrence calculation, weighting it with each concept-document pair's (k) confidence score from the previous step. More formally, we define a weighted relative coverage score between two concepts i and j. This approach does not enforce a single parent for any FoS, so it results in a directed acyclic graph (DAG) hierarchy.

 

= Sum the similarity of each concept(i)-publication pair over the documents in the intersection of concepts i and j, and divide by the sum of the similarity of every concept(i)-publication pair associated with concept i -> A

= Sum the similarity of each concept(j)-publication pair over the documents in the intersection of concepts i and j, and divide by the sum of the similarity of every concept(j)-publication pair associated with concept j -> B

= Compute the difference A - B. If it is greater than a threshold in [0.2, 0.5], then i is subordinate to j in the hierarchy (see the sketch below).
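
A minimal sketch of this weighted relative coverage test, following the two notes above (the confidence weights come from Step 2; the threshold value is an illustrative choice within the stated range):

# Sketch of the subsumption test (Step 3): weighted relative coverage
# between concepts i and j, using per-pair confidence scores from Step 2.
# docs[c]        : set of documents tagged with concept c
# weight[(k, c)] : confidence score of the (document k, concept c) pair

def relative_coverage(i, j, docs, weight):
    inter = docs[i] & docs[j]
    a = sum(weight[(k, i)] for k in inter) / sum(weight[(k, i)] for k in docs[i])
    b = sum(weight[(k, j)] for k in inter) / sum(weight[(k, j)] for k in docs[j])
    return a - b

def is_child_of(i, j, docs, weight, threshold=0.3):
    # if i's documents are covered by j far more than the reverse, i goes under j;
    # multiple parents are allowed, so the hierarchy is a DAG
    return relative_coverage(i, j, docs, weight) > threshold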

Evaluation

= 500 concepts and 500 concept-publication pairs were split into 5 groups of 100 items, each assigned to a human judge

= 500 parent-child concept pairs were split into 5 groups of 100 pairs, and each group was assigned to 3 human judges. Majority voting was used to resolve disagreements. The hierarchies showed conflicts between relation types (hyponymy, meronymy).

 

 

Additional video about the paper -> https://youtu.be/ZnP9afs6-WU

Presentation on this paper, awaiting a date to be given to the BioBD group

