
Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding - Article Reading Notes


Chenyan Xiong, Russell Power, and Jamie Callan. 2017. Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1271–1279. DOI: https://doi.org/10.1145/3038912.3052558

Abstract

Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, Explicit Semantic Ranking (ESR) represents queries and documents in the entity space and ranks them based on their semantic connections from their knowledge graph embedding. Experiments demonstrate ESR's ability to improve Semantic Scholar's online production system, especially on hard queries where word-based ranking fails.

Introduction

S2's current production ranking system is based on the word-based model in ElasticSearch, which matches query terms with various parts of a paper, combined with document features such as citation count and publication time in a learning-to-rank architecture [15].

Analysis of S2’s query logs found that a large fraction of the online traffic is ad-hoc queries about computer science concepts or research topics. The information needs behind such queries sometimes are hard for term-frequency based ranking models to fulfill.

This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. 

We first build an academic knowledge graph using S2’s corpus and Freebase.  The knowledge graph includes CONCEPT ENTITIES, their descriptions, context correlations, relationships with authors and venues, and embeddings trained from the graph structure. We apply this knowledge graph and embeddings to our ranking task.

Semantic relatedness between query and document entities is computed in the embedding space, which provides a soft matching between related entities. ESR uses a two-stage pooling to generalize these entity-based matches to query-document ranking features and uses a learning to rank model to combine them.

Embeddings of entities from the knowledge graph can also be leveraged to provide a soft match signal, allowing ranking of documents that are semantically related but do not match the exact query terms. 

Further analysis confirms that both exact match and soft match in the entity space provide effective and novel ranking signals. These signals are successfully utilized by ESR’s ranking framework, and greatly improve the queries that S2’s word-based ranking system finds hard to deal with.

Conclusion

In ESR, queries and documents are represented in the entity space using their annotations, and the ranking is defined by their semantic relatedness described by their entities' connections, in an embedding, pooling, and ranking framework.

Perhaps surprisingly, we found that using Freebase entities was more effective than keyphrases automatically extracted from S2's corpus. Although Freebase is considered general-purpose, it provides surprisingly good coverage of the entities in our domain - computer science.

This paper presents a new method of using knowledge graphs to improve the ranking of academic search. Although the details of this work are specific to Semantic Scholar (S2), the techniques and lessons learned are general and can be applied to other full-text search engines.

Related Work

Soft match is a widely studied topic in information retrieval, mostly in word-based search systems.  

Translation models treat ranking as translations between query terms and document terms using a translation matrix [3].  

Topic modeling techniques have been used to first map the query and document into a latent space and then match them in it [24].

Word embedding and deep learning techniques have been studied recently. One possibility is to first build query and document representations using their words’ embeddings heuristically, and then match them in the embedding space [22].  

The DSSM model directly trains a representation model using deep neural networks, which learns distributed representations for the query and document, and matches them using the learned representations [13]. 

A more recent method, DRMM, models the query-document relevance with a neural network built upon the word-level translation matrix [11]. The translation matrix is calculated with pretrained word embeddings. The word-level translation scores are summarized by bin-pooling (histograms) and then used by the ranking neural network.

The recent development of knowledge graphs has motivated many new techniques in utilizing knowledge bases for text-centric information retrieval [9]. An intuitive way is to use the textual attributes of related entities to enrich the query representation.

Another way to utilize knowledge graphs in search is to use the entities as a source of additional connections from query to documents.

A more recent trend is to build entity-based text representations and improve word-based ranking with entity-based retrieval models. Entity-based language model uses the surface forms and entity names of document annotations to build an entity-oriented language model [12]. It successfully improves the retrieval accuracy when combined with typical word-based retrieval. A similar idea is used in the bag-of-entities model, in which the query and documents are represented by their entity annotations [27]. Boolean and frequency-based models in the entity space can improve the ranking accuracy of word-based retrieval. ESR further explores the entity-based representation’s potential with knowledge graph embeddings and soft matches, and uses the knowledge graph in a new domain: academic search.

QUERY LOG ANALYSIS

More than half of S2's head traffic is research-concept related: searching for an academic concept (39%) or exploring research topics (15%) (e.g. 'forensics and machine learning'). About one-third is searching for an author (36%). ...
The queries are usually short, on average about 2-3 terms, and their information needs are not as clear as those of author, venue, or paper-title queries. They are very typical ad-hoc search queries, in a specific domain: computer science.

The first part of the difficulty is noise in the exact-match signal. Due to language variety, a query term may not appear frequently enough in the relevant documents (vocabulary mismatch), for example, 'softmax categorization' versus 'softmax classification'. The segmentation of query concepts is also a problem. For example, the query 'natural language interface' should be matched as a whole because it is one informative unit, but the ranking model matches all words and n-grams of the query, and the result is dominated by the popular phrase 'natural language'.

The second part is more subtle as it is about the meaning of the query concept. There can be multiple aspects of the query concept, and a paper that mentions it the most frequently may not be about the most important aspect. For example, 'ontology construction' is about how to construct ontologies in general, but may not be about how a specific ontology is constructed; 'dynamic programming segmentation' is about word segmentation in which dynamic programming is essential, but is not about image segmentation. To conclude, our analysis finds that there is a gap between query-documents' textual similarity and their semantic relatedness. Ranking systems that rely solely on term-level statistics have few principled ways to resolve this discrepancy.

EXPLICIT SEMANTIC RANKING

Knowledge Graph

In this work, we build a standard knowledge graph G with concept entities (E) and edges (predicates P and tails T) by harvesting S2’s corpus and Freebase, and use a popularity based entity linking system to link query and documents.

CONCEPT ENTITIES in our knowledge graph can be collected from two different sources: corpus-extracted (Corpus) and Freebase. Corpus entities are keyphrases automatically extracted from S2's corpus. Keyphrase extraction is a widely studied task that aims to find representative keyphrases for a document, for example, to approximate the manually assigned keywords of a paper. ... The second source of entities is Freebase. Despite being a general-domain knowledge base, our manual examination found that Freebase has rather good coverage of the computer science concept entities in S2's head queries.

Edges in our knowledge graph include two parts: Predicates P and tails T. From Freebase and the CMNS annotated S2 corpus, the following four types of edges are harvested:
Author edges link an entity to an author if the author has a paper which mentioned the entity in the title;
Context edges link an entity to another entity if the two co-occur in a window of 20 words more than 5 times in the corpus;
Desc edges link an entity to words in its Freebase description (if one exists); and
Venue edges link an entity to a venue if the entity appears in the title of a paper published in the venue.
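The Context-edge rule above is concrete enough to sketch in code. The snippet below is an illustrative simplification (not the authors' pipeline): it treats entities as single lowercase tokens and counts pairs whose mentions fall within a 20-word window, keeping pairs seen more than 5 times.

```python
from collections import Counter
from itertools import combinations

def harvest_context_edges(docs, entities, window=20, min_count=5):
    """Sketch of Context-edge harvesting: link two entities if their
    mentions co-occur within a `window`-word span more than `min_count`
    times across the corpus (single-token mentions only)."""
    pair_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        # positions of entity mentions in this document
        positions = [(i, t) for i, t in enumerate(tokens) if t in entities]
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and abs(i - j) < window:
                pair_counts[tuple(sorted((a, b)))] += 1
    return {pair for pair, count in pair_counts.items() if count > min_count}
```

In the real system the mentions come from the CMNS entity annotations rather than raw token matching, and multi-word surface forms would need span-level linking.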

The knowledge graph G contains two types of Concept Entities: Corpus-extracted (Corpus) and Freebase, and four types of edges: Author, Context, Desc(ription) and Venue. 

Embeddings of our entities are trained based on their neighbors in the knowledge graph. The graph structure around an entity conveys the semantics of this entity. Intuitively, entities with similar neighbors are usually related. We use entity embeddings to model such semantics.

Ranking Model

 

Given a query q, and a set of candidate documents D = {d1,...,dn}, the goal of ESR is to find a ranking function
f(q,d|G), that better ranks D using the explicit semantics stored in the knowledge graph G. The explicit semantics include entities (E = {e1,...,e|E|}) and edges (predicates P and tails T).

ESR represents query and documents by their bag-of-entities constructed using their entity annotations (Entity Linking)

ESR matches query and documents’ entity representations using the knowledge graph embedding

ESR first calculates a query-document entity translation matrix. Each element in the matrix is the connection strength between a query entity ei and a document entity ej, calculated by their embeddings' cosine similarity. Scores less than 1 identify related entities as a function of the knowledge graph structure and provide soft match signals.

Each entry of the entity similarity matrix represents the similarity between entity ei belonging to the query (Eq) and entity ej of the document (Ed).
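The translation matrix is just pairwise cosine similarity between the two sets of embeddings. A minimal sketch (function name and the toy inputs are mine, not the paper's):

```python
import numpy as np

def entity_translation_matrix(query_emb, doc_emb):
    """Query-document entity translation matrix: entry (i, j) is the
    cosine similarity between query entity e_i and document entity e_j
    in the embedding space."""
    # normalize rows, then a dot product gives cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    return q @ d.T  # shape: |Eq| x |Ed|
```

An exact entity match yields a 1 on the corresponding cell; related-but-different entities yield scores below 1, which is exactly the soft-match signal described above.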

Then ESR performs two pooling steps

The first step is a max-pooling along the query dimension ... The max-pooling matches each document entity to its closest query entity using embeddings, which is the exact-match if one exists. Its score describes how closely related a document entity is to the query. ... is used to match each document entity to its closest query entity.

The max-pooled vector, S(d), has the same dimensionality as the document's entity vector Ed.

Each entry corresponds to the maximum of the similarities between entity ej of Ed and the entities ei of Eq.

The second step is a FIVE bin-pooling (histogram) to count the matches at different strengths ...
The bin-pooling counts the number of document entities with different connection strengths to the query. ... is used to summarize the matching scores of each document entity into match frequencies at different strengths.

FIVE bins: [1, 1], ]1, 0.75], ]0.75, 0.5], ]0.5, 0.25], ]0.25, 0[

(st_k, ed_k) is the range for the kth bin. Bk is the number of document entities whose scores fall into this bin.

The final score is a linear combination, f(q, d | G) = w0 · fS2(q, d) + Wᵀ · V(q, d), where fS2(q, d) is the score from S2's production system, V(q, d) collects the bin features, w0 and W are the parameters to learn, and f(q, d | G) is the final ranking score.

EXPERIMENTAL METHODOLOGY

The queries of the benchmark are sampled from S2's query log in the first six months of 2016.

Baselines: The baseline is the Semantic Scholar (S2) production system on July 1st, 2016. ... We also include BM25-F and tf.idf-F for reference. The BM25 and vector space models are applied to the paper's title, abstract, and body fields. Their parameters (field weights) are learned using the same learning-to-rank model as our method in the same cross-validation setting.

All other settings follow previous standards. The embedding dimensionality is 300. The five bins and three paper fields (title, abstract, and body) generate 15 features, which are combined with S2's original score using linear RankSVM [14]. All models' and baselines' parameters are trained and evaluated using a 10-fold cross validation with 80% train, 10% development and 10% test in each fold. The hyper-parameter `c' of RankSVM is selected from {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} using the development part of each fold.
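The 80/10/10 fold rotation can be sketched as below. Note that which chunk serves as development vs. test in each fold is an assumption of mine; the paper only states the split proportions.

```python
def ten_fold_splits(queries, k=10):
    """Sketch of a k-fold protocol with 80% train, 10% dev, 10% test
    per fold; the dev/test rotation scheme is an assumption."""
    n = len(queries)
    chunks = [queries[i * n // k:(i + 1) * n // k] for i in range(k)]
    for i in range(k):
        test = chunks[i]
        dev = chunks[(i + 1) % k]          # next chunk plays development set
        train = [q for j, c in enumerate(chunks)
                 if j not in (i, (i + 1) % k) for q in c]
        yield train, dev, test
```

The hyper-parameter `c' would then be chosen per fold by training on `train` and picking the best NDCG on `dev`, reporting results on `test`.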

EVALUATION RESULTS

ESR provides a further jump over the production system by at least 5% on all evaluation metrics and with all edge types. With Venue, the improvements are consistently higher than 10%. On the earlier positions, which are more important for user satisfaction, ESR's performance is also stronger, with about 2-3% more improvement on NDCG@1 and NDCG@5 than on NDCG@20.

On the queries where S2's word-based ranking fails, for example, those whose NDCG < 0.3, the semantics from the knowledge graph improve many of them by large margins. The improvements are also robust. More than half of the queries are improved with big wins. Fewer queries are hurt, and the loss is usually small.

ESR matches query and documents on their entity representations using the semantics from the knowledge graph. The semantic contribution to ESR's ranking can come from two aspects: exact match with smart phrasing and soft match with knowledge graph embedding.

Entities' exact match information (the first bin) provides about 6% gains over S2, with Context and Venue.

The soft match information (later bins) contributes approximately another 5% improvement. The soft-match bins record how many document entities are related to the query entities at certain strengths. Intuitively, soft match should contribute for all queries, as knowing more about the document entities should always help. But our examination finds this information is more effective on some queries.

This experiment helps us understand why the explicit semantics in knowledge graphs is useful for ranking. By knowing which entity a phrase refers to and whether different phrases are about the same thing, the exact match signal is polished. By knowing the relatedness between query entities and document entities through their embeddings, additional soft match connections that describe query-document relatedness with semantics are incorporated.

ESR captures the semantic relatedness between entities by using their distance in the embedding space. An advantage of using embeddings compared with the raw knowledge graph edges is their efficiency, which is tightly constrained in online search engines. By embedding the neighbors of an entity into a dense continuous vector, at run time ESR avoids dealing with the graph structure of the knowledge graph, which is sparse, of varying length, and much more expensive to handle than fixed-length dense vectors. However, does ESR sacrifice effectiveness for efficiency? The fourth experiment studied the influence of embeddings on ESR's effectiveness by comparing it with the ranking performance of using raw knowledge graph edges. In this experiment, a discrete vector representation is formed for each entity for each edge type. Each of the vector's dimensions is a neighbor of the entity; its weight is the frequency with which the two are connected. The Raw variants of ESR are then formed with the knowledge graph embedding replaced by this discrete entity representation and everything else kept the same.
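The discrete "Raw" representation described above amounts to a sparse neighbor-frequency vector per entity. A minimal sketch (function names and the toy edge list are mine, not the paper's):

```python
from collections import Counter

def raw_entity_vector(entity, edges):
    """Discrete 'Raw' representation: one dimension per neighbor,
    weighted by how often the two are connected in the graph."""
    return Counter(tail for head, tail in edges if head == entity)

def sparse_cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sum(x * x for x in u.values()) ** 0.5
    norm_v = sum(x * x for x in v.values()) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Swapping this similarity into the translation matrix in place of embedding cosine gives the Raw variant; the comparison below shows why the dense embedding, which smooths these sparse counts, tends to work better.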

The results show that ESR actually benefits from the knowledge graph embedding; the Raw methods almost always perform worse than the corresponding ESR version. We believe that the advantage of knowledge graph embedding is similar to that of word embedding [18]: the embedding better captures the semantics by factoring the raw sparse data into a smooth low-dimensional space.

We believe ESR's two-stage pooling provides the right balance of model complexity and abstraction granularity, given the size of our training data.

 

 
