SEMANTIC SCHOLAR
Chenyan Xiong, Russell Power, and Jamie Callan. 2017. Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1271–1279. DOI:https://doi.org/10.1145/3038912.3052558
Abstract
Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections derived from knowledge graph embeddings. Experiments demonstrate ESR’s ability to improve Semantic Scholar’s online production system, especially on hard queries where word-based ranking fails.
Introduction
Semantic Scholar’s (S2’s) current production ranking system is based on the word-based model in Elasticsearch, which matches query terms with various parts of a paper, combined with document features such as citation count and publication time in a learning-to-rank architecture [15].
Analysis of S2’s query logs found that a large fraction of the online traffic consists of ad-hoc queries about computer science concepts or research topics. The information needs behind such queries are sometimes hard for term-frequency-based ranking models to fulfill.
This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding.
We first build an academic knowledge graph using S2’s corpus and Freebase. The knowledge graph includes CONCEPT ENTITIES, their descriptions, context correlations, relationships with authors and venues, and embeddings trained from the graph structure. We apply this knowledge graph and embeddings to our ranking task.
Semantic relatedness between query and document entities is computed in the embedding space, which provides a soft matching between related entities. ESR uses a two-stage pooling to generalize these entity-based matches to query-document ranking features and uses a learning to rank model to combine them.
Embeddings of entities from the knowledge graph can also be leveraged to provide a soft match signal, allowing ranking of documents that are semantically related but do not match the exact query terms.
Further analysis confirms that both exact match and soft match in the entity space provide effective and novel ranking signals. These signals are successfully utilized by ESR’s ranking framework and greatly improve performance on the queries that S2’s word-based ranking system finds hard to deal with.
Conclusion
In ESR, queries and documents are represented in the entity space using their annotations, and the ranking is defined by their semantic relatedness described by their entities' connections, in an embedding, pooling, and ranking framework.
Perhaps surprisingly, we found that using Freebase entities was more effective than keyphrases automatically extracted from S2's corpus. Although Freebase is considered general-purpose, it provides remarkably good coverage of the entities in our domain, computer science.
This paper presents a new method of using knowledge graphs to improve ranking in academic search. Although the details of this work are specific to Semantic Scholar (S2), the techniques and lessons learned are general and can be applied to other full-text search engines.
Related Work
Soft match is a widely studied topic in information retrieval, mostly in word-based search systems.
Translation models treat ranking as translations between query terms and document terms using a translation matrix [3].
Topic modeling techniques have been used to first map the query and document into a latent space, and then match them in that space [24].
Word embedding and deep learning techniques have been studied recently. One possibility is to first build query and document representations using their words’ embeddings heuristically, and then match them in the embedding space [22].
The DSSM model directly trains a representation model using deep neural networks, which learns distributed representations for the query and document, and matches them using the learned representations [13].
A more recent method, DRMM, models the query-document relevance with a neural network built upon the word-level translation matrix [11]. The translation matrix is calculated with pretrained word embeddings. The word-level translation scores are summarized by bin-pooling (histograms) and then used by the ranking neural network.
The recent development of knowledge graphs has motivated many new techniques in utilizing knowledge bases for text-centric information retrieval [9]. An intuitive way is to use the textual attributes of related entities to enrich the query representation.
Another way to utilize knowledge graphs in search is to use the entities as a source of additional connections from query to documents.
A more recent trend is to build entity-based text representations and improve word-based ranking with entity-based retrieval models. The entity-based language model uses the surface forms and entity names of document annotations to build an entity-oriented language model [12]. It successfully improves retrieval accuracy when combined with typical word-based retrieval. A similar idea is used in the bag-of-entities model, in which the query and documents are represented by their entity annotations [27]. Boolean and frequency-based models in the entity space can improve the ranking accuracy of word-based retrieval. ESR further explores the entity-based representation’s potential with knowledge graph embeddings and soft matches, and uses the knowledge graph in a new domain: academic search.
QUERY LOG ANALYSIS
More than half of S2’s head traffic is related to research concepts: searching for an academic concept (39%) or exploring research topics (15%), e.g. 'forensics and machine learning'. About one-third is searching for an author (36%). ...
The queries are usually short, on average about 2-3 terms, and their information needs are not as clear as those of author, venue, or paper-title queries. They are very typical ad-hoc search queries, in a specific domain: computer science.
The first part of the difficulty is noise in the exact-match signal. Due to language variety, a query term may not appear frequently enough in the relevant documents (vocabulary mismatch), for example, 'softmax categorization' versus 'softmax classification'. The segmentation of query concepts is also a problem. For example, the query 'natural language interface' should be treated as one informative unit, but the ranking model matches all words and n-grams of the query, so the results are dominated by the popular phrase 'natural language'.
The second part is more subtle, as it concerns the meaning of the query concept. A query concept can have multiple aspects, and a paper that mentions it most frequently may not be about its most important aspect. For example, 'ontology construction' is about how to construct ontologies in general, but may not be about how a specific ontology is constructed; 'dynamic programming segmentation' is about word segmentation in which dynamic programming is essential, but is not about image segmentation. To conclude, our analysis finds that there is a gap between the textual similarity and the semantic relatedness of queries and documents. Ranking systems that rely solely on term-level statistics have few principled ways to resolve this discrepancy.
EXPLICIT SEMANTIC RANKING
Knowledge Graph
In this work, we build a standard knowledge graph G with concept entities (E) and edges (predicates P and tails T) by harvesting S2’s corpus and Freebase, and use a popularity-based entity linking system to link queries and documents.
CONCEPT ENTITIES in our knowledge graph are collected from two different sources: corpus-extracted (Corpus) and Freebase. Corpus entities are keyphrases automatically extracted from S2’s corpus. Keyphrase extraction is a widely studied task that aims to find representative keyphrases for a document, for example, to approximate the manually assigned keywords of a paper. ... The second source of entities is Freebase. Although Freebase is a general-domain knowledge base, our manual examination found that it has rather good coverage of the computer science concept entities in S2’s head queries.
Edges in our knowledge graph include two parts: predicates P and tails T. From Freebase and the CMNS-annotated S2 corpus, the following four types of edges are harvested:
• Author edges link an entity to an author if the author has a paper which mentioned the entity in the title;
• Context edges link an entity to another entity if the two co-occur in a window of 20 words more than 5 times in the corpus (see the sketch after this list);
• Desc edges link an entity to words in its Freebase description (if one exists); and
• Venue edges link an entity to a venue if the entity appears in the title of a paper published in the venue.
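To make the Context edges concrete, here is a minimal sketch of the co-occurrence counting, assuming each document's entity annotations come as (word position, entity id) pairs; the function and variable names are illustrative, while the 20-word window and the threshold of 5 come from the definition above.

```python
from collections import Counter
from itertools import combinations

WINDOW = 20    # co-occurrence window, in words (from the edge definition)
MIN_COUNT = 5  # a Context edge needs more than 5 co-occurrences

def context_edges(docs):
    """docs: iterable of [(word_position, entity_id), ...] per document.
    Returns the set of entity pairs connected by a Context edge."""
    counts = Counter()
    for annotations in docs:
        for (pos_i, e_i), (pos_j, e_j) in combinations(sorted(annotations), 2):
            if e_i != e_j and abs(pos_i - pos_j) <= WINDOW:
                counts[frozenset((e_i, e_j))] += 1
    return {pair for pair, n in counts.items() if n > MIN_COUNT}
```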
The knowledge graph G contains two types of Concept Entities: Corpus-extracted (Corpus) and Freebase, and four types of edges: Author, Context, Desc(ription) and Venue.
Embeddings of our entities are trained based on their neighbors in the knowledge graph. The graph structure around an entity conveys the semantics of that entity: intuitively, entities with similar neighbors are usually related. We use entity embeddings to model such semantics.
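As a hedged illustration of this idea, such embeddings can be trained skip-gram style over (entity, neighbor) pairs drawn from the graph. The use of gensim and the pair-per-sentence input format are assumptions of this sketch; the 300 dimensions match the experimental setup below.

```python
from gensim.models import Word2Vec

# Each training "sentence" is an (entity, neighbor) pair from the graph,
# e.g. ("convolutional_neural_network", "venue:CVPR").
# edges is assumed to be an iterable of (entity_id, neighbor_id) tuples.
def train_entity_embeddings(edges, dim=300):
    pairs = [[head, tail] for head, tail in edges]
    model = Word2Vec(pairs, vector_size=dim, window=1, min_count=1, sg=1)
    return model.wv  # maps an entity id to its dense vector

# emb = train_entity_embeddings(graph_edges)
# vec = emb["softmax"]  # hypothetical entity id
```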
Ranking Model
Given a query q and a set of candidate documents D = {d1, ..., dn}, the goal of ESR is to find a ranking function f(q, d|G) that better ranks D using the explicit semantics stored in the knowledge graph G. The explicit semantics include entities (E = {e1, ..., e|E|}) and edges (predicates P and tails T).
ESR represents the query and documents by bag-of-entities vectors constructed from their entity annotations (entity linking). ESR then matches the query's and documents' entity representations using the knowledge graph embedding.
ESR first calculates a query-document entity translation matrix. Each element of the matrix is the connection strength between a query entity ei and a document entity ej, calculated as the cosine similarity of their embeddings. Scores less than 1 identify entities that are related through the knowledge graph structure and provide soft-match signals.
Each entry of the entity similarity matrix is the similarity between a query entity ei (from Eq) and a document entity ej (from Ed).
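A minimal numpy sketch of this matrix, assuming the query and document entities have already been mapped to their embedding vectors (the variable names are illustrative):

```python
import numpy as np

def translation_matrix(q_emb, d_emb):
    """q_emb: |Eq| x dim array of query-entity embeddings.
    d_emb: |Ed| x dim array of document-entity embeddings.
    Returns the |Eq| x |Ed| matrix of cosine similarities."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    return q @ d.T
```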
Then ESR performs two pooling steps. The first step is max-pooling along the query dimension ... The max-pooling matches each document entity to its closest query entity in the embedding space, which is the exact match if one exists. Its score describes how closely related a document entity is to the query.
The max-pooled vector S(d) has the same dimension as the document's entity set Ed; each entry is the maximum of the similarities between the document entity ej in Ed and the query entities ei in Eq.
The second step is bin-pooling (a histogram) with five bins, which counts the matches at different strengths ...
The bin-pooling counts the number of document entities at each connection strength to the query, summarizing the matching scores of the document entities into match frequencies at different strengths.
The five bins: [1, 1], [0.75, 1), [0.5, 0.75), [0.25, 0.5), (0, 0.25).
[st_k, ed_k] is the range of the k-th bin, and B_k is the number of document entities whose scores fall into that bin.
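Putting the two pooling steps together, here is a small self-contained numpy sketch; the bin boundaries follow the list above, and the names are illustrative:

```python
import numpy as np

# Bin ranges from the list above: exact match, then weakening soft matches.
BINS = [(1.0, 1.0), (0.75, 1.0), (0.5, 0.75), (0.25, 0.5), (0.0, 0.25)]

def esr_features(sim):
    """sim: |Eq| x |Ed| entity translation matrix (cosine similarities).
    Returns the five bin counts for one document field."""
    s = sim.max(axis=0)            # max-pooling along the query dimension -> S(d)
    exact = np.isclose(s, 1.0)     # [1, 1]: exact entity matches (float-tolerant)
    feats = [exact.sum()]
    for lo, hi in BINS[1:]:
        in_bin = (s >= lo) & (s < hi) & ~exact
        if lo == 0.0:              # last bin is open at 0: (0, 0.25)
            in_bin = (s > lo) & (s < hi) & ~exact
        feats.append(in_bin.sum())
    return np.array(feats)
```

With the three paper fields (title, abstract, and body), these five bins yield the 15 features mentioned in the experimental section below.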
EXPERIMENTAL METHODOLOGY
The queries of the benchmark are sampled from S2's query log from the first six months of 2016.
Baselines: The baseline is the Semantic Scholar (S2) production system on July 1st, 2016 .... We also include BM25-F and tf.idf-F for reference. The BM25 and vector space models are applied to the paper's title, abstract, and body fields. Their parameters (field weights) are learned using the same learning-to-rank model as our method in the same cross-validation setting.
All other settings follow previous standards. The embedding dimensionality is 300. The five bins and three paper fields (title, abstract, and body) generate 15 features, which are combined with S2's original score using linear RankSVM [14]. All models' and baselines' parameters are trained and evaluated using 10-fold cross-validation with 80% train, 10% development, and 10% test in each fold. The hyper-parameter c of RankSVM is selected from {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} using the development part of each fold.
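RankSVM itself is not part of scikit-learn, but its linear form is commonly approximated with a pairwise transform: build difference vectors for document pairs within a query and fit a linear SVM on the preferences. The sketch below is an assumed stand-in for illustration, not the authors' exact implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, qids):
    """Difference vectors for document pairs within the same query.
    X: n_docs x n_features array; y: relevance labels; qids: query ids."""
    Xp, yp = [], []
    for q in np.unique(qids):
        idx = np.where(qids == q)[0]
        for i in idx:
            for j in idx:
                if y[i] > y[j]:          # document i is preferred over j
                    Xp.append(X[i] - X[j]); yp.append(1)
                    Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# X would hold the 16 features (15 bin features + S2's score) per document.
# Xp, yp = pairwise_transform(X, y, qids)
# ranker = LinearSVC(C=0.01).fit(Xp, yp)  # C chosen on the dev part of each fold
# scores = X @ ranker.coef_.ravel()       # rank documents by this score
```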
EVALUATION RESULTS
ESR's improvements are larger at early ranks, with about 2-3% more improvement on NDCG@1 and NDCG@5 than on NDCG@20.
ESR matches query and documents on their entity representations using the semantics from the knowledge graph. The semantic contribution to ESR's ranking can come from two aspects: exact match with smart phrasing and soft match with knowledge graph embedding.
Entities' exact-match information (the first bin) provides about 6% gains over S2 when using the Context and Venue embeddings.
The soft-match information (the later bins) contributes approximately another 5% improvement. The soft-match bins record how many document entities are related to the query entities at certain strengths. Intuitively, soft match should contribute on all queries, since knowing more about the document entities should always help; but our examination finds this information is more effective on some queries than on others.
This experiment helps us understand why the explicit semantics in knowledge graphs is useful for ranking. By knowing which entity a phrase refers to and whether different phrases are about the same thing, the exact match signal is polished. By knowing the relatedness between query entities and document entities through their embeddings, additional soft match connections that describe query-document relatedness with semantics are incorporated.
ESR captures the semantic relatedness between entities by using their distance in the embedding space. An advantage of using embeddings over the raw knowledge graph edges is efficiency, which is tightly constrained in online search engines. By embedding the neighbors of an entity into a dense continuous vector, ESR avoids dealing at run time with the graph structure of the knowledge graph, which is sparse, of varying length, and much more expensive to handle than fixed-length dense vectors. However, does ESR sacrifice effectiveness for efficiency? The fourth experiment studied the influence of embeddings on ESR's effectiveness by comparing it with the ranking performance of using raw knowledge graph edges. In this experiment, a discrete vector representation is formed for each entity and each edge type: each dimension of the vector is a neighbor of the entity, weighted by the frequency with which the two are connected. The Raw variants of ESR replace the knowledge graph embedding with this discrete entity representation and keep everything else the same.
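For comparison, a hedged sketch of this Raw discrete representation, assuming one list of (entity, neighbor) pairs per edge type; cosine similarity over these sparse frequency vectors then replaces the embedding similarity:

```python
from collections import Counter
import math

def raw_vector(entity, edges):
    """Sparse neighbor-frequency vector: neighbor id -> connection count."""
    return Counter(tail for head, tail in edges if head == entity)

def sparse_cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```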
The result shows that ESR actually benefits from the knowledge graph embedding; the Raw methods almost always perform worse than the corresponding ESR version. We believe that the advantage of knowledge graph embedding is similar to that of word embedding [18]: the embedding better captures semantics by factoring the raw sparse data into a smooth low-dimensional space.
We believe ESR's two-stage pooling provides the right balance of model complexity and abstraction granularity, given the size of our training data.