Predicting Query Performance
| Date | Place | Author | Keyword |
| 2002 | SIGIR | Steve Cronen-Townsend, Yun Zhou, W. Bruce Croft | query clarity |
Summary
This paper develops an original method for predicting query performance. Performance is predicted by computing a clarity score: the relative entropy (KL divergence) between the query language model and the background (collection) language model. The authors suggest that low clarity scores indicate query ambiguity and are correlated with poor query performance.
Background
This paper leverages two language modeling ideas to compute the query clarity score:
- (query) relevance models
- Kullback-Leibler divergence
The paper is based on the following two hypotheses:
Hypothesis 1: Highly coherent, "clear" queries (queries about a single topic) produce relevance models characterized by unusually large probabilities for a small number of topical terms. Ambiguous queries, on the other hand, produce relevance models that are much "smoother" and hence closer to the background (collection) language model.
Hypothesis 2: There is a strong correlation between the query clarity score and query performance.
Contribution
Computing the query clarity score
- The clarity score is computed as KL(Q || C), where Q is the query language model and C is the background (collection) language model.
- The higher the KL divergence, the higher the query clarity score.
- Q is built from the relevance model induced by the query.
- The RM-1 method is used to build the relevance model, i.e., all terms are assumed to be sampled from the same model.
- Note that if Q is smoothed, KL(Q || C) involves a summation over all terms in the vocabulary.
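The computation described in the list above can be sketched on a toy collection. This is a minimal illustration, not the authors' implementation: the function name `clarity_score`, the linear smoothing weight `lam=0.6`, the uniform document prior, and the use of log base 2 are all assumptions made for the example, and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def clarity_score(query, docs, lam=0.6):
    """Toy clarity score: KL(Q || C) in bits, summed over the vocabulary.

    query: list of query terms; docs: list of non-empty token lists.
    lam is an assumed linear-smoothing weight (hypothetical default).
    """
    doc_counts = [Counter(d) for d in docs]
    doc_lens = [len(d) for d in docs]

    # Background (collection) language model C.
    coll_counts = Counter()
    for counts in doc_counts:
        coll_counts.update(counts)
    coll_len = sum(doc_lens)
    p_coll = {w: n / coll_len for w, n in coll_counts.items()}

    def p_w_given_d(w, i):
        # Document model smoothed with the collection model.
        ml = doc_counts[i][w] / doc_lens[i]
        return lam * ml + (1 - lam) * p_coll.get(w, 0.0)

    # P(D|Q) is proportional to the query likelihood (uniform prior assumed).
    likelihoods = [
        math.prod(p_w_given_d(q, i) for q in query) for i in range(len(docs))
    ]
    total = sum(likelihoods)
    if total == 0:
        return 0.0  # no query term occurs anywhere in the collection
    p_d_given_q = [like / total for like in likelihoods]

    # Query model Q: P(w|Q) = sum over D of P(w|D) * P(D|Q).
    score = 0.0
    for w, pc in p_coll.items():
        pq = sum(p_w_given_d(w, i) * p_d_given_q[i] for i in range(len(docs)))
        if pq > 0:
            score += pq * math.log2(pq / pc)
    return score


toy_docs = [
    ["jaguar", "car", "engine", "car"],
    ["jaguar", "cat", "jungle", "cat"],
    ["car", "engine", "road", "car"],
    ["cat", "jungle", "tree", "cat"],
]
clear = clarity_score(["car", "engine"], toy_docs)
ambiguous = clarity_score(["jaguar"], toy_docs)
```

In this toy collection the query "car engine" concentrates probability on the two car documents, while "jaguar" is split between a car document and a cat document, so its query model stays closer to the collection model and `ambiguous` comes out smaller than `clear`.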
Query clarity score applications
Correlation with average precision on TREC
- The authors use the Spearman rank correlation test to measure the correlation between clarity scores and the average precision of each query on TREC data.
- On a scale of [-1, 1] (-1 means opposite ranks, 1 means identical ranks), the correlation ranges from 0.368 to 0.577 across various TREC corpora, which indicates a significant correlation between clarity score and retrieval performance.
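For reference, the Spearman coefficient used above is the Pearson correlation of the rank vectors. The sketch below is a plain-Python version with average ranks for ties; the helper names and the per-query numbers are hypothetical and are not taken from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical clarity scores and average precision for five queries.
clarity = [0.5, 1.2, 2.1, 0.9, 1.8]
avg_precision = [0.10, 0.35, 0.30, 0.05, 0.60]
rho = spearman_rho(clarity, avg_precision)
```

Because only ranks matter, the coefficient is insensitive to the scale of either quantity, which suits comparing an entropy-based score against a precision-based one.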
Automatic Query Classification
- In this task, a query is considered either "good" or "bad" based on its clarity score. A "good" query is one that should yield coherent results, so its clarity score will be high. A "bad" query is an ambiguous query, which should yield a low clarity score.
- A simple thresholding rule is proposed to detect "bad" queries:
a query is deemed clear enough if an estimated 80% or more of single-term queries would have a lower clarity score.
- The performance of this rule compares favorably to the optimal situation, where all relevance judgments in the collection are known.
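The thresholding rule can be sketched as a simple check against a sample of single-term clarity scores. The function name `is_clear_enough` and the sample values are hypothetical, and using a raw empirical sample to estimate the 80% fraction is an assumption of this sketch.

```python
def is_clear_enough(query_score, single_term_scores, fraction=0.8):
    """Return True if at least `fraction` of the sampled single-term
    queries have a strictly lower clarity score than the query."""
    lower = sum(1 for s in single_term_scores if s < query_score)
    return lower / len(single_term_scores) >= fraction


# Hypothetical clarity scores of ten single-term queries.
sample = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]
good = is_clear_enough(1.6, sample)  # 8 of 10 sampled scores are lower
bad = is_clear_enough(1.0, sample)   # only 5 of 10 sampled scores are lower
```

Tying the threshold to the distribution of single-term scores, rather than to a fixed number, lets the same rule be reused on collections whose clarity scores fall in different ranges.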
Related work
- "Query performance prediction in Web Search Environments" by Zhou and Croft presents additional techniques for performance prediction that outperform the original query-clarity method on web corpora.
- "What makes a query difficult?" by D. Carmel et al. presents an approach to performance prediction that relies on corpus information as well as query information, i.e., on how "hard" the corpus is for the retrieval task in general.