Transforming Long Queries
Principal Investigator:
W. Bruce Croft, PI
croft@cs.umass.edu
Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
Project Summary
Long queries currently represent a small but significant percentage of the queries submitted to web search engines. In other applications, such as collaborative question answering, where people ask questions for other people to answer, long queries are typical rather than unusual. Many information needs can be expressed more easily using longer, sentence-length queries, but the inadequacies of current search engines force people to think up the right combination of keywords to find relevant documents. This can be very difficult and often leads to search failures. At the same time, long queries are handled poorly by current search engines. This is due at least in part to these queries being part of the “long tail”, meaning that they are infrequent and lack many of the statistical features that are used for effective ranking of short queries. Being able to handle long queries effectively would represent a significant advance in the capability of search engines from the user’s point of view, and should substantially improve our understanding of the underlying information retrieval process. In this project, we are studying long queries from web query logs and other sources, such as TREC collections, in order to develop new retrieval models and techniques for effective ranking. In particular, we focus on techniques for transforming long queries into equivalent queries that are more likely to perform well.
Query transformation steps such as stemming and expansion have been studied for many years, and segmentation has become an important part of processing web queries. In this project, we are working on two major changes: developing an integrated model of query transformation that includes all of these steps as part of retrieval, and focusing on long queries for which there is little click data. These changes will enable us to incorporate additional information that can be derived from a long query, such as relationships, and will be a significant development in the state of the art of retrieval models.
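To make the three transformation steps named above concrete, the following is a minimal Python sketch that applies segmentation, stemming, and expansion to one long query. It is an illustration only and not the project's model: the stopword list, the phrase lexicon, the synonym table, and the suffix-stripping rule are all hypothetical stand-ins, and the steps run as a simple sequence rather than as part of an integrated retrieval model.

```python
# Toy illustration only: every table and rule below is a hypothetical stand-in.
STOPWORDS = {"how", "do", "the", "a", "of", "to", "in", "for", "is", "are"}
PHRASES = {("search", "engines"), ("long", "queries")}  # assumed phrase lexicon
SYNONYMS = {"handle": ["process"]}  # assumed expansion table, keyed by stem


def stem(word):
    """Crude suffix stripping; this is not the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def segment(tokens):
    """Greedily group adjacent token pairs found in the phrase lexicon."""
    segments, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i : i + 2])
        if pair in PHRASES:
            segments.append(pair)
            i += 2
        else:
            segments.append((tokens[i],))
            i += 1
    return segments


def transform(query):
    """Return one list of alternative stemmed segments per query segment."""
    words = query.lower().replace("?", "").split()
    tokens = [w for w in words if w not in STOPWORDS]
    result = []
    for seg in segment(tokens):
        stems = tuple(stem(t) for t in seg)
        alternatives = [stems]
        # Expand single-term segments only, using the assumed synonym table.
        if len(stems) == 1:
            alternatives += [(s,) for s in SYNONYMS.get(stems[0], [])]
        result.append(alternatives)
    return result


if __name__ == "__main__":
    for alternatives in transform("How do search engines handle long queries?"):
        print(alternatives)
```

Each element of the result is a list of alternatives for one segment, so a downstream ranker could, for example, treat multi-term segments as phrases and single-term alternatives as interchangeable. That use is an assumption about how such output might be consumed, not a description of the project's retrieval model.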
Research in this area will have a direct impact on the ability of web search engines to provide effective answers to more complex questions. Given that search is one of the two most common activities on the web, and that people often have trouble finding good answers to many questions, this research could have a very broad impact, both in the home and in the office.
This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-0914442).