Transforming Long Queries

Principal Investigator:
W. Bruce Croft, PI
croft@cs.umass.edu

Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264

Project Summary

Long queries currently represent a small but significant percentage of the queries submitted to web search engines. In other applications, such as collaborative question answering, where people ask questions for other people to answer, long queries are typical rather than unusual. Many information needs can be expressed more easily using longer, sentence-length queries, but current search engines handle such queries poorly, which forces people to think up the right combination of keywords to find relevant documents. This can be very difficult and often leads to search failures. The poor handling of long queries is due at least in part to these queries being part of the “long tail”, meaning that they are infrequent and lack many of the statistical features that are used for effective ranking of short queries. Being able to handle long queries effectively would represent a significant advance in the capability of search engines from the user’s point of view, and should substantially improve our understanding of the underlying information retrieval process.

In this project, we are studying long queries from web query logs and other sources, such as TREC collections, in order to develop new retrieval models and techniques for effective ranking. In particular, we focus on techniques for transforming long queries into equivalent queries that are more likely to perform well.

Query transformation steps such as stemming and expansion have been studied for many years, and segmentation has become an important part of processing web queries. In this project, we are working on two major changes: developing an integrated model of query transformation that includes all of these steps as part of retrieval, and focusing on long queries, for which there is little click data. These changes will enable us to incorporate additional information that can be derived from a long query, such as relationships, and will be a significant development in the state of the art of retrieval models.

Research in this area will have a direct impact on the ability of web search engines to provide effective answers for more complex questions. Given that search is one of the two most common activities on the web and people often have trouble finding good answers to many questions, this research could have a very broad impact, both in the home and the office.

View details on the project's activities and findings.

Publications:

IR-739: (2009) Bendersky, M., Metzler, D. and Croft, W. B., "Learning Concept Importance Using a Weighted Dependence Model," in the Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pp. 31-40.

IR-751: (2010) Bendersky, M., Croft, W. B. and Smith, D., "Structural Annotation of Search Queries Using Pseudo-Relevance Feedback," CIIR Technical Report.

IR-755: (2010) Balasubramanian, N., Bendersky, M. and Allan, J., "Cost-Effective Combination of Multiple Rankers: Learning When Not To Query," NESCAI 2010, Amherst, MA, April 15-17, 2010.

IR-760: (2010) Dang, V., Bendersky, M. and Croft, W. B., "Learning to Rank Query Reformulations," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), Geneva, Switzerland, July 19-23, 2010, pp. 807-808.

IR-764: (2010) Croft, W. B. and Bendersky, M., "Do Longer Queries Retrieve More Diverse Results?," CIIR Technical Report.

IR-783: (2010) Bendersky, M., Croft, W. B. and Diao, Y., "Quality-Biased Ranking of Web Documents," Proceedings of the Fourth International Conference on Web Search and Data Mining (WSDM 2011), pp. 95-104.

IR-799: (2010) Bendersky, M., Fisher, D. and Croft, W. B., "UMass at TREC 2010 Web Track: Term Dependence, Spam Filtering and Quality Bias," Proceedings of Text REtrieval Conference (TREC 2010), Gaithersburg, MD, November 15-19, 2010.

IR-805: (2011) Bendersky, M., Metzler, D. and Croft, W. B., "Parameterized Concept Weighting in Verbose Queries," in the Proceedings of the 34th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'11), pp. 605-614.

IR-806: (2011) Park, J., Croft, W. B. and Smith, D., "Quasi-Synchronous Dependence Model for Information Retrieval," in the Proceedings of The ACM Conference on Information and Knowledge Management (CIKM 2011), pp. 17-26.

IR-810: (2011) Lee, C. and Croft, W. B., "Generating Queries from User-Selected Text," Proceedings of IIiX 2012, pp. 100-109.

IR-824: (2011) Bendersky, M., Croft, W. B. and Smith, D., "Joint Annotation of Search Queries," in the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pp. 102-111.

IR-827: (2011) Xue, X. and Croft, W. B., "Modeling Subset Distributions for Verbose Queries," Proceedings of the 34th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'11), pp. 1133-1134.

IR-846: (2011) Bendersky, M., Metzler, D. and Croft, W. B., "Effective Query Formulation with Multiple Information Sources," Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 443-452.

IR-847: (2011) Dang, V., Xue, X. and Croft, W. B., "Inferring Query Aspects from Reformulations Using Clustering," Proceedings of The ACM Conference on Information and Knowledge Management (CIKM 2011), pp. 2117-2120.

IR-860: (2011) Lee, C. and Croft, W. B., "Evaluating Search in Personal Social Media Collections," Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM 2012), pp. 683-692.

IR-868: (2012) Li, H., Xu, G., Croft, W. B., Bendersky, M., Wang, Z. and Viegas, E., "QRU-1: A Public Dataset for Promoting Query Representation and Understanding Research," Workshop on Web Search Click Data, February 12, 2012, Seattle, Washington, USA (WSCD 2012).

IR-874: (2012) Bendersky, M. and Croft, W. B., "Modeling Higher-Order Term Dependencies in Information Retrieval using Query Hypergraphs," Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 941-950.

IR-880: (2012) Xue, X. and Croft, W. B., "Generating Reformulation Trees for Complex Queries," Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 525-534.

IR-886: (2012) Kim, J. and Croft, W. B., "A Field Relevance Model for Structured Document Retrieval," Proceedings of the 34th European Conference on Information Retrieval (ECIR), pp. 97-108.

IR-895: (2012) Bendersky, M. and Smith, D., "A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases," in the online Proceedings of the Workshop on Computational Linguistics for Literature, co-located with the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 69-77.

IR-909: (2012) Xue, X., "Modeling Reformulation as Query Distributions," Ph.D. Thesis, University of Massachusetts Amherst, 2012.

IR-913: (2012) Kim, J., "Retrieval and Evaluation Techniques for Personal Information," Ph.D. Thesis, University of Massachusetts Amherst, 2012.

Students involved in the project:

  • Michael Bendersky (2012 Ph.D. graduate)
  • Youngho Kim
  • Chia-Jung Lee
  • Jae Hyun Park
  • Xiaobing Xue (2012 Ph.D. graduate)

NSF Project Abstract

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF IIS-0914442).