Searching Archives of Community Knowledge

Principal Investigator:
W. Bruce Croft, PI
croft@cs.umass.edu

Center for Intelligent Information Retrieval (CIIR)
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264

Project Summary

One of the emerging trends in Web information services is the growth of sites where people answer other people’s questions. This began with digital reference services, but has now become a popular part of Web search services on sites such as Yahoo! Answers. The huge number of retail and business sites that provide a FAQ (Frequently Asked Questions) service can also be viewed as the same type of system. Answering questions using a community of people is also an extremely popular Web application in other countries. Over time, these services build up very large archives of previous questions and their answers. These archives represent a new type of community-based knowledge that supplements and, in many cases, improves on the information found using standard Web search. Many of the questions answered regularly by these services do not, in fact, produce any useful results in a standard ranking of Web pages. Because people are responding to other people’s questions, the questions asked are, in contrast to Web search queries, quite long. In these types of services, it appears that people ask what they really want to know, rather than being forced to think up a couple of keywords that attempt to capture the essence of their question. A Q&A (Question and Answer) archive, therefore, represents a new and exciting linguistic resource that enables us to begin to investigate approaches to dealing with longer, more detailed questions and break through the “keyword barrier” of Web search.

To avoid the lag time involved in waiting for a personal response, a Q&A service will typically search the Q&A archive automatically to see if the same question has been asked before. If the question is found, then a previous answer can be provided with very little delay. In the usual search paradigm, the question is used to search the database of potential answers. In this case, by contrast, the question is used to search the database of previous questions, which in turn are associated with answers. In this project, we are studying the task of finding good answers in Q&A archives by investigating techniques for question retrieval and comparing them to alternatives such as direct answer retrieval. The techniques that we are developing to search the Q&A archives could also have a significant impact on all types of search engines. The large Q&A archives can be used as training data for models of text transformation. In other words, by developing models that learn how to recognize questions using these resources, we will also be learning how concepts or topics can be expressed in different ways. These transformation models could then be used to significantly improve the robustness of the topic models used in search engines, which will in turn substantially improve the effectiveness of the system.
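
As an illustration only, the following minimal Python sketch shows the general idea of question retrieval: a new question is scored against archived questions, and the answer attached to the best match is returned. It uses a query-likelihood score in which a word in the new question can be generated either directly from the archived question or through a word-to-word translation probability. The archive, the translation table, the function names, and the parameter values (`beta`, `lam`) are all hypothetical and chosen for the example. They are not the project's data, models, or settings.

```python
import math
from collections import Counter

# Hypothetical toy archive of (question, answer) pairs.
ARCHIVE = [
    ("how do i fix a flat bicycle tire",
     "Remove the wheel, patch or replace the inner tube, then reinflate."),
    ("what is the best way to learn python programming",
     "Work through a tutorial and build small projects."),
    ("how can i remove a coffee stain from a carpet",
     "Blot the stain, then apply a mix of vinegar and water."),
]

# Hypothetical translation probabilities P(new_word | archived_word).
# In practice such a table would be learned from data; here it is hand-made.
TRANSLATION = {
    ("repair", "fix"): 0.6,
    ("bike", "bicycle"): 0.7,
    ("puncture", "flat"): 0.4,
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


# Collection statistics over all archived questions, used for smoothing.
COLLECTION = Counter(w for question, _ in ARCHIVE for w in tokenize(question))
COLLECTION_LEN = sum(COLLECTION.values())
VOCAB_SIZE = len(COLLECTION) + 1  # one extra slot for unseen words


def score(new_question, archived_question, beta=0.5, lam=0.2):
    """Log-likelihood of the new question given an archived question."""
    counts = Counter(tokenize(archived_question))
    length = sum(counts.values())
    total = 0.0
    for w in tokenize(new_question):
        # Direct match: maximum-likelihood estimate from the archived question.
        p_direct = counts[w] / length
        # Translation match: sum over archived words t of P(w | t) * P(t | question).
        p_trans = sum(TRANSLATION.get((w, t), 0.0) * c / length
                      for t, c in counts.items())
        p_question = (1 - beta) * p_direct + beta * p_trans
        # Add-one smoothed background model keeps every probability non-zero.
        p_background = (COLLECTION[w] + 1) / (COLLECTION_LEN + VOCAB_SIZE)
        total += math.log((1 - lam) * p_question + lam * p_background)
    return total


def find_answer(new_question):
    """Return the best-matching archived (question, answer) pair."""
    return max(ARCHIVE, key=lambda pair: score(new_question, pair[0]))


if __name__ == "__main__":
    question, answer = find_answer("how to repair a puncture on my bike")
    print(question)
    print(answer)
```

With this toy data, the query "how to repair a puncture on my bike" shares almost no content words with the archived bicycle question, yet the translation entries let "repair", "puncture", and "bike" match "fix", "flat", and "bicycle". Direct answer retrieval would instead score the new question against the answer texts.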

The problem of recognizing or finding questions has received little attention, but it is an important problem that is conceptually different from typical document search. The use of text transformation models learned from the Q&A archives has the potential to considerably advance our knowledge of how models that have been very successful in machine translation applications can be applied to improve the effectiveness of search. We also expect to contribute new approaches to learning translation probabilities from multiple sources.

The outcomes of this project have a potential impact both on the emerging community-based Q&A services and on search engines that use query logs to suggest related queries. Based on our existing collaborations, we expect the results of this research to be used directly in operational systems. The research on transformation models also has the potential to significantly improve the coverage or recall of search engines in general. Appropriate research results from this project will be incorporated into the popular Lemur toolkit, developed jointly at the University of Massachusetts Amherst and Carnegie Mellon University, for general distribution.

View details on the project's Activities and Findings

Key Graduate Students involved in the project:

- Jangwon Seo
- Xiaobing Xue

Publications:

Xue, X., Croft, W. B. and Jeon, J., "Retrieval Models for Question and Answer Archives," in the Proceedings of the 31st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), pp. 475-482.

Seo, J., and Croft, W.B., "UMass at TREC 2008 Blog Distillation Task," in the electronic Proceedings of the Text REtrieval Conference (TREC) 2008, Gaithersburg MD, 2008.

Kim, J., Xue, X. and Croft, W.B., "A Probabilistic Retrieval Model for Semistructured Data," in the Proceedings of the 31st European Conference on Information Retrieval (ECIR 09), 2009, pp. 228-239.

Bendersky, M. and Croft, W.B., "Analysis of Long Queries in a Large Scale Search Log," in Proceedings of the Workshop on Web Search Click Data (WSCD 2009), pp. 8-14.

Seo, J., Croft, W.B. and Smith, D., "Online Community Search Using Thread Structure," in Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pp. 1907-1910.

Xue, X. and Croft, W.B., "Transforming Patents into Prior-Art Queries," in the Proceedings of the 32nd Annual ACM SIGIR Conference (SIGIR 2009), pp. 808-809.

Xue, X., Dang, V. and Croft, W.B., "Query Substitution based on N-gram Analysis," CIIR Technical Report, 2009.

Seo, J. and Jeon, J., "High Precision Retrieval Using Relevance-Flow Graph," in the Proceedings of the 32nd Annual ACM SIGIR Conference (SIGIR 2009), pp. 694-695.

Dang, V., Xue, X., and Croft, W.B., "Context-based Quasi-Synonym Extraction," CIIR Technical Report, 2009.

Seo, J. and Croft, W. B., "Thread-based Expert Finding," in the Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09) Workshop on Search in Social Media, 2009.

Xue, X. and Croft, W. B., "Automatic Query Generation for Patent Search," in the Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pp. 2037-2040.

Dang, V. and Croft, W. B., "Query Reformulation Using Anchor Text," in the Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pp. 41-50.

Seo, J. and Croft, W. B., "Geometric Representations for Multiple Documents," in the Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), pp. 251-258.

Dang, V., Bendersky, M. and Croft, W. B., "Learning to Rank Query Reformulations," in the Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), pp. 807-808.

Park, J. and Croft, W. B., "Query Term Ranking based on Dependency Parsing of Verbose Queries," in the Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), pp. 829-830.

Xue, X. and Croft, W. B., "Representing Queries as Distributions," in the Workshop Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010).

Xue, X., Croft, W. B. and Smith, D., "Modeling Reformulation Using Query Distributions," submitted to the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010).

Xue, X., Huston, S. and Croft, W. B., "Improving Verbose Queries using Subset Distribution," submitted to the 19th ACM International Conference on Information and Knowledge Management (CIKM 2010).

This work is supported in part by the Center for Intelligent Information Retrieval (CIIR) and in part by the National Science Foundation (NSF III COR - 0711348).