Learning Word Relationships Using TupleFlow

James Allan, PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

W. Bruce Croft, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
croft@cs.umass.edu
http://www.cs.umass.edu/~croft

Project Award Information

National Science Foundation Award Number: IIS-0844226
Cluster Exploratory (CluE) project

Duration: 02/01/2009 -- 01/31/2012

Project Summary

One of the key problems in search, especially in applications involving longer queries, is to transform queries into more effective forms with the same intent. For example, meals without meat would produce better results if it were transformed into vegetarian recipes. The Center for Intelligent Information Retrieval (CIIR) is investigating (1) how to find relationships between concepts, in particular different ways that the same concepts can be expressed, and (2) how those different relationships can be used to improve search. The CIIR is exploring entirely automatic techniques to identify relationships within massive corpora of text, using both pre-processing and search-time computation. The quality of word relationships that are discovered are being tested using large-scale search experiments. To do this, the CIIR will make use of the Google/IBM cluster.

In addition, the CIIR is adapting a new computational framework that was developed at the University of Massachusetts Amherst, called TupleFlow, to the Hadoop distributed processing environment. TupleFlow is an extension of the well-known MapReduce model, with advantages in flexibility, scalability, disk abstraction, and low abstraction penalties. The TupleFlow approach was developed for work such as relationship finding, and supports large-scale indexing and analysis operations on massive quantities of text.

The specific approaches to mining word relationships the CIIR will study are based on both direct and indirect word co-occurrence, as well as static and dynamic computation. In particular, the work will focus on techniques that create and use Web-based corpora of “comparable” sentences and text chunks for estimating word and phrase translation probabilities, and techniques that derive relationships from “context vectors” that represent word and phrase meanings. The quality of the word relationships that are discovered will be tested using large-scale retrieval experiments.

The potential payoff of the techniques for discovering and using word relationships in search engines is very significant and likely to be a major advance in search engine technology. Constructing and using relationships of this nature for search requires text comparisons at the peta (1015) scale and cannot be done using a handful of computers. It will only be with experiments on a scale that requires the computational facility provided by the Google/IBM facility that the true worth of these techniques can be determined.

This work has the potential to result in the next generation of search engines, ones that exploit word relationships to capture semantic relationships between concepts. The open problems are fundamental research issues of language technology and search: algorithms and models to efficiently and effectively identify relationships and use them to find material that responds to a request. The work also pushes the boundaries of the well-known MapReduce framework, integrating systems-level research along with efforts to capture semantics.

Annual Report Summary (Nov. 2009)
Annual Report Summary (Jan. 2011)
Final Report Summary (Jan. 2012)

Graduate Students Involved in this Project:

Marc Cartright
Jeff Dalton
Van Dang
Shiri Dori-Hacohen
Henry Feild
Sam Huston
Youngho Kim
Chia Lee
Xiaoye Wu
Xing Yi

Publications:

  • Dang, V. and Croft, W. B., "Query Reformulation Using Anchor Text" , in the Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM), pp. 41-50, 2010.
  • Dang, V., Xue, X. and Croft, W. B., "Context-based Quasi-Synonym Extraction" , CIIR Technical Report, 2009.
  • Xue, X., Dang, V. and Croft, W. B., "Query Substitution based on N-gram Analysis" , CIIR Technical Report, 2009.
  • Cartright, M., Allan, J., Lavrenko, V. and McGregor, A., "Fast Query Expansion Using Approximations of Relevance Models" , in the Proceedings of the Conference on Information and Knowledge Management, pp. 1573-1576, 2010.
  • Huston, S., Moffat, A. and Croft, W. B., "Efficient Indexing of Repeated n-Grams" , in the Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011, pp. 127-136.
  • Yi, X. and Allan, J., "Discovering Missing Click-through Query Language Information for Web Search," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 153-162.
  • Dang, V., Croft, W. B. and McGregor, A., "Query Suggestion Using Anchor Text," CIIR Technical Report.
  • Kim, Y., Seo, J. and Croft, W. B. , "Automatic Boolean Query Suggestion for Professional Search," Proceedings of SIGIR '11, pp. 825-834.
  • Dalton, J., Allan, J. and Smith, D., "Passage Retrieval for Incorporating Global Evidence in Sequence Labeling," Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 355-364.
  • Yi, X., "Discovering and Using Implicit Data for Information Retrieval," Ph.D. thesis, May, 2011.
  • Seo, J., "Search Using Social Media Structures," Ph.D. Thesis, University of Massachusetts Amherst.
  • Cartright, M., Feild, H. and Allan, J., "Evidence Finding using a Collection of Books," Proceedings of The International Conference on Information and Knowledge Management BooksOnline workshop (BooksOnline'11), pp. 11-18.

This work is supported by the National Science Foundation (Award Number IIS-0844226).