Learning Word Relationships Using TupleFlow

James Allan, PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

W. Bruce Croft, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
croft@cs.umass.edu
http://www.cs.umass.edu/~croft


Project Award Information

National Science Foundation Award Number: IIS-0844226
Cluster Exploratory (CluE) project

Duration: 02/01/2009 -- 01/31/2012


Project Summary


One of the key problems in search, especially in applications involving longer queries, is to transform queries into more effective forms with the same intent. For example, the query “meals without meat” would produce better results if it were transformed into “vegetarian recipes”. The Center for Intelligent Information Retrieval (CIIR) is investigating (1) how to find relationships between concepts, in particular the different ways that the same concepts can be expressed, and (2) how those relationships can be used to improve search. The CIIR is exploring entirely automatic techniques to identify relationships within massive corpora of text, using both pre-processing and search-time computation. The quality of the word relationships that are discovered is being tested using large-scale search experiments. To do this, the CIIR will make use of the Google/IBM cluster.
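
To make the idea concrete, the sketch below shows how a table of learned word relationships could be used to expand a query at search time. The table contents, weights, and function names are illustrative assumptions for this summary, not part of the project's actual system.

    # Hypothetical sketch: rewrite a query using a learned relationship table.
    # The table entries and weights below are illustrative assumptions only.
    RELATED_PHRASES = {
        "meals without meat": [("vegetarian recipes", 0.9), ("meatless dishes", 0.7)],
    }

    def expand_query(query, relations, max_alternatives=2):
        """Return the original query plus weighted alternative phrasings."""
        expansions = [(query, 1.0)]
        for alt, weight in relations.get(query.lower(), [])[:max_alternatives]:
            expansions.append((alt, weight))
        return expansions

    for phrasing, weight in expand_query("meals without meat", RELATED_PHRASES):
        print("%.1f  %s" % (weight, phrasing))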

In addition, the CIIR is adapting TupleFlow, a new computational framework developed at the University of Massachusetts Amherst, to the Hadoop distributed processing environment. TupleFlow is an extension of the well-known MapReduce model, with advantages in flexibility, scalability, disk abstraction, and low abstraction penalties. The TupleFlow approach was developed for work such as relationship finding, and supports large-scale indexing and analysis operations on massive quantities of text.
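
As a minimal illustration of the MapReduce pattern that TupleFlow generalizes, the sketch below counts word co-occurrences with a map step that emits (word pair, 1) records and a reduce step that sums them. The tokenization, window size, and all names here are simplifying assumptions and do not reflect TupleFlow's actual API.

    # Minimal sketch of the MapReduce pattern that TupleFlow generalizes:
    # count word co-occurrences within a fixed window. The tokenization and
    # window size are simplifying assumptions, not TupleFlow's actual API.
    from collections import defaultdict

    def map_cooccurrences(document, window=3):
        """Map step: emit ((word1, word2), 1) for words co-occurring in a window."""
        tokens = document.lower().split()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                yield tuple(sorted((tokens[i], tokens[j]))), 1

    def reduce_counts(pairs):
        """Reduce step: sum the counts grouped by word pair."""
        totals = defaultdict(int)
        for key, count in pairs:
            totals[key] += count
        return totals

    documents = ["vegetarian recipes for quick meals",
                 "quick meals without meat"]
    emitted = (record for doc in documents for record in map_cooccurrences(doc))
    for pair, count in sorted(reduce_counts(emitted).items()):
        print(count, pair)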

The specific approaches to mining word relationships that the CIIR will study are based on both direct and indirect word co-occurrence, as well as on static and dynamic computation. In particular, the work will focus on techniques that create and use Web-based corpora of “comparable” sentences and text chunks for estimating word and phrase translation probabilities, and techniques that derive relationships from “context vectors” that represent word and phrase meanings. The quality of the word relationships that are discovered will be tested using large-scale retrieval experiments.
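
The context-vector idea can be sketched as follows: represent each word by the counts of the words that appear near it, and compare words with cosine similarity, so that words used in similar contexts receive similar vectors. The window size, toy corpus, and names below are illustrative assumptions rather than the models the project will actually use.

    # Sketch of the context-vector approach: each word is represented by the
    # counts of its neighbors, and related words have similar vectors. The
    # window size and toy corpus are illustrative assumptions only.
    import math
    from collections import defaultdict

    def build_context_vectors(sentences, window=2):
        """Count, for each word, how often every other word appears nearby."""
        vectors = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, word in enumerate(tokens):
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        vectors[word][tokens[j]] += 1
        return vectors

    def cosine(v1, v2):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    corpus = ["vegetarian recipes avoid meat",
              "meatless meals avoid meat",
              "vegetarian meals use vegetables"]
    vectors = build_context_vectors(corpus)
    print(round(cosine(vectors["vegetarian"], vectors["meatless"]), 3))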

The potential payoff of the techniques for discovering and using word relationships in search engines is significant: if successful, they would represent a major advance in search engine technology. Constructing and using relationships of this nature for search requires text comparisons at the peta (10^15) scale and cannot be done using a handful of computers. Only experiments at a scale that requires the computational resources of the Google/IBM cluster can determine the true worth of these techniques.

This work has the potential to result in the next generation of search engines, ones that exploit word relationships to capture the semantic connections between concepts. The open problems are fundamental research issues of language technology and search: algorithms and models to efficiently and effectively identify relationships and use them to find material that responds to a request. The work also pushes the boundaries of the well-known MapReduce framework, integrating systems-level research with efforts to capture semantics.

Annual Report Summary (Nov. 2009)
Annual Report Summary (Jan. 2011)
Final Report Summary (Jan. 2012)

Graduate Students Involved in this Project:

Publications:


This work is supported by the National Science Foundation (Award Number IIS-0844226).