Current CIIR Projects




Lemur/Indri
The Lemur Project is a collaboration between the CIIR and the School of Computer Science at Carnegie Mellon University. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. As part of the Lemur Project, the CIIR has developed Indri, a language model-based search engine for complex queries. In an NSF-funded CRI collaborative research project between UMass Amherst and CMU, the team is focusing on the continued development of the open-source Lemur software toolkit for language modeling and information retrieval.

Transforming Long Queries
The focus of this NSF-funded project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending and integrating this work in a common framework.
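To make the three transformation steps named above concrete, here is a minimal sketch that chains stemming, segmentation, and expansion over a query string. It is an illustration only, not the project's method: the suffix rules, the phrase list, the synonym table, and all function names are hypothetical, and the real work guides these steps with retrieval models rather than fixed lookup tables.

```python
# Toy illustration only: the suffix rules, phrase list, and synonym table
# below are invented for this sketch and do not come from the project.
SUFFIXES = ("ing", "es", "s")
PHRASES = {("new", "york")}
SYNONYMS = {"hotel": ["inn"]}


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def segment(tokens):
    """Greedily join adjacent tokens that form a known two-word phrase."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            out.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def expand(units):
    """Append synonyms for any unit found in the synonym table."""
    extra = [syn for unit in units for syn in SYNONYMS.get(unit, [])]
    return units + extra


def transform(query):
    """Lowercase, segment into phrases, stem single words, then expand."""
    tokens = query.lower().split()
    units = segment(tokens)
    units = [u if " " in u else stem(u) for u in units]
    return expand(units)


print(transform("Cheap hotels in New York"))
```

Segmentation runs before stemming here so that phrase matching sees the original word forms; other orderings are equally plausible.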

Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
This NSF-funded Data Intensive Computing project is a collaboration with Tufts University and the Internet Archive. It aims for transformative advances in current technology by using data-intensive processing of large corpora to provide improved automatic support for search and analysis. The research is being carried out on a collection of over a million scanned books gathered by the Internet Archive. The collection includes 8.5 terabytes of text and half a petabyte of scanned images. CIIR researchers will develop new approaches to processing this large collection. The resulting improved corpus will be indexed at the Internet Archive, allowing more accurate and powerful search. Researchers at Tufts University will develop approaches for exploratory data analysis on the processed collection.

Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores semantic relationships between words, in particular how different words can express the same content, and how those relationships can be used to improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and a new distributed computational framework developed at UMass Amherst. The TupleFlow system was developed for the type of indexing and analysis operations required to study word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.
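As a rough picture of the kind of analysis involved, the sketch below counts word co-occurrences in the plain MapReduce style that TupleFlow extends. It does not use TupleFlow or its API, and it runs in a single process; the function names and the choice of sentence-level co-occurrence as the relationship signal are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_pairs(sentence):
    """Map step: emit ((word_a, word_b), 1) for each unordered pair of distinct words."""
    words = sorted(set(sentence.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def cooccurrence(sentences):
    """Run the map step over every sentence and reduce the emitted pairs."""
    emitted = (kv for sentence in sentences for kv in map_pairs(sentence))
    return reduce_counts(emitted)


counts = cooccurrence(["big car", "big automobile", "car big"])
print(counts)
```

Sorting the words in each sentence gives every pair one canonical key, so "big car" and "car big" are counted together. On a cluster, the map and reduce steps would run on separate machines with a shuffle between them.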

Broad Operational Language Technology (BOLT)
The Broad Operational Language Technology (BOLT) Program has a goal of creating technology capable of translating multiple foreign languages in all genres, retrieving information from the translated material, and enabling bilingual communication via speech or text. The CIIR at UMass Amherst is part of the SRI team. This team will conduct work on four activities: (a) a genre-independent translation and information retrieval system; (b) a human-machine communication system; (c) a human-human dialogue system; and (d) Arabic dialect components. The CIIR will focus on developing a cross-lingual information retrieval system for informal document genres (e.g., forums).

Situation Understanding Bot Through Language and Environment (SUBTLE)
This USAR MURI project is a collaboration with Cornell, George Mason, Stanford, University of Pennsylvania, University of Massachusetts Amherst, and University of Massachusetts Lowell. The researchers are developing methods for constructing a computationally tractable end-to-end system for a habitable subset of English, including both a formal representation of the implicit meaning of utterances and the generation of control programs for a robot platform. They are also developing a virtual simulation of the USAR environment to enable inexpensive large-scale corpus collection to proceed during many stages of system development. The UMass Amherst team is working on the natural language processing and machine learning research for the project, specifically the command interface and command logic.

Flexible Acquisition and Understanding System for Text (FAUST)
This DARPA project is developing an automated machine reading system that makes the information in natural language texts accessible to formal reasoning systems. UMass Amherst is part of the SRI International team, which also includes Columbia, Stanford, University of Illinois, University of Washington, University of Wisconsin, and Wake Forest University.