Current CIIR Projects
The Lemur Project is a collaboration between the CIIR and the School of Computer Science at Carnegie Mellon University. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. As part of the Lemur Project, the CIIR has developed Indri, a language model-based search engine for complex queries. In an NSF-funded CRI collaborative research project, UMass Amherst and CMU are focusing on the continued development of the open-source Lemur software toolkit.
Connecting the Ephemeral and Archival Information Networks
This NSF-funded project is a collaboration among the CIIR, Carnegie Mellon University, and RMIT University. The team will use the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. We will test our hypothesis on a range of existing TREC tasks focused on either social media search or web search. In addition, we will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.
Transforming Long Queries
The focus of this NSF-funded project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending and integrating this work in a common framework.
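As a rough illustration of the transformation steps named above (segmentation, stemming, and expansion), the following minimal sketch chains them on a toy query. It is not the project's framework: the phrase list, synonym table, suffix rules, and all function names are hypothetical, and the steps here are rule-based rather than guided by a retrieval model.

```python
# Toy query transformation pipeline: segmentation, stemming, expansion.
# Every table and rule below is an illustrative assumption, not project data.

KNOWN_PHRASES = {("new", "york"), ("information", "retrieval")}  # hypothetical phrase list
SYNONYMS = {"car": ["automobile"], "cheap": ["inexpensive"]}     # hypothetical expansion table
SUFFIXES = ("ing", "es", "s")                                    # crude suffix stripping, not a real stemmer


def segment(tokens):
    """Greedily group adjacent tokens that form a known two-word phrase."""
    segments, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in KNOWN_PHRASES:
            segments.append(" ".join(pair))
            i += 2
        else:
            segments.append(tokens[i])
            i += 1
    return segments


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(term):
    """Return the term followed by any synonyms listed for it."""
    return [term] + SYNONYMS.get(term, [])


def transform(query):
    """Return one list of alternatives per query segment."""
    tokens = query.lower().split()
    result = []
    for seg in segment(tokens):
        if " " in seg:
            result.append([seg])               # keep phrases intact
        else:
            result.append(expand(stem(seg)))   # stem, then expand single words
    return result


if __name__ == "__main__":
    print(transform("cheap cars in New York"))
```

In a retrieval-guided setting, each candidate transformation would instead be scored by its effect on ranking, which this sketch does not attempt.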
Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
This NSF-funded Data Intensive Computing project is a collaboration with Tufts University and the Internet Archive. It aims for transformative advances in current technology, using data-intensive processing of large corpora to provide improved automatic support for search and analysis. The research is being carried out on a collection of over a million scanned books gathered by the Internet Archive. The collection includes 8.5 terabytes of text and half a petabyte of scanned images. CIIR researchers will develop new approaches to processing the large collection. The resulting improved corpus will be indexed at the Internet Archive, allowing more accurate and powerful search. Researchers at Tufts University will develop approaches for exploratory data analysis on the processed collection.
Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores semantic relationships between words, in particular how different words can express the same content, and how those relationships can be used to improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and a new distributed computational framework that was developed at UMass Amherst. The TupleFlow system was developed for the type of indexing and analysis operations that are required for the study of word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.
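To give a sense of the MapReduce-style computation that TupleFlow extends, here is a minimal single-machine sketch that counts how often pairs of words co-occur in the same sentence. It uses plain Python rather than the TupleFlow API, and the function names and the choice of sentence-level co-occurrence as the relationship signal are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(sentence):
    """Emit ((word_a, word_b), 1) for each unordered pair of distinct words."""
    words = sorted(set(sentence.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word pair."""
    return {key: sum(values) for key, values in groups.items()}


def cooccurrence_counts(sentences):
    """Run map, shuffle, and reduce over an iterable of sentences."""
    emitted = (kv for sentence in sentences for kv in map_phase(sentence))
    return reduce_phase(shuffle(emitted))


if __name__ == "__main__":
    counts = cooccurrence_counts(["cheap car", "cheap automobile", "cheap car"])
    print(counts)
```

On a cluster, the map and reduce phases would run in parallel over partitions of the corpus; this sketch only shows the data flow.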
Broad Operational Language Technology (BOLT)
The Broad Operational Language Technology (BOLT) Program has a goal of creating technology capable of translating multiple foreign languages in all genres, retrieving information from the translated material, and enabling bilingual communication via speech or text. The CIIR at UMass Amherst is part of the IBM team. The CIIR will be focusing on developing a cross-lingual information retrieval system for informal document genres (e.g., forums).
Proteus Infrastructure: Work Aggregation and Entity Extraction
This Mellon Foundation grant supports development of software and techniques for scholars in the humanities to use in processing large corpora of digitized books. Specifically, this is a pilot project to build and evaluate research infrastructure for scanned books. While there are several large scanned book collections (for example, the Internet Archive), much of this material is unstructured and not easily used by scholars in the humanities. The grant will support building the Proteus infrastructure, which will help scholars navigate and use such collections more easily. Components of the infrastructure include automatically identifying a book’s language, linking multiple editions of canonical works, finding quotations in canonical works, and entity detection.
OCRing Early Modern Text
This Mellon Foundation grant is part of a larger grant to Texas A&M. The teams will recognize the text of 18th-century English books using optical character recognition systems, and they will use their technology to automatically estimate OCR errors and correct the output of multiple OCR engines. This will be done using fast alignment algorithms.
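The idea of aligning the output of multiple OCR engines can be sketched as follows. This is only an illustration: it uses Python's standard `difflib.SequenceMatcher` for word-level alignment rather than the project's fast alignment algorithms, and the function name `disagreements` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a, ocr_b):
    """Align two OCR outputs word by word and return the spans where they differ.

    Each result is a (text_from_a, text_from_b) pair; either side may be
    empty when one engine produced words the other did not.
    """
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs


if __name__ == "__main__":
    # Hypothetical outputs of two engines for the same line of text.
    print(disagreements("the quick brown fox", "the qnick brown fox"))
```

Spans where the engines disagree are candidate OCR errors; agreement does not guarantee correctness, since both engines can make the same mistake.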
Situation Understanding Bot Through Language and Environment (SUBTLE)
This USAR MURI project is a collaboration with Cornell, George Mason, Stanford, University of Pennsylvania, University of Massachusetts Amherst, and University of Massachusetts Lowell. The researchers are developing methods for constructing a computationally tractable end-to-end system for a habitable subset of English, including both a formal representation of the implicit meaning of utterances and the generation of control programs for a robot platform. They are also developing a virtual simulation of the USAR environment to enable inexpensive large-scale corpus collection to proceed during many stages of system development. The UMass Amherst team is working on the project's natural language processing and machine learning research, specifically the command interface and command logic.
Flexible Acquisition and Understanding System for Text (FAUST)
This DARPA project is developing an automated machine reading system that makes the information in natural language texts accessible to formal reasoning systems. UMass Amherst is part of the SRI International team, which also includes Columbia, Stanford, University of Illinois, University of Washington, University of Wisconsin, and Wake Forest University.