Current CIIR Projects

The Lemur Project is a collaboration between the CIIR and the School of Computer Science at Carnegie Mellon University. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. As part of the Lemur Project, the CIIR has developed Indri, a language model-based search engine for complex queries. In an NSF-funded CRI collaborative research project between UMass Amherst and CMU, the team is focusing on the continued development of the open-source Lemur software toolkit.

Connecting the Ephemeral and Archival Information Networks
This NSF-funded project is a collaboration among the CIIR, Carnegie Mellon University, and RMIT University. The working hypothesis is that the explicit and implicit links between the ephemeral and archival networks can improve the effectiveness of search that is targeted at social data, web data, or both. The team will test this hypothesis using a range of existing TREC tasks focused on either social media search or web search. In addition, the team will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.

Understanding the Relevance of Text Passages
Developing effective passage retrieval would have a major effect on search tools by greatly extending the range of queries that could be answered directly using text passages retrieved from the web. This is particularly important for mobile search applications, where output bandwidth is limited by a small screen or by speech output. In this case, the ability to use passages to reduce the amount of output while maintaining high relevance will be critical. In this NSF-funded project, we study research issues that have been either ignored or only partially addressed in prior research, such as showing whether passages can be better answers than documents for some queries, predicting which queries have good answers at the passage level, ranking passages to retrieve the best answers, and evaluating the effectiveness of passages as answers.

Transforming Long Queries
The focus of this NSF-funded project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending and integrating this work in a common framework.
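
As a rough illustration of how the transformation steps named above (segmentation, stemming, and expansion) can be chained, here is a minimal Python sketch. It is not the project's framework: the phrase list, the synonym table, the suffix-stripping stemmer, and every function name are hypothetical stand-ins, and the sketch does not model the retrieval-model guidance that the project emphasizes.

```python
# Toy resources; both are illustrative assumptions, not project data.
KNOWN_PHRASES = {("new", "york"), ("machine", "learning")}
SYNONYMS = {"car": ["automobile"], "cheap": ["inexpensive"]}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def segment(tokens):
    """Greedily group adjacent tokens that form a known two-word phrase."""
    segments, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i : i + 2])
        if pair in KNOWN_PHRASES:
            segments.append(" ".join(pair))
            i += 2
        else:
            segments.append(tokens[i])
            i += 1
    return segments


def expand(segments):
    """Append synonyms for any segment that has an entry in SYNONYMS."""
    expanded = list(segments)
    for seg in segments:
        expanded.extend(SYNONYMS.get(seg, []))
    return expanded


def transform_query(query):
    """Apply segmentation, then stemming, then expansion to a raw query."""
    tokens = query.lower().split()
    segments = segment(tokens)
    # Stem only single-word segments so multi-word phrases stay intact.
    stemmed = [s if " " in s else stem(s) for s in segments]
    return expand(stemmed)


print(transform_query("cheap cars in New York"))
```

Segmenting before stemming is a choice made for this sketch so that phrases are matched on their surface form; a system guided by a retrieval model would instead score alternative transformations and keep the ones expected to retrieve better results.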

Topical Positioning System (TPS) for Informed Reading of Web Pages
This NSF-funded project addresses the challenge of increasing the critical literacy of people looking for information on the Web, including information regarding healthcare, policy, or any other broadly discussed topic. Research on the Topical Positioning System (TPS) is driven by the vision of a browser tool that shows a person whether the web page in front of them discusses a provocative topic, whether the material is presented in a heavily biased way, whether it represents an outlier (fringe) idea, and how its discussion of issues relates to the broader context and to information presented in "familiar" sources.

Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
This NSF-funded Data Intensive Computing project is a collaboration with Tufts University and the Internet Archive. It aims for transformative advances in current technology, providing improved automatic support for search and analysis through data-intensive processing of large corpora. This research is being carried out using a collection of over a million scanned books gathered by the Internet Archive. The collection includes 8.5 terabytes of text and half a petabyte of scanned images. CIIR researchers will develop new approaches to processing the large collection. The resulting improved corpus will be indexed at the Internet Archive, allowing more accurate and powerful search. Researchers at Tufts University will develop approaches for exploratory data analysis on the processed collection.

Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores how semantic relationships between words, and the ways different words can express the same content, can be used to improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and a new distributed computational framework that was developed at UMass Amherst. The TupleFlow system was developed for the type of indexing and analysis operations that are required for the study of word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.
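
To make the MapReduce pattern behind this kind of analysis concrete, the following is a small single-process Python sketch that counts how often pairs of words occur in the same sentence. It does not use TupleFlow or its API; the function names and the choice of sentence-level co-occurrence as the relationship signal are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(sentence):
    """Emit ((word_a, word_b), 1) for each unordered pair of distinct words."""
    words = sorted(set(sentence.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word pair."""
    return {key: sum(values) for key, values in grouped.items()}


def cooccurrence_counts(sentences):
    """Run the map, shuffle, and reduce steps over a list of sentences."""
    emitted = (kv for s in sentences for kv in map_phase(s))
    return reduce_phase(shuffle(emitted))


counts = cooccurrence_counts(["cheap car", "cheap used car"])
print(counts[("car", "cheap")])
```

In a distributed setting the map and reduce steps would run on many machines and the grouping step would be handled by the framework; here all three run in one process so the data flow is easy to follow.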

Broad Operational Language Technology (BOLT)
The Broad Operational Language Technology (BOLT) Program has a goal of creating technology capable of translating multiple foreign languages in all genres, retrieving information from the translated material, and enabling bilingual communication via speech or text. The CIIR at UMass Amherst is part of the IBM team. The CIIR will be focusing on developing a cross-lingual information retrieval system for informal document genres (e.g., forums).

Proteus Infrastructure: Work Aggregation and Entity Extraction
This Mellon Foundation grant supports development of software and techniques for scholars in the humanities to use in processing large corpora of digitized books. Specifically, this is a pilot project to build and evaluate research infrastructure for scanned books. While there are several large scanned book collections (for example, the Internet Archive), much of this material is unstructured and not easily used by scholars in the humanities. The grant will support building a Proteus infrastructure, which will help scholars navigate and use such collections more easily. Components of the infrastructure include automatically identifying a book’s language, linking multiple editions of canonical works, finding quotations in canonical works, and entity detection.

OCRing Early Modern Text
This Mellon Foundation grant is part of a larger grant to Texas A&M. The teams will recognize the text of 18th-century English books using optical character recognition (OCR) systems, and they will use their technology to automatically estimate OCR errors and correct the output of multiple OCR engines. This will be done using fast alignment algorithms.
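
The idea of aligning the outputs of several OCR engines to locate likely errors can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the project's fast alignment algorithms, compares just two engines, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a, ocr_b):
    """Align two OCR outputs word by word and return the spans that differ."""
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return diffs


def disagreement_rate(ocr_a, ocr_b):
    """Fraction of words in the first output with no match in the second."""
    words_a, words_b = ocr_a.split(), ocr_b.split()
    if not words_a:
        return 0.0
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(words_a)


engine_a = "the quick brown fox"
engine_b = "the qnick brown fox"
print(disagreements(engine_a, engine_b))
print(disagreement_rate(engine_a, engine_b))
```

Words on which the engines agree are more likely to be correct, so the disagreement rate gives a crude error estimate without ground truth. With three or more engines, the aligned positions could be resolved by voting, which is one way to correct the combined output.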

Situation Understanding Bot Through Language and Environment (SUBTLE)
This USAR MURI project is a collaboration with Cornell, George Mason, Stanford, University of Pennsylvania, University of Massachusetts Amherst, and University of Massachusetts Lowell. The researchers are developing methods for constructing a computationally tractable end-to-end system for a habitable subset of English, including both a formal representation of the implicit meaning of utterances and the generation of control programs for a robot platform. They are also developing a virtual simulation of the USAR environment to enable inexpensive large-scale corpus collection to proceed during many stages of system development. The UMass Amherst team is working on the natural language processing and machine learning research for the project, specifically the command interface and command logic.

Flexible Acquisition and Understanding System for Text (FAUST)
This DARPA project is developing an automated machine reading system that makes the information in natural language texts accessible to formal reasoning systems. UMass Amherst is part of the SRI International team that also consists of Columbia, Stanford, University of Illinois, University of Washington, University of Wisconsin, and Wake Forest University.