Completed Projects

This research project, Relevance Models and Answer Granularity for Question Answering, is an ARDA initiative under the Advanced Question and Answering for Intelligence (AQUAINT) program.

Automated Diagnosis of Usability Problems Using Statistical Computational Methods
The effects of poor usability range from mere inconvenience to disaster. Human factors specialists employ usability analysis to reduce the likelihood or impact of such failures. However, good usability analysis requires usability reports that are rarely collected, rarely complete, and difficult to analyze. The CIIR and Aptima have partnered on this AFOSR STTR project to develop a usability analysis system that addresses these problems.

Broad Operational Language Technology (BOLT)
The Broad Operational Language Technology (BOLT) Program has a goal of creating technology capable of translating multiple foreign languages in all genres, retrieving information from the translated material, and enabling bilingual communication via speech or text. The CIIR at UMass Amherst is part of the IBM team. The CIIR will be focusing on developing a cross-lingual information retrieval system for informal document genres (e.g., forums).

CALO Project
As part of DARPA’s Perceptive Agent that Learns (PAL) program, SRI and team members including the CIIR are working on developing a next-generation "Cognitive Agent that Learns and Organizes" (CALO).

Confidence Measures for Information Extraction of Entities, Relations and Object Correspondence
In this NSF KDD project, UMass Amherst intends to improve the state of the art in associating confidence measures with information extracted from unstructured text. The team will build on its previously successful research in probabilistic models for confidence assessment of individual extracted text segments and will provide new capabilities for confidence assessment of object correspondence and of relations between entities.

Discovering and Using Meta-Terms (Microsoft Live Labs)
In this project, sponsored by Microsoft Live Labs' Accelerating Search in Academic Research Initiative, researchers will use Microsoft query logs and another Web-based collection to develop techniques to discover meta-terms in queries and then mine related words from the Web in an effort to test various approaches to query reformulation or transformation.

Flexible Acquisition and Understanding System for Text (FAUST)
This DARPA project is developing an automated machine reading system that makes the information in natural language texts accessible to formal reasoning systems. UMass Amherst is part of the SRI International team that also consists of Columbia, Stanford, University of Illinois, University of Washington, University of Wisconsin, and Wake Forest University.

Machine Learning for Sequences and Structured Data: Tools for Non-Experts
In this NSF-funded ITR collaborative research project between UMass Amherst, UPenn, and CMU, the team is researching ways to dramatically improve the ability of people who are not experts in machine learning to design and automatically train models for analyzing and transforming sequences and other structured data such as text, signals, handwriting, and biological sequences.

The CIIR is embarking on a five-year DARPA project under the Global Autonomous Language Exploitation (GALE) program. The goal of GALE is to make foreign language (Arabic and Chinese) speech and text accessible to monolingual English speakers, particularly in military settings. The Nightingale research team includes UMass Amherst, Columbia University, International Computer Science Institute (ICSI), IDIAP Research Institute, HNC/Fair Isaac Corporation, New York University, National Research Council (NRC) Canada, Purdue University, RWTH Aachen University, University of California San Diego, University of Washington, Systran Software, and SRI International. The UMass Amherst team focuses on highly accurate retrieval, dynamic topic models, social network discovery, and statistical machine translation.

NSF Digital Government Project
This research project, A Language-Modeling Approach to Metadata for Cross-Database Linkage and Search, is a National Science Foundation sponsored initiative. The CIIR is working in collaboration with Carnegie Mellon University, the Library of Congress, Department of Commerce, U.S. Geological Survey, and R.I.S.C. on this project.

NSF Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores how semantic relationships between words, and the ways different words can express the same content, can be used to improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and a new distributed computational framework that was developed at UMass Amherst. The TupleFlow system was developed for the type of indexing and analysis operations that are required for the study of word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.

NSF IDM Mongrel Project
"Supporting Effective Access through User- and Topic-Based Language Models" is a research project in collaboration with Rutgers University and sponsored by the NSF IDM program.

NSF NSDL - Search and Browsing Support for NSDL
On an NSF National Science, Mathematics, Engineering, and Technology Education Digital Library (NSDL) project, the CIIR worked with a team of institutions that are developing the technical capabilities and executing the organizational responsibilities of the core integration of the NSDL Program.

NSF NSDL - Question Triage for Experts and Documents: Expanding the Information Retrieval Function of the NSDL
On an NSF National Science, Mathematics, Engineering, and Technology Education Digital Library (NSDL) project, the CIIR is partnering with the Information Institute of Syracuse (IIS) and the Wondir Foundation to enhance the NSDL by merging the information retrieval (IR) and digital reference components. By combining these functions, users can find answers to their questions regardless of whether those answers come from documents in NSDL collections or from experts accessible through the NSDL's virtual reference desk.

NSF - Searching Archives of Community Knowledge
In this NSF-funded project, we are studying the task of finding good answers in Collaborative Question Answering archives by investigating techniques for question retrieval and comparing them to alternatives such as direct answer retrieval. The techniques that we are developing to search the CQA archives also have the potential to have a significant impact on all types of search engines. The large CQA archives can be used as training data for models of text transformation. In other words, by developing models that learn how to recognize questions using these resources, we will also be learning how concepts or topics can be expressed in different ways. These transformation models could then be used to significantly improve the robustness of the topic models used in search engines, which will in turn substantially improve the effectiveness of the system.

NSF SGER: Breaking the keyword bottleneck: Towards more effective access of government information
In this NSF project, we are carrying out initial experiments with retrieval models for complex queries that go beyond the typical “bag-of-words” approach. There are two major issues that we explore in the development of new retrieval models. First, in order to improve system robustness, we need to develop models that more reliably capture topical relevance than our current models. This means we need to have models that are better at recognizing different ways that topics can be described in text. Second, in order to improve the system accuracy in the top ranked documents, we need to develop models that more precisely capture topical relevance. This means retrieval models need to be better at recognizing and incorporating the specific concepts and relationships that are required by the query.

NSF - Text Reuse and Information Flow
In this NSF-funded project, we are studying a range of approaches to detecting reuse at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. We are also developing algorithms for inferring information flow from timelines, sources, and reuse measures. Given the importance of the Web as a source for detecting reuse, we also focus on techniques that can make efficient use of this huge but unwieldy resource. The research is being evaluated using a range of corpora, such as news, Web crawls, and blogs, in order to explore the dimensions of reuse and information flow in different situations.

NSF - Transforming Long Queries
The focus of this NSF-funded project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending and integrating this work in a common framework.

PCORI - Patient Experience Recommender System
The focus of this PCORI pilot project is to maximize patient perspective and effectively support lifestyle choices by developing the "Patient Experience Recommender System for Persuasive Communication Tailoring" (PERSPeCT), an adaptive computer system that assesses a patient's individual perspective, understands the patient's preference for health messages, and provides personalized, persuasive health communication relevant to the individual patient. The project is a collaboration between the UMass Medical Division of Health Informatics and Implementation Science and the CIIR.

Proteus Infrastructure: Work Aggregation and Entity Extraction
This Mellon Foundation grant supports development of software and techniques for scholars in the humanities to use in processing large corpora of digitized books. Specifically, this is a pilot project to build and evaluate research infrastructure for scanned books. While there are several large scanned book collections (for example, the Internet Archive), much of this material is unstructured and not easily used by scholars in the humanities. The grant will support building a Proteus infrastructure which will help scholars navigate and use such collections more easily. Components of the infrastructure include automatically identifying a book’s language, linking multiple editions of canonical works, finding quotations in canonical works, and entity detection.

OCRing Early Modern Text
This Mellon Foundation grant is part of a larger grant to Texas A&M. The teams will recognize the text of 18th-century English books using optical character recognition systems, and they will use their technology to automatically estimate OCR errors and correct the output of multiple OCR engines. This will be done using fast alignment algorithms.

Statistical Models for Information Extraction for REFLEX
In this project, UMass Amherst is a subcontractor to BBN Technologies on a DARPA-sponsored project to develop statistical models for information extraction that combine many sources of information in novel, integrated ways.

The research project, Tools for Rapidly Adaptable Translingual Information Retrieval and Organization, is a DARPA-sponsored initiative under the Translingual Information Detection, Extraction, and Summarization (TIDES) program on fast machine translation and information access. Another project, Formal Frameworks and Empirical Evaluations for Information Organization, is a continuation of the DARPA-sponsored TIDES initiative.

Topic Detection and Tracking (TDT)
TDT is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories.

Unified Graphical Models
"Unified Graphical Models of Information Extraction and Data Mining with Application to Social Network Analysis" is an NSF ITR research project that aims to improve the ability to data mine information previously locked in unstructured natural language text. The research focuses on developing novel statistical models for information extraction and data mining that have such tight integration that the boundaries between them disappear, resulting in a powerful unified framework for extraction and mining.

USAR MURI - Situation Understanding Bot Through Language and Environment (SUBTLE)
This USAR MURI project is a collaboration among Cornell, George Mason, Stanford, University of Pennsylvania, University of Massachusetts Amherst, and University of Massachusetts Lowell. The researchers are developing methods for constructing a computationally tractable end-to-end system for a habitable subset of English, including both a formal representation of the implicit meaning of utterances and the generation of control programs for a robot platform. They are also developing a virtual simulation of the USAR environment to enable inexpensive large-scale corpus collection to proceed during many stages of system development. The UMass Amherst team is working on the natural language processing and machine learning research on the project, specifically the command interface and command logic.

Word Spotting: Indexing Handwritten Manuscripts
Word Spotting is sponsored by the National Science Foundation Digital Libraries II program. This project researches and develops innovative techniques for indexing handwritten historical manuscripts written by a single author.