Current CIIR Projects
Lemur/Indri
The Lemur Project is a collaboration between the CIIR and the School of Computer Science at Carnegie Mellon University. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. As part of the Lemur Project, the CIIR has developed Indri, a language model-based search engine for complex queries. In an NSF-funded CRI collaborative research project between UMass Amherst and CMU, the team is focusing on the continued development of the open-source Lemur software toolkit for language modeling and information retrieval.
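As a rough illustration of what "language model-based" retrieval means, the sketch below ranks documents by query likelihood with Dirichlet-smoothed term estimates. It is plain Python written for this page and does not use the Lemur or Indri APIs; the function name `query_likelihood`, the smoothing parameter `mu`, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000.0):
    """Log-probability of `query` under a Dirichlet-smoothed model of `doc`.

    `query`, `doc`, and `collection` are lists of tokens; `mu` is a
    hypothetical smoothing setting chosen for illustration.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        if p_coll == 0.0:
            # Assumption: ignore query terms that never occur in the collection.
            continue
        # Mix the document estimate with the collection (background) estimate.
        p_term = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p_term)
    return score

# Toy data, invented for this example.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "statistical machine translation of speech".split(),
}
collection = [token for tokens in docs.values() for token in tokens]
query = "language retrieval".split()

ranked = sorted(
    docs,
    key=lambda name: query_likelihood(query, docs[name], collection),
    reverse=True,
)
print(ranked)
```

The document that actually contains the query terms receives the higher score; smoothing keeps the other document's score finite instead of dropping to zero probability.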
Nightingale
The CIIR is embarking on a five-year DARPA project under the Global Autonomous Language Exploitation (GALE) program. The goal of GALE is to make foreign language (Arabic and Chinese) speech and text accessible to monolingual English speakers, particularly in military settings. The Nightingale research team includes UMass Amherst, Columbia University, International Computer Science Institute (ICSI), IDIAP Research Institute, HNC/Fair Isaac Corporation, New York University, National Research Council (NRC) Canada, Purdue University, RWTH Aachen University, University of California San Diego, University of Washington, Systran Software, and SRI International. The UMass Amherst team focuses on highly accurate retrieval, dynamic topic models, social network discovery, and statistical machine translation.
Searching Archives of Community Knowledge
In this NSF-funded project, we are studying the task of finding good answers in Collaborative Question Answering (CQA) archives by investigating techniques for question retrieval and comparing them to alternatives such as direct answer retrieval. The techniques that we are developing to search CQA archives could also have a significant impact on all types of search engines. Large CQA archives can be used as training data for models of text transformation. In other words, by developing models that learn to recognize when differently worded questions ask the same thing, we will also be learning how concepts or topics can be expressed in different ways. These transformation models could then be used to make the topic models used in search engines more robust, which should in turn improve the effectiveness of the system.
Text Reuse and Information Flow
In this NSF-funded project, we are studying a range of approaches to detecting reuse at the sentence level, and a range of approaches for combining sentence-level evidence into document-level evidence. We are also developing algorithms for inferring information flow from timelines, sources, and reuse measures. Given the importance of the Web as a source for detecting reuse, we also focus on techniques that can make efficient use of this huge but unwieldy resource. The research is being evaluated using a range of corpora, such as news, Web crawls, and blogs, in order to explore the dimensions of reuse and information flow in different situations.
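One simple way to picture the two levels described above is sketched below: sentences are compared by the overlap of their word trigrams, and a document-level score is the fraction of sentences that have a close match in the other document. This is only an illustrative baseline, not the project's method; the function names, the trigram size, and the 0.5 threshold are assumptions.

```python
import re

def shingles(sentence, n=3):
    """Set of word n-grams in a sentence (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", sentence.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def sentence_reuse(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two sentences."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def document_reuse(doc_a, doc_b, threshold=0.5):
    """Fraction of sentences in doc_a whose best match in doc_b meets the threshold."""
    if not doc_a or not doc_b:
        return 0.0
    reused = sum(
        1 for s in doc_a
        if max(sentence_reuse(s, t) for t in doc_b) >= threshold
    )
    return reused / len(doc_a)

# Toy documents, invented for this example; each is a list of sentences.
doc_a = ["The mayor announced a new transit plan.", "Weather was mild today."]
doc_b = ["Officials said the mayor announced a new transit plan.", "Stocks fell sharply."]

print(document_reuse(doc_a, doc_b))
```

Here one of the two sentences in `doc_a` is largely reused in `doc_b`, so half of the document counts as reused. Comparing every sentence pair does not scale to Web-sized collections, which is why efficient use of that resource is a research question in its own right.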
Learning Word Relationships Using TupleFlow
This NSF-funded Cluster Exploratory (CluE) project explores how semantic relationships between words, and the different ways words can be used to express the same content, can improve ranking effectiveness. We will find those relationships using the Google/IBM cluster and a new distributed computational framework that was developed at UMass Amherst. The TupleFlow system was developed for the type of indexing and analysis operations that are required for the study of word relationships on a large scale. TupleFlow is an extension of MapReduce, with advantages in terms of flexibility, scalability, disk abstraction, and low abstraction penalties.
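For readers unfamiliar with the MapReduce pattern that TupleFlow extends, the following single-machine sketch counts how often pairs of words occur in the same sentence, a basic signal for word relationships. It is plain Python and does not use TupleFlow or any cluster API; the map, shuffle, and reduce function names and the toy sentences are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_cooccurrence(sentence):
    """Map step: emit ((word_a, word_b), 1) for each distinct word pair in a sentence."""
    words = sorted(set(sentence.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1

def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: sum the counts for one word pair."""
    return key, sum(values)

def cooccurrence_counts(sentences):
    """Run map, shuffle, and reduce in sequence on one machine."""
    emitted = (kv for s in sentences for kv in map_cooccurrence(s))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())

# Toy input, invented for this example.
sentences = ["cheap flights", "cheap airfare", "cheap flights airfare"]
counts = cooccurrence_counts(sentences)
print(counts)
```

In a distributed setting the map and reduce steps run in parallel across many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.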
Machine Learning for Sequences and Structured Data: Tools for Non-Experts
In this NSF-funded ITR collaborative research project between UMass Amherst, UPenn, and CMU, the team is researching ways to dramatically improve the ability of people who are not experts in machine learning to design and automatically train models for analyzing and transforming sequences and other structured data such as text, signals, handwriting, and biological sequences.
Unified Graphical Models
"Unified Graphical Models of Information Extraction and Data Mining with Application to Social Network Analysis" is an NSF ITR research project that aims to improve the ability to mine information previously locked in unstructured natural language text. The research focuses on developing novel statistical models for information extraction and data mining that are so tightly integrated that the boundaries between them disappear, resulting in a powerful unified framework for extraction and mining.
Statistical Models for Information Extraction for REFLEX
In this project, UMass Amherst is a subcontractor to BBN Technologies on a DARPA-sponsored project to develop statistical models for information extraction that combine many sources of information in novel, integrated ways.
Automated Diagnosis of Usability Problems Using Statistical Computational Methods
The effects of poor usability range from mere inconvenience to disaster. Human factors specialists employ usability analysis to reduce the likelihood or impact of such failures. However, good usability analysis requires usability reports that are rarely collected, rarely complete, and difficult to analyze. The CIIR and Aptima have partnered on this AFOSR STTR project to develop a usability analysis system that addresses these problems.
CALO Project
As part of DARPA’s Perceptive Agent that Learns (PAL) program, SRI and team members including the CIIR are working on developing a next-generation "Cognitive Agent that Learns and Organizes" (CALO).
Confidence Measures for Information Extraction of Entities, Relations and Object Correspondence
In this NSF KDD project, UMass Amherst intends to advance the state of the art in associating confidence measures with information extracted from unstructured text. The team will build on its previous successful research in probabilistic models for confidence assessment of individual extracted text segments, and will provide new capabilities for confidence assessment of object correspondence and of relations between entities.