Current CIIR Projects


Lemur/Indri
The Lemur Project is a collaboration between the CIIR and the School of Computer Science at Carnegie Mellon University. The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. As part of the Lemur Project, the CIIR has developed Indri, a language model-based search engine for complex queries. In an NSF-funded CRI collaborative research project between UMass Amherst and CMU, the team is focusing on the continued development of the open-source Lemur software toolkit for language modeling and information retrieval.
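
Indri's ranking is rooted in the query-likelihood language modeling approach. As a rough sketch of that core idea (not Indri's actual implementation; the toy documents and the Dirichlet prior mu below are illustrative), a document can be scored against a query by smoothing its term statistics with the collection language model:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu=2500):
    """Dirichlet-smoothed query likelihood: log P(query | document model)."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0:
            continue  # term unseen in the whole collection; skip it here
        # Document counts are smoothed toward the collection model by prior mu.
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score

# Toy collection: the document about search ranks first for this query.
docs = {"d1": "language model based search engine for complex queries".split(),
        "d2": "a cat sat on the mat".split()}
coll = [t for d in docs.values() for t in d]
coll_tf, coll_len = Counter(coll), len(coll)
query = "language model search".split()
print(sorted(docs, key=lambda d: query_likelihood(query, docs[d], coll_tf, coll_len),
             reverse=True))  # ['d1', 'd2']
```

Indri layers a structured query language (operators for phrases, proximity, and combinations of evidence) on top of this kind of language-model scoring.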

Connecting the Ephemeral and Archival Information Networks
This NSF-funded project is a collaboration between the CIIR, Carnegie Mellon University, and RMIT University. The team will use the explicit and implicit links between the ephemeral and archival networks to improve the effectiveness of search that is targeted at social data, web data, or both. The researchers will demonstrate the validity of this hypothesis using a range of existing TREC tasks focused on either social media search or web search. In addition, they will explore two new tasks, conversation search and aggregated social search, which can exploit the integrated network of ephemeral and archival information.

Understanding the Relevance of Text Passages
Developing effective passage retrieval would have a major effect on search tools by greatly extending the range of queries that could be answered directly using text passages retrieved from the web. This is particularly important for mobile search applications, where output bandwidth is limited by either a small screen or speech output. In this case, the ability to use passages to reduce the amount of output while maintaining high relevance will be critical. In this NSF-funded project, we study research issues that have either been ignored, or only partially addressed, in prior research, such as showing whether passages can be better answers than documents for some queries, predicting which queries have good answers at the passage level, ranking passages to retrieve the best answers, and evaluating the effectiveness of passages as answers.
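
A common baseline formulation of passage retrieval slides a fixed-size window over each document and returns the best-scoring window. The sketch below uses plain query-term overlap as the scoring function; the window size, stride, and scoring are illustrative stand-ins for the trained rankers studied in projects like this one:

```python
def windows(tokens, size=50, stride=25):
    """Yield overlapping fixed-size passages as (start offset, token window)."""
    for start in range(0, max(1, len(tokens) - size + 1), stride):
        yield start, tokens[start:start + size]

def best_passage(query_terms, tokens, size=50, stride=25):
    """Return the window with the highest query-term overlap.

    A real system would use a language model or learned ranker here;
    term overlap is just the simplest possible stand-in.
    """
    qset = set(query_terms)
    scored = [(sum(t in qset for t in w), start, w)
              for start, w in windows(tokens, size, stride)]
    score, start, w = max(scored)
    return start, score, " ".join(w)

doc = ("passage retrieval aims to answer queries directly with short "
       "text passages rather than whole documents").split()
print(best_passage("passage retrieval answers".split(), doc, size=10, stride=5))
```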

Topical Positioning System (TPS) for Informed Reading of Web Pages
This NSF-funded project addresses the challenge of increasing the critical literacy of people looking for information on the Web, including information regarding healthcare, policy, or any other broadly discussed topic. The research on the Topical Positioning System (TPS) drives the vision of developing a browser tool that shows a person whether the web page in front of them discusses a provocative topic, whether the material is presented in a heavily biased way, whether it represents an outlier (fringe) idea, and how its discussion of issues relates to the broader context and to information presented in "familiar" sources.

Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
This NSF-funded Data Intensive Computing project is a collaboration with Tufts University and the Internet Archive. It aims for transformative advances in current technology, using data-intensive processing of large corpora to provide improved automatic support for search and analysis. This research is being carried out on a collection of over a million scanned books gathered by the Internet Archive; the collection includes 8.5 terabytes of text and half a petabyte of scanned images. CIIR researchers will develop new approaches to processing the large collection. The resulting improved corpus will be indexed at the Internet Archive, allowing more accurate and powerful search. Researchers at Tufts University will develop approaches for exploratory data analysis on the processed collection.

Proteus - Supporting Scholarly Information Seeking Through Text-Reuse Analysis and Interactive Corpus Construction
This project, funded by the Andrew W. Mellon Foundation, is a collaboration with Northeastern University's NULab for Texts, Maps, and Networks to develop the Proteus toolset, which lets researchers in the digital humanities explore the contents of large, unstructured collections of historical books (two million out-of-copyright books), newspapers, and other documents. Users of the Proteus system will be able to interactively and incrementally build up collections by analyzing networks of text reuse among books, passages, authors, and journals; provide feedback on terms, phrases, named entities, and metadata; and explore these growing collections during search, while browsing, and with an interactive full-text visualization tool.
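
Text-reuse analysis of the kind Proteus builds on is commonly approximated by comparing word n-gram "shingles": passages sharing many shingles become candidate reuse pairs, and those pairs form the edges of the reuse network. A minimal sketch (the 5-gram size and any threshold applied to the score are illustrative assumptions):

```python
def shingles(text, n=5):
    """The set of word n-grams ('shingles') in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def reuse_score(text_a, text_b, n=5):
    """Jaccard similarity of two passages' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Passage pairs scoring above some threshold become edges in a network of
# text reuse among books, passages, authors, and journals.
```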

NSF - Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"
This NSF-funded project addresses research in relation extraction based on a "universal schema," in which the researchers learn a generalizing model over the union of all input schemas, including multiple available pre-structured knowledge bases as well as all of the observed natural language surface forms. The approach thus embraces the diversity and ambiguity of the original language surface forms (not trying to force relations into pre-defined boxes), yet also generalizes by learning non-symmetric implicature among explicit and implicit relations, using new extensions to probabilistic matrix factorization and vector embedding methods.
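
In the universal schema formulation, knowledge-base relations and textual surface forms are the columns of a large matrix whose rows are entity pairs; matrix factorization assigns an embedding to each row and column, and the model scores a (pair, relation) cell with the sigmoid of their dot product. A minimal numpy sketch of that scoring step (dimensions and random initialization are illustrative; training is only summarized in a comment):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_relations, dim = 1000, 200, 50

# One low-dimensional embedding per entity pair and per relation/surface form.
pair_emb = rng.normal(scale=0.1, size=(n_pairs, dim))
rel_emb = rng.normal(scale=0.1, size=(n_relations, dim))

def prob(pair_idx, rel_idx):
    """P(relation holds for entity pair) = sigmoid(embedding dot product)."""
    return 1.0 / (1.0 + np.exp(-(pair_emb[pair_idx] @ rel_emb[rel_idx])))

print(prob(0, 0))
# Training (not shown) raises the scores of observed cells -- drawn from both
# knowledge bases and text -- relative to sampled unobserved cells, which is
# what lets the model infer implicature between explicit and implicit relations.
```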

NSF - The Synthesis Genome: Data Mining for Synthesis of New Materials
This NSF-funded project is a collaboration with UMass IESL and MIT. The project's research will develop the framework to do for materials synthesis what modern computational methods have done for materials properties: build predictive tools for synthesis so that targeted compounds can be synthesized in a matter of days, rather than months or years. Researchers will pursue an innovative approach leveraging documentation of compound synthesis compiled over decades of scientific work by using natural language processing (NLP) techniques to automatically extract synthesis methods from hundreds of thousands of peer-reviewed papers.
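
As a toy illustration of the extraction task (not the project's actual pipeline, which relies on trained NLP models rather than hand-written patterns), even simple patterns can pull candidate synthesis conditions such as temperatures and durations out of a methods sentence:

```python
import re

sentence = ("The precursor powders were ball-milled, pressed into pellets, "
            "and sintered at 950 C for 12 h in air.")

# Illustrative patterns only; real method sections need far more robust NLP.
temps = re.findall(r"(\d+(?:\.\d+)?)\s*(?:°\s*C|C)\b", sentence)
times = re.findall(r"(\d+(?:\.\d+)?)\s*(h|hours?|min|minutes?)\b", sentence)

print(temps)  # ['950']
print(times)  # [('12', 'h')]
```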

NSF - Flexible Machine Learning for Natural Language in the MALLET Toolkit
This NSF-funded project aims to enhance the MALLET (MAchine Learning for LanguagE Toolkit) and FACTORIE (Factor graphs, Imperative, Extensible) open-source software toolkits. These provide many modern, state-of-the-art machine learning methods, specially tuned to be scalable for the idiosyncrasies of natural language data, while also applying well to many other discrete non-language tasks. The research team is broadening these toolkits' applicability to new data and tasks (with better end-user interfaces for labeling, training, and diagnostics), enhancing their research-support capabilities (with infrastructure for flexibly specifying model structures), and improving their understandability and support (with new documentation, examples, and online community support).

NSF - New Methods to Enhance Our Understanding of the Diversity of Science
This NSF-funded project focuses on the development of analytical tools that capture the diversity of science. The work moves beyond traditional "citation-counting" methods that focus only on the rate of scientific innovation. The project's primary goal is to develop and implement new methods, grounded in the computer science literature (specifically statistical topic modeling and social network analysis), for analyzing the impact of science policy interventions on the diversity of science.
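
One plausible way to operationalize diversity with topic models (an illustrative measure, not necessarily the project's published metric) is to fit a topic model over a portfolio of papers or grant abstracts and compute the entropy of the portfolio's aggregate topic distribution, so that even coverage of many topics scores higher than concentration on a few:

```python
import numpy as np

def topic_diversity(doc_topic_matrix):
    """Shannon entropy of the mean topic distribution over a document set.

    doc_topic_matrix: (n_docs, n_topics) per-document topic proportions,
    e.g., from LDA. A portfolio concentrated on one topic scores near 0;
    one spread evenly over k topics scores near log2(k).
    """
    mean_dist = np.asarray(doc_topic_matrix).mean(axis=0)
    mean_dist = mean_dist / mean_dist.sum()
    nonzero = mean_dist[mean_dist > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

print(topic_diversity([[0.9, 0.05, 0.05],
                       [0.8, 0.10, 0.10]]))  # low: concentrated portfolio
print(topic_diversity([[0.4, 0.3, 0.3],
                       [0.3, 0.4, 0.3]]))    # higher: evenly spread
```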