The Lemur Toolkit for Language Modeling and Information Retrieval

Language modeling has recently emerged as an attractive new framework for text information retrieval, building on work from other areas such as speech recognition and statistical natural language processing. Research carried out at a number of sites has confirmed that the language modeling approach is an effective and theoretically attractive probabilistic framework for building information retrieval (IR) systems.

The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval, where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification. The toolkit supports indexing of large-scale text databases, the construction of simple language models for documents, queries, or subcollections, and the implementation of retrieval systems based on language models as well as a variety of other retrieval models. The system is written in C and C++ and is designed as a research system to run under Unix operating systems, although it can also run under Windows.
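
To make the idea of retrieval based on language models concrete, the following is a minimal sketch of query-likelihood ranking over unigram document models. It is not Lemur code and does not use the toolkit's API: the function names, the whitespace tokenizer, the choice of Dirichlet-prior smoothing, and the default mu=2000 are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Lowercase, split on whitespace, and count terms (hypothetical tokenizer)."""
    return Counter(text.lower().split())


def query_log_likelihood(query, doc_counts, coll_counts, mu=2000.0):
    """Log P(query | document model), with Dirichlet-prior smoothing.

    P(t | d) = (count(t, d) + mu * P(t | collection)) / (|d| + mu)
    The smoothing method and mu are assumptions of this sketch.
    """
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    if coll_len == 0:
        return 0.0  # empty collection: nothing to score against
    score = 0.0
    for term in query.lower().split():
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never seen in the collection: skipped in this sketch
        p_term = (doc_counts.get(term, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_term)
    return score


def rank(query, docs, mu=2000.0):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    per_doc = {doc_id: unigram_counts(text) for doc_id, text in docs.items()}
    coll_counts = Counter()
    for counts in per_doc.values():
        coll_counts.update(counts)  # collection model = sum of document counts
    scored = [
        (doc_id, query_log_likelihood(query, counts, coll_counts, mu))
        for doc_id, counts in per_doc.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Each document is treated as a sample from its own unigram model, and documents are ranked by how probable the query is under that model. Smoothing against the collection model keeps a document from receiving zero probability merely because it lacks one query term.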

The toolkit is being developed as part of the Lemur Project, a collaboration between the Center for Intelligent Information Retrieval (CIIR), Department of Computer Science at the University of Massachusetts Amherst and the School of Computer Science at Carnegie Mellon University. As an extension to the Lemur Project, the CIIR has developed INDRI, a language-model-based search engine for complex queries.

More details on the Lemur Project and the software download can be found at http://www.lemurproject.org/.

The Lemur Project is sponsored in part by the Advanced Research and Development Activity in Information Technology (ARDA) under its Statistical Language Modeling for Information Retrieval Research Program and by the National Science Foundation (NSF grants CNS-1405829, CNS-1405045, IIS-1160894, IIS-1160862, CNS-0934322, CNS-0934358, IIS-0948856, IIS-0841275, IIS-0707801, CNS-0454018).