CRI: CRD - Supporting User Data, Privacy, and Evaluation in the Lemur Toolkit
PI: W. Bruce Croft, Distinguished Professor, Computer Science, University of Massachusetts Amherst
Co-PI: Jamie Callan, Professor, Computer Science, Carnegie Mellon University
The Lemur Toolkit, abbreviated Lemur in this proposal, was developed by Carnegie Mellon University and the University of Massachusetts Amherst. Lemur was initially developed to support research at the two universities and to speed up technology transfer to other research groups. Over the last eight years, it has become a flexible, state-of-the-art research platform for a broad community of researchers studying information retrieval, question answering, and other text and language analysis tasks. Lemur is not a general-purpose web search engine (e.g., Google), and it does not compete with commercial IR systems.
The Lemur Toolkit is a set of indexing modules and retrieval models that share a common API, and a set of useful applications (search, clustering, etc.). From this toolkit, a person can select the indexing method, retrieval algorithm, and application that best fit a specific task.
Lemur supports most of the popular retrieval models (e.g., vector-space, tf.idf, Okapi, InQuery), as well as newer retrieval models based on statistical language models (e.g., KL-divergence, relevance models). It also supports retrieval in other languages (e.g., Arabic, Chinese), structured documents (e.g., HTML, simple XML), document annotation (e.g., part-of-speech, named entity), sophisticated structured queries, federated search of many text search engines, and clustering. Lemur is designed to be extensible; new indexing techniques, retrieval methods, and tasks (“applications”, in Lemur jargon) can be added without major disruption to the rest of the system. UMass and CMU continue to add new techniques and tasks to Lemur, as needed by, and funded by, other projects.
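As a rough illustration of the language-model family mentioned above, the following sketch ranks documents by a quantity that is rank-equivalent to negative KL divergence between a maximum-likelihood query model and a smoothed document model. It is not Lemur's implementation: the function name `kl_score`, the choice of Jelinek-Mercer smoothing, and the default `lam=0.5` are assumptions made for this example only.

```python
import math
from collections import Counter


def kl_score(query_terms, doc_terms, collection_counts, collection_len, lam=0.5):
    """Rank-equivalent KL-divergence score (higher is better).

    Computes sum_w P(w|Q) * log P(w|D), which differs from -KL(Q||D)
    only by the query entropy, a constant for a given query.
    P(w|D) uses Jelinek-Mercer smoothing (an assumption of this sketch):
        P(w|D) = (1 - lam) * tf(w, D) / |D| + lam * cf(w) / |C|
    """
    query_counts = Counter(query_terms)
    doc_counts = Counter(doc_terms)
    query_len = sum(query_counts.values())
    doc_len = len(doc_terms)

    score = 0.0
    for term, q_count in query_counts.items():
        p_query = q_count / query_len
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts.get(term, 0) / collection_len
        p_smoothed = (1 - lam) * p_doc + lam * p_coll
        if p_smoothed == 0.0:
            # Term occurs nowhere in the collection, so it cannot
            # distinguish between documents; skip it.
            continue
        score += p_query * math.log(p_smoothed)
    return score


# Toy collection of two tokenized documents (hypothetical data).
doc_a = ["lemur", "toolkit", "search"]
doc_b = ["web", "search", "engine"]
collection = Counter(doc_a + doc_b)
collection_size = len(doc_a) + len(doc_b)

query = ["lemur", "search"]
score_a = kl_score(query, doc_a, collection, collection_size)
score_b = kl_score(query, doc_b, collection, collection_size)
```

In this toy collection, `doc_a` contains both query terms and therefore receives the higher score; `doc_b` is still scorable because smoothing assigns "lemur" a nonzero collection probability.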
The most advanced of Lemur’s indexing and retrieval components is the Indri search engine. Indri supports up to a few terabytes of text, structured (HTML, simple XML) documents, and extensive text annotations (part-of-speech, named-entity, etc.). Indri also supports a sophisticated, flexible, and somewhat extensible query language that allows queries to reference document content, metadata (attributes), annotations, and structure. Most of the recent information retrieval research at UMass and CMU is done with Indri, and it has quickly been adopted by the broader research community.
We have also recently added the Lemur toolbar, a configurable application for web browsers that monitors a variety of user actions, collects the data, and forwards it to a server. The first release (June 2008) supported Firefox. The second release added support for Internet Explorer, improved support for multilingual environments, added the ability to track advertising, and made it easier to add new search engines. The third release (June 2009) will include more upload options to better support a variety of usage scenarios. The toolbar is used with the Lemur transaction database, a SQL database application designed to support search and analysis of large amounts of user data, and the Lemur privacy tools, a collection of filters that provide different levels of privacy for the user data.
Another recent addition is the Lemur evaluation tool, a package that includes all commonly-used evaluation measures for IR experiments and emphasizes new user-oriented measures such as NDCG.
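For readers unfamiliar with NDCG, a minimal sketch of one common formulation follows. It is not the code of the Lemur evaluation tool: the function names are hypothetical, and the sketch assumes linear gain with a log2 rank discount and computes the ideal ranking only from the gains passed in. Other variants (for example, exponential gain) are also in use.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: sum of gain / log2(rank + 1), ranks from 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """Normalized DCG of a ranked list of graded relevance values.

    `gains` lists the relevance grade of each retrieved document in
    ranked order. The ideal ordering is obtained by sorting these same
    grades, which is a simplifying assumption of this sketch.
    """
    ideal = sorted(gains, reverse=True)
    ranked = gains
    if k is not None:
        ranked = ranked[:k]
        ideal = ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])   # already in ideal order
reversed_ = ndcg([1, 2, 3])  # best document ranked last
```

A ranking already sorted by relevance scores 1.0, and any ranking that places more relevant documents lower scores strictly less, which is what makes the measure sensitive to the positions a user is most likely to inspect.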
As part of the Lemur Toolkit 4.9 release (April 6, 2009), we have made a number of improvements to Indri to enable processing of the ClueWeb09 billion web page corpus [http://boston.lti.cs.cmu.edu/Data/clueweb09/]. We have added a WARC input processor. We have also optimized a number of the processing steps, especially in the input parsing and in-memory indexing phases, producing a 10%-15% speedup in indexing time for collections of 25-50 million documents. This work has identified a number of candidates for redesign as collections scale up to one billion documents. While the 4.9 release makes it possible to index ClueWeb09, robust and efficient support for collections of that size will be a major focus for future development.
In addition to maintaining and improving the Lemur toolkit, we also carry out research aimed at new search functionality that will be supported by Lemur. The main example of this activity during the past year is the development and evaluation of new retrieval models for semi-structured data.
One of our major initiatives in the area of education is based on the recent publication of the textbook Search Engines: Information Retrieval in Practice by Croft and two of his ex-students (both of whom worked on the Lemur project). This is the first textbook on search engines and information retrieval aimed primarily at undergraduates. It is designed to give students the understanding and tools they need to evaluate, compare and modify search engines. The programming exercises in the book make extensive use of Galago, a Java-based open source search engine developed by one of the authors, Trevor Strohman, who was also the primary developer for the Indri search engine.
Graduate students and Researchers/Programmers involved in this project:
* David Fisher, Senior Software Engineer, University of Massachusetts Amherst
* Mark Hoy, Senior Research Programmer, Carnegie Mellon University
* Samuel Huston, Graduate Student, University of Massachusetts Amherst
* Jinyoung Kim, Graduate Student, University of Massachusetts Amherst
* Anagha Kulkarni, Graduate Student, Carnegie Mellon University
* Le Zhao, Graduate Student, Carnegie Mellon University
* Yangbo Zhu, Graduate Student, Carnegie Mellon University
More details on the Lemur project and the software download can be found at: http://www.lemurproject.org/.
Recent Publications:
Kim, J., Xue, X., and Croft, W.B., "A Probabilistic Retrieval Model for Semistructured Data," to appear in the Proceedings of the 31st European Conference on Information Retrieval (ECIR 09) Toulouse, France, April 6-9, 2009.
Kim, J., and Croft, W.B., "Defining the Desktop Search Problem : Test Collection Generation and Evaluation," submitted to the 32nd Annual ACM SIGIR conference, Boston, MA, USA, July 19-23, 2009.
Y. Zhu, J. Callan, and J. Carbonell, "The impact of history length on personalized search (Poster Description)", in the Proceedings of the Thirty First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Singapore, pp. 715-718, (2008).
Y. Zhu, L. Zhao, and J. Callan, "Structured queries for legal documents search", in the Electronic Proceedings of the 2007 Text REtrieval Conference (TREC 2007). National Institute of Standards and Technology, special publication, (2007).
Y. Zhu, J. Callan, and J. Carbonell, "Patterns of query reformulation and personalized search", submitted to the Seventeenth International Conference on Information and Knowledge Management (CIKM '08), (2008).
L. Zhao and J. Callan, "A generative retrieval model for structured documents", in the Proceedings of the Seventeenth International Conference on Information and Knowledge Management (CIKM '08), pp. 1163-1172, (2008).
L. Zhao and J. Callan, "Effective and efficient structured retrieval", submitted to the Thirty Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (2009).
Details on an NSF-funded Lemur project completed in June 2008: CRI: Developing the Lemur Toolkit into a Community Resource
This project is sponsored by the National Science Foundation grant #0707801.