Download Center
Before proceeding with any download, check our "License Agreement".
We would also like you to consider registering. It only takes a minute and gives us a contact to notify about any updates to the software.
List of available downloads:
Question Classifier
Wei Li
Last Updated: Mar. 28, 2002
This is a question classifier that maps natural-language questions into entity types, like PERSON, LOCATION, NUMBER, and so on. It contains three different models: one rule-based question pattern model and two probabilistic language models.
Novelty Track, TREC
Alvaro Bolivar
Last Updated: Jan. 14, 2003
This is a collection-building toolkit to assemble the training set used by the CIIR in its participation in the Novelty track at TREC 2002. For details, see the conference proceedings.
KStem Java Implementation
Sergio Guzman-Lara
Last Updated: Apr. 27, 2007
This is the source code of a Java implementation of KStem (a stemmer designed by Bob Krovetz). In particular, this implementation is adapted for Lucene. To install, download KStem.jar to Lucene's src directory and unjar it there. Then compile.
Word Image Data Sets *Requires Registration*
Toni Rath
Last Updated: Jan. 07, 2003
Data sets containing word images from the George Washington collection with meta-data for retrieval performance evaluation.
Table Extractor
David Pinto and Xing Wei
Last Updated: Mar. 29, 2003
This software package is a table tagger, which processes text tables in documents by tagging each cell. The input is a file that may have many documents in it. The outputs are the processed table cells, extracted tables, tagged file and non-table text.
IESL
The IESL Lab
Downloadable code and data from the Information Extraction and Synthesis Laboratory (IESL) can be found at http://www.cs.umass.edu/~mccallum/code-data.html.
Event Threading experiment
Nallapati, R., Feng, A., Peng, F., and Allan, J.
Last Updated: Feb. 22, 2005
This is the experimental data from "Event Threading within News Topics" in the Proceedings of the CIKM 2004 conference, pp. 446-453.
Indri
The Lemur Project
Indri is a new search engine from the Lemur project, a cooperative effort between the University of Massachusetts Amherst and Carnegie Mellon University to build language modeling information retrieval tools.
Stemming Class from Stemming and Cooccurrence on a Larger Corpus
Jeremy Pickens
Three sets of experiments were done, using initial classes created by (1) the Porter stemmer, (2) K-Stem, and (3) the Porter stemmer classes merged in a connected component manner with the K-Stem classes.
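As a rough illustration of what merging classes "in a connected component manner" could look like, the sketch below treats each stem class as a set of words and joins any two classes that share a word. It is not the code used in the experiments: the function name merge_classes and the toy classes are assumptions made up for this example, and the word sets are not actual Porter or K-Stem output.

```python
from collections import defaultdict


def merge_classes(*class_sets):
    """Merge several collections of word classes by connected components.

    Two words land in the same merged class if they are linked, directly
    or through a chain of shared words, by any input class.
    """
    parent = {}  # union-find forest: word -> parent word

    def find(word):
        parent.setdefault(word, word)
        # Path halving keeps the trees shallow.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for classes in class_sets:
        for cls in classes:
            words = list(cls)
            if not words:
                continue
            find(words[0])  # register singleton classes too
            for other in words[1:]:
                union(words[0], other)

    merged = defaultdict(set)
    for word in list(parent):
        merged[find(word)].add(word)
    # Sort for a stable, readable result.
    return sorted(sorted(group) for group in merged.values())


# Hypothetical toy classes, for illustration only.
porter_like = [{"general", "generous", "generic"}, {"police", "policy"}]
kstem_like = [{"general", "generalize"}, {"policy", "policies"}]

for group in merge_classes(porter_like, kstem_like):
    print(group)
```

Under these assumptions, a class from one stemmer absorbs every class from the other stemmer that it overlaps with, so the merged classes are at least as large as either input.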