The following material is available for download from the CIIR. It is provided without warranty and without support. If there are problems accessing or using any of this material, we would appreciate being told (info at …) in case we can address the issue.

DeepMerge: Merging Multiple Search Result Lists - December 2015

C.J. Lee, Bruce Croft

DeepMerge Merging of Multiple Search Result Lists

nfL6: Yahoo Non-Factoid Question Dataset - November 2015

Daniel Cohen, Bruce Croft

This dataset contains 87,361 questions derived from the Yahoo Webscope L6 collection and their corresponding best, and additional, answers submitted by users. Only the best answer was reviewed in determining answer quality.

The dataset file is in JSON format and may be downloaded in rar or gzip compressed format.

nfL6: Yahoo Non-Factoid Question Dataset

"Search Engines" Galago Source Code

Trevor Strohman

An old and now obsolete version (1.04) of the Galago Java source code that was referenced as a learning resource in the textbook "Search Engines: Information Retrieval in Practice" by Croft, Metzler and Strohman (2009).

It has been moved from its former Google Code repository to here.

"Search Engines" Galago Source Code

Web Annotated Passages - September 2015 Dataset

Liu Yang, Bruce Croft

This dataset contains 7,499 documents with four grades of sentence-level relevance judgment annotations for 82 queries derived from the TREC Gov2 web collection. The dataset is archived in rar or gzipped tar file formats. The dataset is described and used in Keikha, Park and Croft, SIGIR 2014.

Web Annotated Passages Dataset

Wikipedia Bullet Points June 2013 Dataset

John Foley, James Allan

This dataset contains over 40,000 bullet-point "facts" mined from English Wikipedia year pages in the June 2013 English XML dump. (3.3M, gzipped JSON)

Wikipedia Bullet Points Dataset

Twitter June - July 2014 Dataset

Nada Naji, James Allan

This dataset consists of 71,564,914 Twitter IDs that have been automatically crawled over the period from mid June through early July 2014.

Twitter IDs Dataset

Query Facet and Facet Feedback Annotations

Weize Kong, James Allan

This dataset consists of the query facet and facet feedback annotations used in Weize Kong and James Allan, "Extending Faceted Search to the General Web," CIKM 2014.

Faceted Web Search Dataset

KB Bridge Entity Linking System

Jeffrey Dalton, Pat Verga, Laura Dietz

KB Bridge is an entity linking system which identifies named entities in free text and links them to entries in a semistructured knowledge base, such as Freebase or Wikipedia. See also Dalton, J. and Dietz, L., "A Neighborhood Relevance Model for Entity Linking," OAIR 2013.

KB Bridge Code and Instructions

Online Appendix for Entity Query Feature Expansion with Knowledge Base Links

Jeffrey Dalton, Laura Dietz, James Allan

This online appendix provides additional material for entity query feature expansion, such as additional gold standard annotations, produced rankings, entity-based features, and software. See also Dalton, J., Dietz, L. and Allan, J., "Entity Query Feature Expansion using Knowledge Base Links," SIGIR 2014.

Entity Query Feature Expansion Appendix

Controversy Annotation Dataset

Shiri Dori-Hacohen, James Allan

This collection consists of controversy annotations for 445 webpages and 2060 Wikipedia articles. See also Dori-Hacohen, S. and Allan, J., "Detecting Controversy on the Web," CIKM 2013; Dori-Hacohen, S. and Allan, J., "Automated Controversy Detection on the Web," ECIR 2015.

Controversy Annotation Dataset

Open Library

Henry Feild

This collection consists of 46,561,553 metadata records crawled from the Open Library on November 30, 2011 and click distributions over records for 22,622 queries recorded over the year October 2010 through September 2011.

Open Library Dataset

RETAS OCR Evaluation Dataset

Zeki Yalniz, R. Manmatha

This dataset was created to evaluate the optical character recognition (OCR) accuracy of scanned books. It is provided here for research purposes. The dataset is extracted from books in Project Gutenberg and the Internet Archive.

RETAS OCR Evaluation Dataset

Book Translation Detection Dataset

Zeki Yalniz, R. Manmatha, Kriste Krstovski, David A. Smith

These datasets were created to evaluate the effectiveness of the translation detection frameworks. The 2K dataset was created by Krstovski and Smith (2011), and the list of translation pairs was updated for use in Yalniz and Manmatha (2012). The 30-book dataset was created for use in Yalniz and Manmatha (2012). This dataset is for research purposes only.

Translation Detection Dataset

Book Duplicate Detection Dataset

Zeki Yalniz, E. F. Can, R. Manmatha

This dataset was created to evaluate the effectiveness of the partial duplicate detection framework for scanned books proposed by Yalniz, Can and Manmatha (2011). This dataset is for research purposes only.

Duplicate Detection Dataset

Searcher Frustration User Study Data

Henry Feild

This is a dataset collected during a user study of frustration during web search at the University of Massachusetts Amherst in October 2009. The study consists of query logs and sensor readings for thirty participants.

This is available under an Open Database/Database Content license. Feel free to use, redistribute, and modify the dataset, but make sure to make it available under the same license and to give due attribution in any public use of the dataset.

Searcher Frustration Data Set

Word Image Data Sets

Toni Rath

Data sets containing word images from the George Washington collection with meta-data for retrieval performance evaluation.

Image Datasets

Stemming Class from Stemming and Cooccurrence on a Larger Corpus

Jeremy Pickens

Three sets of experiments were done, using initial classes created by (1) the Porter stemmer, (2) K-Stem, and (3) the Porter stemmer classes merged in a connected component manner with the K-Stem classes.
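
The third setup, merging two sets of stem classes "in a connected component manner", can be illustrated with a small sketch. This is not the code used in the experiments: it assumes each stemmer yields a list of word sets (one set per class), the function name `merge_classes` is hypothetical, and the example words are invented toy classes rather than real Porter or K-Stem output. Two classes end up merged whenever they share a word, directly or through a chain of other classes.

```python
from collections import defaultdict


def merge_classes(*class_sets):
    """Merge several collections of stem classes via connected components.

    Each argument is an iterable of word sets. Words that co-occur in any
    class, directly or transitively, land in the same merged class.
    (Illustrative helper; the name and input shape are assumptions.)
    """
    parent = {}

    def find(word):
        # Register unseen words as their own root, then walk to the root.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for classes in class_sets:
        for cls in classes:
            words = list(cls)
            if not words:
                continue
            find(words[0])  # make sure singleton classes are recorded
            for other in words[1:]:
                union(words[0], other)

    # Group every word under the root of its component.
    groups = defaultdict(set)
    for word in list(parent):
        groups[find(word)].add(word)
    return sorted(sorted(group) for group in groups.values())


# Toy input: invented classes standing in for two stemmers' outputs.
classes_a = [{"run", "running"}, {"runner"}]
classes_b = [{"running", "runner"}, {"ran"}]
merged = merge_classes(classes_a, classes_b)
print(merged)  # [['ran'], ['run', 'runner', 'running']]
```

Here "run" and "runner" never share a class in either input, yet they end up together because "running" links them; that transitive closure is what distinguishes a connected-component merge from a simple union of the two class lists.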


Event Threading Experiment

Nallapati, R., Feng, A., Peng, F., and Allan, J.

This is the experimental data from "Event Threading within News Topics," in the Proceedings of the CIKM 2004 conference, pp. 446-453.

Event Threading

Novelty Track, TREC

Alvaro Bolivar

This is a collection-building toolkit for assembling the training set used by the CIIR in its participation in the Novelty track at TREC 2002. For details, see the conference proceedings.