The movie-search-ml20 dataset

This dataset is built from questions asked by real users on Yahoo! Answers, a community question answering website. We adopt the data collected by Hagen et al. (2015) from Yahoo! Answers and focus on the questions in the "movies" category. The task is treated as known-item search: the questions are long and descriptive, and each question has a single relevant movie. We manually mapped each question to the ID of its relevant movie in the MovieLens 20M dataset. This gives researchers both user-item interactions and query-item relevance signals on the same item set. The dataset was originally used to evaluate joint search and recommendation models (Zamani and Croft, 2020).

If you use this dataset, please cite the following papers:

Do not hesitate to contact Hamed Zamani if you have any questions.


The dataset consists of 919 questions. Each question is associated with one movie ID in the MovieLens 20M movie set.
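Because every question maps to a single MovieLens 20M movie ID, the questions can be joined with the MovieLens ratings on that ID. The following is a minimal sketch of such a join, not the authors' code. It assumes a hypothetical three-column layout for the question file (question ID, question text, movie ID) and uses made-up toy rows in place of the real files; only the ratings header (userId, movieId, rating, timestamp) follows the MovieLens 20M ratings file.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the real files. The question-file column order is an
# assumption made for illustration; the rows themselves are invented.
QUESTIONS_TSV = (
    "q1\tplaceholder question text A\t2762\n"
    "q2\tplaceholder question text B\t1\n"
)
RATINGS_CSV = (
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "2,1,5.0,964982931\n"
    "2,2762,4.5,964983000\n"
)


def load_questions(handle):
    """Read (question_id, question_text, movie_id) tuples from a TSV handle."""
    # QUOTE_NONE keeps quotation marks inside question text untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(qid, text, int(movie_id)) for qid, text, movie_id in reader]


def load_raters(handle):
    """Map each movie ID to the set of user IDs that rated it."""
    raters = defaultdict(set)
    for row in csv.DictReader(handle):
        raters[int(row["movieId"])].add(int(row["userId"]))
    return raters


questions = load_questions(io.StringIO(QUESTIONS_TSV))
raters = load_raters(io.StringIO(RATINGS_CSV))

# For each question, list the users who interacted with its relevant movie.
joined = {qid: sorted(raters.get(movie_id, set())) for qid, _, movie_id in questions}
print(joined)
```

With the real files, the `io.StringIO` objects would be replaced by open file handles, and the tuple unpacking in `load_questions` adjusted to the actual column layout.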

Download: The data is publicly available for research purposes: click here.

The file is tab-separated (TSV). Each row contains: