The movie-search-ml20 dataset
Background
This dataset is built from questions asked by real users on Yahoo! Answers, a community question answering website. We adopt the data collected by Hagen et al. (2015) from Yahoo! Answers. This dataset focuses on the questions in the "movies" category. The task is treated as known-item search: the questions are long and descriptive, and each question has a single relevant movie. We manually mapped each question to the ID of its relevant movie in the MovieLens 20M dataset. This gives researchers both user-item interactions and query-item relevance signals over the same item set. The dataset was originally used for evaluating joint search and recommendation models (Zamani and Croft, 2020).
If you use this dataset, please cite the following papers:
- Hamed Zamani and W. Bruce Croft, "Learning a Joint Search and Recommendation Model from User-Item Interactions". In WSDM 2020.
- Matthias Hagen, Daniel Wagner, and Benno Stein, "A Corpus of Realistic Known-Item Topics with Associated Web Pages in the ClueWeb09". In ECIR 2015.
If you have any questions, do not hesitate to contact Hamed Zamani (zamani@cs.umass.edu).
The dataset consists of 919 questions. Each question is associated with one movie ID in the MovieLens 20M movie set.
Download: The data is publicly available for research purposes: click here.
The file is tab-separated (tsv). Each row contains:
- Timestamp: the time at which the question was asked on Yahoo! Answers.
- Subject: the subject of the question asked by the user.
- Content: the content of the question post written by the user.
- Answer Doc ID: the document ID of the relevant movie in the ClueWeb09 collection.
- Answer URL: the URL of the relevant movie (usually a Wikipedia page).
- Answer MovieLens20m ID: the ID of the relevant movie in the MovieLens 20M dataset.
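The row layout above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the field names in `FIELDS` are our own labels (not names defined by the dataset), the columns are assumed to appear in the order listed above, and the file is assumed to have no header row. If your copy starts with a header line, drop the first record. The sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Labels for the six columns, in the order listed above.
# These names are illustrative; the dataset does not define them.
FIELDS = [
    "timestamp",
    "subject",
    "content",
    "answer_doc_id",
    "answer_url",
    "movielens_id",
]


def read_questions(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-separated row that has exactly six fields."""
    # QUOTE_NONE keeps quotation marks inside question text as literal characters.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        yield dict(zip(FIELDS, row))


# A made-up row standing in for the real file.
sample = (
    "2010-01-01\tOld movie?\tA boy finds a robot in a barn.\t"
    "clueweb09-en0000-00-00000\thttp://en.wikipedia.org/wiki/Example\t123\n"
)
questions = list(read_questions(io.StringIO(sample)))
print(questions[0]["subject"], questions[0]["movielens_id"])
```

To read the downloaded file instead of the sample, pass an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points to wherever you saved the file; the UTF-8 encoding is also an assumption. All values are returned as strings, so convert `movielens_id` to an integer before joining it with the MovieLens 20M movie IDs.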