Background

The UMass Amherst Center for Intelligent Information Retrieval created two high-quality evaluation datasets for non-factoid questions, each with multiple associated answer types. An example of such a question and its answer types is the following:

Question: How do I get rid of mice humanely?
Potential Answer Types: Use Traps, Natural Predators.

Two levels of data have been provided, which can be used for evaluation:
(1) NFPassageQA_Sim : Passage Similarity
(2) NFPassageQA_Div : Passage Diversity

NFPassageQA_Sim dataset

This dataset consists of similarity annotations between pairs of relevant answers for a question. To generate this set, 128 questions with at least 10 relevant answers were sampled from the test set of the ANTIQUE dataset [2]. The pairs of relevant answers and the corresponding question were then shown to annotators on the Mechanical Turk platform, who selected one of four similarity labels:

4 : Highly Similar, where both passages answer the question and the answers contain the same information, even if they may be worded differently.
3 : Moderately Similar, where both passages answer the question and belong to the same answer type, but may also contain non-relevant information or information from other answer types.
2 : Dissimilar, where both passages answer the question, but the answers belong to different answer types.
1 : At least one of the passages does not answer the question.

For more details and examples of the annotation process, please refer to the paper [1].

Two files are included, which are tab separated and in the format given below (a minimal loading sketch follows the list):
(a) Query file (queries_sim.txt), Format :
(b) Label file (labels_sim.qrel), Format :
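As a rough illustration only, the sketch below loads the two files. It assumes, hypothetically, that queries_sim.txt holds a query ID followed by the question text, and that labels_sim.qrel follows a qrel-style layout whose first column is the query ID and whose last column is the similarity label (1-4); adjust the column indices to match the actual format lines above.

```python
import csv
from collections import defaultdict

def load_queries(path="queries_sim.txt"):
    """Load tab-separated queries; assumed layout: query_id <TAB> question text."""
    queries = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                queries[row[0]] = row[1]
    return queries

def load_labels(path="labels_sim.qrel"):
    """Load tab-separated similarity judgments; assumed layout: first column is the
    query ID, last column is the similarity label (1-4), middle columns identify
    the passage pair."""
    labels = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                qid, label = row[0], int(row[-1])
                labels[qid].append((tuple(row[1:-1]), label))
    return labels

if __name__ == "__main__":
    queries = load_queries()
    labels = load_labels()
    print(f"{len(queries)} queries, "
          f"{sum(len(v) for v in labels.values())} labeled passage pairs")
```

This is only a starting point for inspection; once the actual format of the two files is confirmed, the column indices and types in the two loaders should be updated accordingly.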