nfL6: Yahoo Non-Factoid Question Dataset


 


Background


This dataset is derived from Yahoo's Webscope L6 collection, using machine learning techniques to select questions whose answers are non-factoid.

The dataset contains 87,361 questions and their corresponding answers. Each question comes with its best answer and any other answers submitted by users. Only the best answer was reviewed when determining the quality of a question-answer pair.


Each question is stored as a dictionary. Each dictionary has the following fields:

question
The string representing the question.

answer
A string representing the best answer to the question.

nbestanswers
One or more strings representing other submitted answers to the question.

main_category
A string representing the Yahoo category for the submitted question.

id
The unique Yahoo ID string for the question.

The parts of a question are accessed by using the field names as keys. For example, the data in one question dictionary inside the JSON file may be accessed as follows:

          obj['question']         The question string
          obj['answer']           The best answer string
          obj['nbestanswers']     A list of strings representing other submitted answers for the question
          obj['main_category']    A string representing the Yahoo category for the submitted question
          obj['id']               The unique Yahoo ID string for the question
        

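An example Python script that parses this collection is shown below:

          import json

          questions = []
          # Load the whole collection: a list of question dictionaries
          with open('nfL6.json', 'r') as f:
              mydata = json.load(f)

          # Collect the question string from each record
          for q_a in mydata:
              questions.append(q_a['question'])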
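
The other fields can be read in the same way. The following is a minimal sketch, not part of the original distribution: the helper names count_by_category and all_answers are hypothetical, and the code assumes that every record carries the fields listed above. Because it is not stated whether nbestanswers may repeat the best answer, the second helper skips duplicates.

```python
from collections import Counter


def count_by_category(records):
    """Count question records per Yahoo main category (hypothetical helper)."""
    return Counter(rec['main_category'] for rec in records)


def all_answers(record):
    """Return the best answer followed by the other submitted answers.

    Duplicates are skipped, in case 'nbestanswers' repeats the best answer
    (an assumption; the description above does not say either way).
    """
    answers = [record['answer']]
    for ans in record['nbestanswers']:
        if ans not in answers:
            answers.append(ans)
    return answers
```

With mydata loaded as in the script above, count_by_category(mydata).most_common(5) would list the five most frequent categories.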
        

Email Downloads for questions or comments concerning the dataset or this web page.



Dataset


This collection consists of a README.txt file describing the dataset and the dataset itself, available either as a rar archive (nfL6.json.rar) or as a gzip'ed file (nfL6.json.gz).


Download


Uncompress the rar archive using 7-Zip or rar/unrar on Windows machines. Both of these utilities may also be installed on Unix machines.
     rar x nfL6.json.rar
     7z x nfL6.json.rar

On Unix machines, uncompress the gzip'ed file using gunzip:
     gunzip nfL6.json.gz
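
As an alternative to decompressing on disk, the gzip'ed file can be read directly from Python with the standard gzip module. This is a minimal sketch under two assumptions: the compressed file holds the same JSON list as nfL6.json, and its text is UTF-8 encoded. The function name load_gzipped_json is hypothetical.

```python
import gzip
import json


def load_gzipped_json(path):
    """Parse a gzip-compressed JSON file without writing an uncompressed copy.

    The UTF-8 encoding is an assumption, not something the dataset page states.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        return json.load(handle)


# Example call (requires the downloaded file to be present):
# mydata = load_gzipped_json('nfL6.json.gz')
```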


File Name                      Compressed Size    Uncompressed Size
README.txt                     ---                1.2K
Gzip'ed nfL6 JSON data file    48M                149M
Rar nfL6 JSON data file        39M                149M


Acknowledgements


We would like to thank Yahoo for collecting and distributing Webscope L6.