nfL6: Yahoo Non-Factoid Question Dataset

The dataset is a JSON file containing 87,361 questions and their corresponding
answers, selected from Yahoo's Webscope L6 collection using machine learning techniques
so that the selected questions have non-factoid answers.

Each question is accompanied by its best answer and by the other answers
submitted by users.  Only the best answer was examined when determining the quality
of the question-answer pair.

Each component of a question can be accessed by its key. For example, the fields
of one question dictionary inside the JSON file may be accessed as follows:

  obj['question']                The question string
  obj['answer']                  The highest voted answer string
  obj['nbestanswers']            A list of strings representing other submitted answers 
                                 for that question
  obj['main_category']           A string giving the main category of the submitted question
  obj['id']                      A unique Yahoo ID string for the question


An example of a script parsing this collection in Python is shown below:

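  import json

  questions = []
  # Load the whole collection: a list of question dictionaries.
  with open('nfL6.json', 'r') as f:
      mydata = json.load(f)

  # Collect the question string from each record.
  for q_a in mydata:
      questions.append(q_a['question'])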
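
The other keys listed above can be read in the same way. The sketch below is
only an illustration: the helper name summarize and the sample record are
made up for this example and are not part of the dataset. It assumes each
record is a dictionary with the keys described above, and it counts questions
per main category and the number of additional answers.

```python
from collections import Counter


def summarize(records):
    """Count questions per main category and total the additional answers."""
    categories = Counter()
    total_answers = 0
    for rec in records:
        categories[rec['main_category']] += 1
        # 'nbestanswers' is described above as a list of answer strings;
        # treat a missing key as an empty list.
        total_answers += len(rec.get('nbestanswers', []))
    return categories, total_answers


# Hypothetical record, invented for illustration; not taken from nfL6.json.
sample = [{
    'id': '0000000',
    'question': 'Why is the sky blue?',
    'answer': 'Air scatters blue light more strongly than red light.',
    'nbestanswers': ['Because of the atmosphere.',
                     'It reflects the ocean.'],
    'main_category': 'Science & Mathematics',
}]

categories, total_answers = summarize(sample)
print(categories.most_common(5), total_answers)
```

To run this on the real collection, pass the list returned by json.load in
place of sample.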


We would like to thank Yahoo for collecting and distributing Webscope L6.