README
Created: 28-Jul-2013


DEFINITIONS

    Session: 
        As defined for the TREC Session Track: all queries issued for a given
        topic, even if there are sub-topics.

    Task:
        In the case of our data, tasks are synonymous with sessions. 

    On-task:
        A query is considered on-task within a context if it shares the same
        task as the reference query in that context.

    Off-task:
        If a query is not on-task, then it is off-task.

    Reference query:
        In a context [q1, q2, ..., qn], where q1 was the least recent query
        submitted, we refer to the last query (most recently submitted), qn, as
        the reference query.

    Context types:
        on-task context: 
            All queries in the context are part of the same task.

        off-task context:
            None of the other context queries is part of the same task as the
            reference query. Off-task queries were randomly sampled from other
            TREC sessions, making sure to ignore sessions with the same topic.

        mixed-task context:
            Some of the queries in the context are part of the same task as
            the reference query; others are not (as few as 0 and as many as 10
            in the case of our data). See the paper [1] for details about how
            off-task queries were sampled.
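
The context types above can be recovered from the per-query `onTask` flags
described in the DATA FORMAT section. The following is a minimal sketch, not
part of the released data or code: the function name `classify_context` is
hypothetical, and it assumes the last entry of the context is the reference
query and that a context holding only the reference query counts as on-task.

```python
def classify_context(context):
    """Return 'on-task', 'off-task', or 'mixed-task' for a list of query dicts.

    Assumption: `context` is ordered by submission time, so the last entry is
    the reference query and only the earlier entries are judged against it.
    """
    earlier = context[:-1]
    flags = [entry["onTask"] for entry in earlier]
    if all(flags):          # also true when there are no earlier queries
        return "on-task"
    if not any(flags):
        return "off-task"
    return "mixed-task"
```

A context with some earlier queries flagged `true` and others flagged `false`
falls through to the mixed-task case.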


FILES

Each file consists of many contexts. One TREC session is associated with 
multiple contexts in each file. For example, in `on-task-contexts.json.gz`,
a session with five (5) queries will have five (5) corresponding contexts: one
with just the last query, one with the last two (2) queries, one with the last
three (3), one with the last four (4), and one with all five (5) queries.
Contexts that share the same reference query will have the same `sessionID`
(see the DATA FORMAT section below). 
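
As an illustration of the on-task example above, each context is a suffix of
the session that ends at the reference query. This is a small sketch under
that reading; the function name `on_task_contexts` is hypothetical and is not
the code used to build the files.

```python
def on_task_contexts(session_queries):
    """Return every suffix of the session, shortest first.

    A session [q1, ..., qn] yields n contexts: [qn], [qn-1, qn], ...,
    [q1, ..., qn]. Every context ends with the same reference query, qn.
    """
    n = len(session_queries)
    return [session_queries[n - k:] for k in range(1, n + 1)]
```

For a session of five queries this returns five lists, of lengths 1 through 5.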

The JSON object represented on each line has many parts---see the DATA FORMAT
section for a complete listing. The main parts are: 

    * information about the context, e.g., its session ID

    * an array of context queries, listing the query, whether or not it is on 
      task, and our prediction that it is on task

    * a list of recommendations under different models given the context. Each
      recommendation object consists of details about the model and the top five
      (5) scoring recommendations using that model

The files present are:

    mixed-contexts.json.gz:
        Each context consists of the original session and between 1 and 10
        noisy queries. There are 50 randomizations for each noisy query level.

    off-task-contexts.json.gz:
        Each context consists of the reference query from the original session.
        However, all of the other queries in the context are sampled from other
        off-task sessions. The number of other queries ranges from 1 to 
        |session|-1.

    on-task-contexts.json.gz:
        Each context consists of the last n queries from the corresponding TREC
        session, where n ranges from 1 to |session|.


DATA FORMAT

Each line of the *.json.gz files, once decompressed, is a JSON object with this
format:

{
    "sessionID": <id>,
    "trecSessionYear": <year>,
    "trecSessionID": <session id>,
    "trecSessionTopicID": <topic id>,
    "contextLength": <length>,
    "noisyQueryCount": <count>,
    "context": [
        {
            "query": <query>,
            "onTask": [true|false],
            "onTaskPrediction": <prediction score>
        }
    ],
    "recommendations": [
        {
            "beta": <beta>,
            "lambda": <lambda>,
            "model": ["decay"|"hardTask"|"firmTask1"|"firmTask2"|"softTask"],
            "queries": [
                {
                    "query": <query>,
                    "score": <score>
                },
                ...
            ]
        },
        ...
    ]
}

The context array is sorted by submission order (i.e., the first one was the
first submitted); as such, the last one is the reference query. The
recommendations[*]["queries"] array is sorted in non-ascending order of score.
Only the top five (5) recommendations are given.
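
One way to read the files is line by line with Python's standard `gzip` and
`json` modules. The sketch below is illustrative only: the helper names
(`read_contexts`, `reference_query`, `recommendations_for`) are hypothetical,
and it assumes the files are UTF-8 encoded with one JSON object per line.

```python
import gzip
import json


def read_contexts(path):
    """Yield one context object (a dict) per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def reference_query(record):
    """Return the text of the reference query (the last context entry)."""
    return record["context"][-1]["query"]


def recommendations_for(record, model):
    """Return one list of (query, score) pairs per entry matching `model`.

    Several entries may share a model name (for example, with different
    beta or lambda values), so a list of lists is returned.
    """
    return [
        [(item["query"], item["score"]) for item in entry["queries"]]
        for entry in record["recommendations"]
        if entry["model"] == model
    ]
```

For example, `for record in read_contexts("on-task-contexts.json.gz"):` visits
each context in turn, and `recommendations_for(record, "decay")` returns the
ranked lists for the decay model of that context.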

To use the TREC qrels, consider the `trecSessionYear` and `trecSessionTopicID`
fields.
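
As a rough sketch of that lookup, contexts can be grouped by the year and topic
ID before matching them against the qrels for that year. The helper names
`qrels_key` and `group_by_topic` are hypothetical, and the layout of the qrels
files themselves is not covered here.

```python
def qrels_key(record):
    """Return the (year, topic ID) pair that identifies the relevant qrels."""
    return (record["trecSessionYear"], record["trecSessionTopicID"])


def group_by_topic(records):
    """Group context objects by their (year, topic ID) key."""
    groups = {}
    for record in records:
        groups.setdefault(qrels_key(record), []).append(record)
    return groups
```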


[1] Task-Aware Query Recommendation (Feild & Allan, SIGIR 2013); URL: 
http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1091