README
Created: 28-Jul-2013


DEFINITIONS

Session:
    As defined for the TREC Session Track: all queries issued for a given
    topic, even if there are sub-topics.

Task:
    In our data, tasks are synonymous with sessions.

On-task:
    A query is on-task within a context if it shares the same task as the
    reference query in that context.

Off-task:
    A query that is not on-task is off-task.

Reference query:
    In a context [q1, q2, ..., qn], where q1 is the least recently
    submitted query, the reference query is the last (most recently
    submitted) query, qn.

Context types:
    on-task context:
        All queries in the context are part of the same task.

    off-task context:
        None of the context queries is part of the same task as the
        reference query. Off-task queries were randomly sampled from other
        TREC sessions, ignoring sessions with the same topic.

    mixed-task context:
        Some of the queries in the context are part of the same task as
        the reference query; others are not (as few as 0 and as many as 10
        in the case of our data).

    See the paper [1] for details about how off-task queries were sampled.


FILES

Each file consists of many contexts. One TREC session is associated with
multiple contexts in each file. For example, in `on-task-contexts.json.gz`,
a session with five (5) queries will have five (5) corresponding contexts:
one with just the last query, one with the last two (2) queries, one with
the last three (3), one with the last four (4), and one with all five (5)
queries. Contexts that share the same reference query will have the same
`sessionID` (see the DATA FORMAT section below).

The JSON object represented on each line has many parts; see the DATA FORMAT
section for a complete listing.
The main parts are:

  * information about the context, e.g., its session ID
  * an array of context queries, listing each query, whether or not it is
    on task, and our prediction that it is on task
  * a list of recommendations under different models given the context;
    each recommendation object consists of details about the model and the
    top five (5) scoring recommendations using that model

The files present are:

mixed-contexts.json.gz:
    Each context consists of the original session and between 1 and 10
    noisy queries. There are 50 randomizations for each noisy query level.

off-task-contexts.json.gz:
    Each context consists of the reference query from the original session;
    all of the other queries in the context are sampled from other,
    off-task sessions. The number of other queries ranges from 1 to
    |session|-1.

on-task-contexts.json.gz:
    Each context consists of the last n queries from the corresponding TREC
    session, where n ranges from 1 to |session|.


DATA FORMAT

Each line of the (decompressed) *.json files is a JSON string with this
format:

    {
        "sessionID": ,
        "trecSessionYear": ,
        "trecSessionID": ,
        "trecSessionTopicID": ,
        "contextLength": ,
        "noisyQueryCount": ,
        "context": [
            {
                "query": ,
                "onTask": [true|false],
                "onTaskPrediction":
            }
        ],
        "recommendations": [
            {
                "beta": ,
                "lambda": ,
                "model": ["decay"|"hardTask"|"firmTask1"|"firmTask2"|"softTask"],
                "queries": [
                    {
                        "query": ,
                        "score":
                    },
                    ...
                ]
            },
            ...
        ]
    }

The context array is sorted by submission order (i.e., the first entry was
the first query submitted); as such, the last entry is the reference query.

The recommendations[*]["queries"] array is sorted in non-ascending order of
score. Only the top five (5) recommendations are given.

To use the TREC qrels, consider the `trecSessionYear` and
`trecSessionTopicID` fields.


[1] Task-Aware Query Recommendation (Feild & Allan, SIGIR 2013);
    URL: http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1091
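
EXAMPLE

The one-JSON-object-per-line layout described under DATA FORMAT can be read
with the Python standard library alone. The following is a minimal sketch,
not part of the original distribution. The helper names (`read_contexts`,
`reference_query`, `recommended_queries`) are hypothetical, and UTF-8
encoding is an assumption; only the field names are taken from the format
above.

```python
import gzip
import json


def read_contexts(path):
    """Yield one context record (a dict) per non-empty line of a gzipped file."""
    # Assumption: the files are UTF-8 encoded text, one JSON object per line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def reference_query(record):
    """Return the text of the reference query, i.e. the last context entry."""
    return record["context"][-1]["query"]


def recommended_queries(record, model):
    """Return one list of recommended query strings per entry for `model`.

    A model name may appear in several recommendation entries (for example
    with different beta/lambda settings), so a list of lists is returned.
    """
    return [
        [rec["query"] for rec in entry["queries"]]
        for entry in record["recommendations"]
        if entry["model"] == model
    ]
```

A typical use would be to loop over `read_contexts("on-task-contexts.json.gz")`
and call `reference_query(record)` and `recommended_queries(record, "decay")`
on each record. Records are yielded one at a time, so a file never has to be
held in memory as a whole.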