Million Query (1MQ) Track Guidelines
TREC 2007

V1.1, June 15.  Indicate that 5 runs can be submitted, that 1000 documents are expected, and what to do if you have no documents for a query.
V1.0, May 31.  Finalized guidelines based on email discussion.  In particular updated description of document selection strategy and mentioned "multiple judging" issue.
V0.4, May 29.  Added remaining links.  Proposed change to document selection strategy.
V0.3, May 22.  Added link to submission page for runs.
V0.2, May 2.  Many more details in place.
V0.1, April 3

See the track's web site at http://ciir.cs.umass.edu/research/million for up-to-date information.

The million query track serves two purposes.  First, it is an exploration of ad-hoc retrieval on a large collection of documents.  Second, it investigates questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments.

Participants in this track will have two tasks: (1) run 10,000 queries against a 426GB collection of documents at least once and (2) judge documents for relevance with respect to some number of queries.  The first part of the task will be completed by mid-June.  The second part will run over the course of the summer.  Groups that run queries are obligated to perform judgments.  Individuals or groups are welcome to help judge queries regardless of whether they submitted runs.

Important dates

How to participate

To participate in the Million Query track, you must apply to do so with NIST.  (The official deadline is in mid-February, but late applications are sometimes accepted.)  Note that applying is not an onerous process and it is unusual not to be accepted.  However, you must apply if you wish to participate, and you must participate (in some track) if you wish to attend the TREC meeting in November.

Participants in this track will be expected to run a large number of queries against a collection of documents, to return a ranked list of documents for judging, and to help with the process of judging documents.  The track will include a small trial run in the spring followed by the full, official evaluation in late June (specific dates are above).

Corpus

This track will use the so-called "terabyte" or "GOV2" collection of documents.  This corpus is a collection of Web data crawled from Web sites in the .gov domain in early 2004.  The collection is believed to include a large proportion of the .gov pages that were crawlable at that time, including HTML and text, plus the extracted text of PDF, Word, and PostScript files.  Any document longer than 256KB was truncated to that size.  Binary files are not included as part of the collection, though they were captured separately for use in judging.

The GOV2 collection includes 25 million documents in 426 gigabytes.  The collection is available from the University of Glasgow, distributed on a hard disk that will be shipped to you (you get to keep the disk).  Glasgow charges £600 (just shy of US$1200 in early May 2007) to cover the cost of preparing and shipping the data.  Assuming no unusual problems, turn-around time is fairly quick, but do not wait until a few days before the deadlines.  We know of one situation where it took almost 10 weeks for the data to arrive (it is not clear why).  Full details on how to acquire the GOV2 collection are available here.

Queries

Topics for this task will be drawn from a large collection of queries that were collected by a large Internet search engine.  Each of those queries is likely to have at least one relevant document in the GOV2 collection because logs showed a clickthrough on one page captured by GOV2.  Obviously there is no guarantee that the clicked page is relevant, but it increases the chance of the query being appropriate for the collection.

These topics are short, title-length (in TREC parlance) queries.  In the judging phase, they will be developed into full-blown TREC topics.

A small number of queries (100) have been selected for the trial run.  They are the first 100 queries from the 2006 Terabyte Track's efficiency run.  

Ten thousand (10,000) queries will be selected for the official run.  The 10,000 queries will include some queries that were judged in the context of the Terabyte Track from earlier years.

Note that no quality control will be imposed on the 10,000 selected queries.  We hope that most of them are good quality queries, but some are likely to be partially or entirely non-English, some may contain spelling errors, and some may be incomprehensible to anyone other than the person who originally created them.

The queries will be distributed in a text file where each line has the format "N:query word or words".  Here, N is the query number; it is followed by a colon, which is immediately followed by the query itself.  For example, the line (from the training file) "32:barack obama internships" means that query number 32 is the 3-word query "barack obama internships".  All queries are in lowercase and have no punctuation (it is not clear whether that is a result of processing or because people use lowercase and do not use punctuation).
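
As an illustration of the line format described above, here is a minimal Python sketch for reading the query file.  The function names parse_query_line and load_queries are hypothetical and are not part of any track software.

```python
def parse_query_line(line):
    """Split one 'N:query words' line into (query_number, query_text)."""
    # Split on the first colon only; everything after it is the query.
    number, _, text = line.rstrip("\n").partition(":")
    return int(number), text


def load_queries(lines):
    """Parse an iterable of lines into {query_number: query_text}."""
    # Blank lines are skipped (an assumption; the file may not contain any).
    return dict(parse_query_line(line) for line in lines if line.strip())
```

For example, parse_query_line("32:barack obama internships") yields the pair (32, "barack obama internships").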

Evaluation measures

The primary measure will be mean average precision (MAP) from a set of ad-hoc topics that were more deeply judged in past TREC tracks.  In some sense, this measure will represent "truth" even though it is unlikely that those judgments are complete.  
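
For readers unfamiliar with the measure, the following Python sketch shows the usual way average precision and MAP are computed from a ranked list and a set of judged-relevant documents.  It is an illustration only, not the official evaluation code.  It assumes binary relevance (that "highly relevant" and "relevant" are both treated as relevant), and the function and parameter names are hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list, given the relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    # Unretrieved relevant documents contribute zero to the sum.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic AP over every topic that has judgments.

    runs:  {topic: [docno, ...]} in rank order
    qrels: {topic: set of relevant docnos}
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(topic, []), relevant)
                for topic, relevant in qrels.items())
    return total / len(qrels)
```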

A core issue in the track is determining whether system effectiveness can be compared using shallow judgments on many topics.  Measures such as bpref, inferred AP, estimated AP, and so on, will be explored as part of the track's evaluation.  (Most of these measures are discussed in papers describing the sampling methods.  Links to those papers are below.)

Relevance judgments and judging

Judging will be done by assessors at NIST and by participants in the track.  Non-participants are welcome (encouraged!) to provide judgments, too, though the track cannot count on that.

The process will look roughly like this from the perspective of someone judging:
  1. The assessment system will present 5-10 queries randomly selected from the evaluation set.  (That is, from the 10,000 queries in the official evaluation or from the smaller set otherwise.)
  2. The assessor will select one of those queries to judge.  The others will be returned to the pool.
  3. The assessor will provide the description and narrative parts of the query, creating a full TREC topic.  This information will be used by the assessor to keep focus on what is relevant.
  4. The system will present a GOV2 document (Web page) and ask whether it is relevant to the query.  Judgments will be on a three-way scale to mimic the Terabyte Track: highly relevant, relevant, or not relevant.  Consistent with past practice, the distinction between the first two will be up to the assessor.
  5. Documents will be presented until "enough" have been judged (see the next section).  At that point the assessor will have the option to stop, but may continue if he or she would like. 
The system for carrying out these judgments is being built at UMass and will be tested in the trial run.

Selection of documents for judging

Two approaches to selecting documents will be used in TREC 2007:
  1. Expected AP method.  In this method, documents are selected by how much they inform us about the difference in mean average precision given all the judgments that were made up to that point.  Because average precision is quadratic in relevance judgments, the amount each relevant document contributes is a function of the total number of judgments made and the ranks they appear at.  Nonrelevant documents also contribute to our knowledge: if a document is nonrelevant, it tells us that certain terms cannot contribute anything to average precision.  We quantify how much a document will contribute if it turns out to be relevant or nonrelevant, then select the one that we expect to contribute the most.  A SIGIR 2006 paper by Carterette, Allan, and Sitaraman describing this approach is available in the ACM Digital Library or here.
  2. Statistical evaluation method.  This method draws and judges a specific random sample of documents from the given ranked lists and produces unbiased, low-variance estimates of average precision, R-precision, and precision at standard cutoffs from these judged documents.  Additional (non-random) judged documents may also be included in the estimation process, further improving the quality of the estimates.  For more details, here is a working draft of a paper by Aslam, Pavlu, and Yilmaz, describing this approach (note that it is a working draft, so may change periodically).  
For each query, one of the following will happen:
  1. The pages to be judged for the query will be selected by the "expected AP method."  A minimum of 40 documents will be judged, though the assessor may continue beyond 40 if so motivated.
  2. The pages to be judged for the query will be selected by the "statistical evaluation method."  A minimum of 40 documents will be judged, though the assessor may continue beyond 40 if so motivated.
  3. The pages to be judged will be selected by alternating between the two methods until each has selected 20 pages.  If a page is selected by more than one method, it will be presented for judgment only once.  If the lists overlap, the process will continue until at least 40 pages have been judged, though the assessor may continue beyond 40 if so motivated.
The assignments will be made such that option (3) is selected half the time and the other two options each occur 1/4 of the time.  When completed, half of the queries will therefore have parallel judgments of 20 or more pages by each method, and the other half will have 40 or more judgments by a single method.
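
The alternation in option (3) can be sketched in Python as follows.  This is a simplification and not the judging system's implementation: it assumes each method has already produced a fixed, priority-ordered list of candidate pages, whereas the actual methods may choose their next page adaptively.  The function name alternate_selection and its parameters are hypothetical.

```python
from itertools import zip_longest


def alternate_selection(list_a, list_b, target=40):
    """Alternate between two candidate lists, presenting each page once.

    Stops when `target` distinct pages have been chosen or when both
    lists are exhausted.
    """
    chosen = []
    seen = set()
    for doc_a, doc_b in zip_longest(list_a, list_b):
        for doc in (doc_a, doc_b):
            # None marks an exhausted list; a seen page was already
            # selected by the other method and is not presented again.
            if doc is None or doc in seen:
                continue
            seen.add(doc)
            chosen.append(doc)
            if len(chosen) == target:
                return chosen
    return chosen
```

Because a page selected by both methods is presented only once, the loop keeps drawing from both lists until the target number of distinct pages is reached.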

In addition, a small pool of 50 queries will be randomly selected for multiple judging.  With a small random chance, the assessor's 5-10 queries will be drawn from that pool rather than the full pool.  Whereas in the full pool no query will be considered by more than one person, in the multiple judging pool, a query can be considered by any or even all assessors--though no assessor will be shown the same query more than once.

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, or adjusting the ranked list by hand.   With 10,000 queries, it is unlikely that there will be many manual runs.
  2. A system may not be modified in light of the set of queries sent.  You should not look at the evaluation queries before you are ready to run your system.

Trial run protocol

  1. Download the 100 trial run queries.  They are here as a text file.   (Also available at NIST.)
  2. Run those queries to generate 100 ranked lists and put them in NIST standard format.  You can verify the format of your data using a script provided by NIST.
  3. Upload the ranked lists to NIST (you need the TREC password).  
  4. Register with the judging system.
  5. When prompted, use the judging system to mark retrieved documents as relevant or not.

Official run protocol

  1. Download the 10,000 queries from the TREC active participants web site (you need the TREC password).
  2. Run those queries to generate 10,000 ranked lists and put them in NIST standard format.  You can verify the format of your data using a script provided by NIST.  (Use the script: NIST will reject a submission that doesn't pass the script, and you'll end up doing the upload again.)
  3. Upload the ranked lists to NIST (you need the TREC password).  
  4. Register with the judging system.
  5. When prompted, use the judging system to mark retrieved documents as relevant or not.
You may submit up to five runs for judging, ranked in order of preference that they be included.  If possible, all runs will be included in the selection process, but if they cannot all be included, the higher ranked runs will be selected first.  Each run should have 1,000 documents ranked for each query, though it may have fewer.  If you are returning zero documents for a query, instead return the single document "GX000-00-0000000".

Submission format

The standard NIST format for submissions is a single ASCII text file.  White space is used to separate columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.  For example,

100 Q0 ZF08-175-870 1 9876 mysys1
100 Q0 ZF08-306-044 2 9875 mysys2

The contents of the columns are:
  1. The first column is the topic number.
  2. The second column is unused but must always be the string "Q0" (letter Q, number zero).
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 1000 should appear exactly once.  (If you retrieve fewer than 1000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns).
Your submission should be sorted by rank within topic number.
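
As a rough illustration of the six-column format, the Python sketch below produces the lines for one topic from (document number, score) pairs.  It is not a substitute for the checking script provided by NIST.  The function name format_run and its parameters are hypothetical, and the run tag check reflects only the rule stated above.

```python
def format_run(topic, scored_docs, run_tag, max_docs=1000):
    """Return submission lines for one topic, best score first.

    scored_docs: iterable of (docno, score) pairs in any order.
    """
    # Run tags: 12 or fewer letters and digits, no punctuation or spaces.
    if not (run_tag.isascii() and run_tag.isalnum() and len(run_tag) <= 12):
        raise ValueError(f"invalid run tag: {run_tag!r}")
    # Sort by descending score so rank order and score order agree.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic} Q0 {docno} {rank} {score} {run_tag}"
        for rank, (docno, score) in enumerate(ranked[:max_docs], start=1)
    ]
```

Calling this once per topic, in increasing topic order, gives a file sorted by rank within topic number.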

If you would normally return no documents for a query, instead return the single document "GX000-00-0000000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.
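
One way to apply this rule when generating a run is sketched below in Python.  The helper name with_placeholder is hypothetical, and the score of 0.0 given to the placeholder is an assumption; the guidelines do not specify a score for it.

```python
PLACEHOLDER_DOCNO = "GX000-00-0000000"


def with_placeholder(scored_docs):
    """Return the results unchanged, or one placeholder entry if empty.

    scored_docs: list of (docno, score) pairs for a single query.
    """
    if scored_docs:
        return list(scored_docs)
    # Assumed score of 0.0; the placeholder will appear at rank one.
    return [(PLACEHOLDER_DOCNO, 0.0)]
```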

For the Million Query Track, the submission file will be quite large (up to 10,000,000 lines of text).  All submission files must therefore be compressed before being uploaded.  Use either gzip or bzip2 for the compression.
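
The command-line gzip and bzip2 tools are the obvious way to do this.  If your pipeline is in Python, a sketch using the standard library's gzip module might look like the following.  The function name compress_run and the ".gz" suffix default are assumptions for illustration.

```python
import gzip
import shutil


def compress_run(src_path, dest_path=None):
    """Write a gzip-compressed copy of src_path and return its path."""
    dest_path = dest_path or src_path + ".gz"
    # Stream the copy so a multi-million-line file is never held in memory.
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```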

Plan on your uploads taking several hours.  Some tips from experienced submitters:
  1. Run the check_mq.pl program before you try to upload a large file.  NIST will reject your submission if it does not pass the checking program, so you should try it yourself first.
  2. Make sure you fill out the submission page information completely and in the correct form.  NIST will upload your run before validating the information in the form.  If you made a mistake, you will have to upload the data again.