Million Query (1MQ) Track Guidelines
TREC 2007
V1.1, June 15. Indicate that 5 runs can be submitted, that 1000
documents are expected, and what to do if you have no documents for a
query.
V1.0, May 31. Finalized guidelines based on email discussion.
In particular updated description of document selection strategy
and mentioned "multiple judging" issue.
V0.4, May 29. Added remaining links. Proposed change to document selection strategy.
V0.3, May 22. Added link to submission page for runs.
V0.2, May 2. Many more details in place.
V0.1, April 3
See the track's web site at http://ciir.cs.umass.edu/research/million for up-to-date information.
The million query track serves two purposes. First, it is an
exploration of ad-hoc retrieval on a large collection of documents.
Second, it investigates questions of system evaluation,
particularly whether it is better to evaluate using many shallow
judgments or fewer thorough judgments.
Participants in this track will have two tasks: (1) run 10,000 queries
against a 426GB collection of documents at least once and (2) judge
documents for relevance with respect to some number of queries.
The first part of the task will be completed by mid-June. The second part will run over the course of the
summer. Groups that run queries are obligated to perform
judgments. Individuals or groups are welcome to help judge
queries regardless of whether they submitted runs.
Important dates
- May 25, trial run submissions due
- May 31, TREC-imposed deadline for guidelines to be finalized
- June 19, official run submissions due
- October 1, relevance judgments will be released by this date
at the latest (it is more likely that it will be in late August or
early September)
- mid-October, notebook papers will be due
- November 6-9, 2007, TREC 2007 in Gaithersburg, Maryland (you must participate in some track to attend).
How to participate
To participate in the Million Query track, you must apply to do so with NIST.
(The official deadline is in mid February, but late applications
are sometimes accepted.) Note that applying is not an onerous
process and it is unusual not to be accepted. However, you must
apply if you wish to participate, and you must participate (in some track) if you wish
to attend the TREC meeting in November.
Participants in this track will be expected to run a large number of
queries against a collection of documents, to return a ranked list of
documents for judging, and to help with the process of judging
documents. The track will include a small trial run in the spring
followed by the full, official evaluation in late June (specific dates
are above).
Corpus
This track will use the so-called "terabyte" or "GOV2" collection of
documents. This corpus is a collection of Web data crawled from
Web sites in the .gov domain in early 2004. The collection
is believed to include a large proportion of the .gov pages that were
crawlable at that time, including HTML and text, plus the extracted
text of PDF, Word, and PostScript files. Any document longer than
256KB was truncated to that size. Binary files are not included
as part of the collection, though they were captured separately for use in judging.
The GOV2 collection includes 25 million documents in 426 gigabytes.
The collection is available from the University of Glasgow,
distributed on a hard disk that will be shipped to you (you get to keep
the disk). Glasgow charges £600 (just shy of US$1200 in
early May 2007) to cover the cost of preparing and shipping the
data. Assuming no unusual problems, turn-around time is fairly
quick, but do not wait until a few days before the deadlines.
We know of one situation where it took almost 10 weeks for the
data to arrive (it is not clear why). Full details on how to
acquire the GOV2 collection are available here.
Queries
Topics for this task will be drawn from a large collection of queries
that were collected by a large Internet search engine. Each of those queries is likely to
have at least one relevant document in the GOV2 collection because logs
showed a clickthrough on one page captured by GOV2. Obviously
there is no guarantee that the clicked page is relevant, but it
increases the chance of the query being appropriate for the collection.
These topics are short, title-length (in TREC parlance) queries.
In the judging phase, they will be developed into full-blown TREC
topics.
A small number of queries (100) have been selected for the trial run.
They are the first 100 queries from the 2006 Terabyte Track's
efficiency run.
Ten thousand (10,000) queries will be selected for the official run.
The 10,000 queries will include some queries that were judged in
the context of the Terabyte Track from earlier years.
Note that no quality control will be imposed on the 10,000 selected
queries. We hope that most of them are good quality queries, but
some are likely to be partially or entirely non-English, some may
contain spelling errors, and some may be incomprehensible to anyone other
than the person who originally wrote them.
The queries will be distributed in a text file where each line has the format "N:query word or words". Here, N
is the query number, which is followed by a colon and then immediately
by the query itself. For example, the line (from the training
file) "32:barack obama internships" means that query number 32 is the
3-word query "barack obama internships". All
queries are in lowercase and have no punctuation (it is not clear
whether that is a result of processing or because people use lowercase
and do not use punctuation).
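As an illustration of the line format described above, here is a minimal sketch of a parser. The function names are hypothetical and are not part of any track tooling. It splits on the first colon only, on the assumption that anything after that colon belongs to the query.

```python
def parse_query_line(line):
    """Split one 'N:query text' line into (query_number, query_text)."""
    # partition() splits at the first colon only, so any later colon
    # would stay inside the query text.
    number, _, text = line.rstrip("\n").partition(":")
    return int(number), text


def parse_queries(lines):
    """Parse an iterable of query lines, skipping blank ones."""
    return [parse_query_line(line) for line in lines if line.strip()]


# Example using the line quoted from the training file.
queries = parse_queries(["32:barack obama internships\n"])
```

The same function can be applied to an open file object, since iterating over a file yields one line at a time.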
Evaluation measures
The primary measure will be mean average
precision (MAP) from a set of ad-hoc topics that were more deeply
judged in past TREC tracks. In some sense, this measure will
represent "truth" even though it is unlikely that those judgments are
complete.
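For readers unfamiliar with the measure, the following sketch computes average precision for one topic and MAP over a set of topics using the usual textbook definition. It is not the track's evaluation code; the function names and the dictionary-based inputs are assumptions made for illustration, and official tools may differ in details such as how topics with no relevant documents are handled.

```python
def average_precision(ranked_docs, relevant):
    """Average precision for one topic.

    ranked_docs: list of document numbers in rank order.
    relevant: set of document numbers judged relevant.
    """
    if not relevant:
        return 0.0  # assumption: a topic with no relevant documents scores 0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of average precision over every topic that has judgments.

    runs: dict mapping topic -> ranked list of document numbers.
    qrels: dict mapping topic -> set of relevant document numbers.
    """
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in qrels)
    return total / len(qrels)
```

Averaging over the topics in the judgments, rather than over the topics in the run, means that a run returning nothing for a judged topic is charged a zero for it.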
A core issue in the track is determining whether system effectiveness
can be compared using shallow judgments on many topics. Measures
such as bpref, inferred AP, estimated AP, and so on, will be explored
as part of the track's evaluation. (Most of these measures are
discussed in papers describing the sampling methods. Links to
those papers are below.)
Relevance judgments and judging
Judging will be done by assessors at NIST and by participants in the
track. Non-participants are welcome (encouraged!) to provide
judgments, too, though the track cannot count on that.
The process will look roughly like this from the perspective of someone judging:
- The assessment system will present 5-10 queries randomly selected
from the evaluation set. (That is, from the 10,000 queries in the
official evaluation or from the smaller set otherwise.)
- The assessor will select one of those queries to judge. The others will be returned to the pool.
- The assessor will provide the description and narrative
parts of the query, creating a full TREC topic. This information
will be used by the assessor to keep focus on what is relevant.
- The system will present a GOV2 document (Web page) and ask
whether it is relevant to the query. Judgments will be on a
three-way scale to mimic the Terabyte Track: highly relevant, relevant,
or not relevant. Consistent with past practice, the distinction
between the first two will be up to the assessor.
- Documents will be presented until "enough" have been judged (see the next section). At
that point the assessor will have the option to stop, but may continue
if he or she would like.
The system for carrying out these judgments is being built at UMass and
will be tested in the trial run.
Selection of documents for judging
Two approaches to selecting documents will be used in TREC 2007:
- Expected AP method. In
this method, documents are selected by how much they inform us about
the difference in mean average precision given all the judgments that
were made up to that point. Because average precision is
quadratic in relevance judgments, the amount each relevant document
contributes is a function of the total number of judgments made and the
ranks they appear at. Nonrelevant documents also contribute to our
knowledge: if a document is nonrelevant, it tells us that certain
terms cannot contribute anything to average precision. We
quantify how much a document will contribute if it turns out to be
relevant or nonrelevant, then select the one that we expect to
contribute the most. A SIGIR 2006 paper by Carterette, Allan, and
Sitaraman describing this approach is available in the ACM Digital Library or here.
- Statistical evaluation method. This method draws and judges a specific random sample of documents from the given
ranked lists and produces unbiased, low-variance estimates of average
precision, R-precision, and precision at standard cutoffs from these
judged documents. Additional (non-random) judged documents may also be
included in the estimation process, further improving the quality of
the estimates.
For more details, here is a working draft of a paper by Aslam, Pavlu, and Yilmaz describing this approach (note that it is a working draft, so it may change periodically).
For each query, one of the following will happen:
- The pages to be judged for the query will be selected by the
"expected AP method." A minimum of 40 documents will be judged,
though the assessor may continue beyond 40 if so motivated.
- The pages to be judged for the query will be selected by the
"statistical evaluation method." A minimum of 40 documents will
be judged, though the
assessor may continue beyond 40 if so motivated.
- The pages to be judged will be selected by alternating between
the two methods until each has selected 20 pages. If a page is
selected by more than one method, it will be presented for judgment
only once. If the lists overlap, the process will continue until
at least 40 pages have been judged, though the assessor may continue
beyond 40 if so motivated.
The assignments will be made such that the third option is selected half the
time and the other two options each occur 1/4 of the time. When
completed, half of the queries will therefore have parallel judgments
of 20 or more pages by each method, and the other half will have 40 or
more judgments by a single method.
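The alternating option can be pictured with the rough sketch below. It is only an approximation: it assumes each method's choices are available up front as a fixed list, whereas the actual methods may choose their next page in light of the judgments made so far. The function name and parameters are hypothetical.

```python
from itertools import zip_longest


def alternate_picks(method_a, method_b, target=40):
    """Alternate between two pick lists, presenting each page only once.

    method_a, method_b: document numbers in the order each method
    would select them (an assumption; see the note above).
    target: stop once this many distinct pages have been queued.
    """
    seen = set()
    to_judge = []
    for pair in zip_longest(method_a, method_b):
        for doc in pair:
            # None marks an exhausted list; a seen page was already
            # queued by the other method and is not shown again.
            if doc is None or doc in seen:
                continue
            seen.add(doc)
            to_judge.append(doc)
            if len(to_judge) == target:
                return to_judge
    return to_judge
```

When the two lists overlap, the loop simply keeps drawing from both until the target number of distinct pages is reached or both lists run out.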
In addition, a small pool of 50 queries will be randomly selected for multiple
judging. With a small random chance, the assessor's 5-10 queries
will be drawn from that pool rather than the full pool. Whereas
in the full pool no query will be considered by more than one person,
in the multiple judging pool, a query can be considered by any or even
all assessors--though no assessor will be shown the same query more
than once.
Various rules
- A manual run
is one in which a person is somehow involved in the process of
converting a query into a ranked list, whether by formulating the query
by hand, modifying the query by hand, or adjusting the ranked list by
hand. With 10,000 queries, it is unlikely that there will be
many manual runs.
- A system may not
be modified in light of the set of queries sent. You should not
look at the evaluation queries before you are ready to run your system.
Trial run protocol
- Download the 100 trial run queries. They are here as a text file. (Also available at NIST.)
- Run those queries to generate 100 ranked lists and put them
in NIST standard format. You can verify the format of your data
using a script provided by NIST.
- Upload the ranked lists to NIST (you need the TREC password).
- Register with the judging system.
- When prompted, use the judging system to mark retrieved documents as relevant or not.
Official run protocol
- Download the 10,000 queries from the TREC active participants web site (you need the TREC password).
- Run those queries to generate 10,000 ranked lists and put
them in NIST standard format. You can verify the format of your
data using a script provided by NIST. (Use the script: NIST will reject a submission that doesn't pass the script, and you'll end up doing the upload again.)
- Upload the ranked lists to NIST (you need the TREC password).
- Register with the judging system.
- When prompted, use the judging system to mark retrieved documents as relevant or not.
You may submit up to five runs for judging, ranked in order of
preference that they be included. If possible, all runs will be
included in the selection process, but if they cannot, the higher
ranked runs will be selected first. Each run should have 1,000
documents ranked, though it may have fewer. If you are returning
zero documents for a query, instead return the single document
"GX000-00-0000000".
Submission format
The standard NIST format for submissions is a single ASCII text file.
White space is used to separate columns. The width of the
columns is not important but you must have exactly six columns per line
with at least one space between the columns. For example,
100 Q0 ZF08-175-870 1 9876 mysys1
100 Q0 ZF08-306-044 2 9875 mysys2
The contents of the columns are:
- The first column is the topic number.
- The second column is unused but must always be the string "Q0" (letter Q, number zero).
- The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
- The fourth column is the rank of that document for that query.
Within a query, each of the numbers from 1 to 1000 should appear
exactly once. (If you retrieve fewer than 1000 documents, then
you will have the numbers from 1 to that number.)
- The fifth column is the score your system generated to rank this
document, either as an integer or a floating point number. Scores
must be in descending order. Note that typical TREC evaluations
use this column, not the rank column, to evaluate systems. If you
want the precise ranking you submit to be evaluated, the scores must
reflect that ranking.
- The sixth column is your "run tag" and is a unique identifier for
your group and the particular run. Please change the tag from
year to year, track to track, and run to run, so that different
approaches can be compared. Run tags may contain 12 or fewer
letters and numbers with no punctuation (and no white space, or the
line would have more than six columns).
Your submission should be sorted by rank within topic number.
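The column rules above can be checked line by line before running the official checking script. The sketch below is a hypothetical helper, not a substitute for the NIST script: it covers only the per-line rules stated in this section and does not verify ordering across lines or that document numbers exist in the collection.

```python
import re

# Run tags: 12 or fewer letters and numbers, no punctuation.
RUN_TAG = re.compile(r"[A-Za-z0-9]{1,12}")


def check_run_line(line):
    """Return a list of problems found in one submission line."""
    cols = line.split()
    if len(cols) != 6:
        return ["expected 6 columns, found %d" % len(cols)]
    topic, q0, docno, rank, score, tag = cols
    problems = []
    if not topic.isdigit():
        problems.append("topic number is not an integer")
    if q0 != "Q0":
        problems.append('second column must be "Q0"')
    if not rank.isdigit() or not 1 <= int(rank) <= 1000:
        problems.append("rank must be an integer from 1 to 1000")
    try:
        float(score)  # integers and floating point numbers both parse
    except ValueError:
        problems.append("score is not a number")
    if not RUN_TAG.fullmatch(tag):
        problems.append("run tag must be 1-12 letters and numbers")
    return problems
```

An empty list means the line passed these checks; anything else names the rule that was broken.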
If you would normally return no documents for a query, instead return
the single document "GX000-00-0000000" at rank one. Doing so
maintains consistent evaluation results (averages over the same number
of queries) and does not break anyone's tools.
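One way to produce lines in this format, including the placeholder for an empty result, is sketched below. The function name, the (docno, score) input shape, and the score of 0.0 given to the placeholder are assumptions for illustration; the guidelines do not specify a score for the placeholder document.

```python
EMPTY_DOC = "GX000-00-0000000"  # placeholder named in the guidelines


def format_run_lines(topic, scored_docs, run_tag, max_docs=1000):
    """Build six-column submission lines for one topic.

    scored_docs: list of (docno, score) pairs in any order.
    """
    # Sort by descending score so that rank and score agree.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:max_docs]
    if not ranked:
        # Assumption: the placeholder's score is arbitrary; 0.0 is used here.
        ranked = [(EMPTY_DOC, 0.0)]
    return [
        "%s Q0 %s %d %s %s" % (topic, docno, rank, score, run_tag)
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]
```

Calling the function once per topic, in topic order, and writing the returned lines one after another yields a file sorted by rank within topic number.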
For the Million Query Track, the submission file will be quite large
(up to 10,000,000 lines of text). All submission files must
therefore be compressed before being uploaded. Use either gzip or bzip2 for the compression.
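If gzip is not convenient on the command line, the compression can also be done with a few lines of code. The sketch below uses Python's standard gzip module; the function name and the ".gz" output naming are assumptions, not a NIST requirement.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip and return the output path."""
    if dst_path is None:
        dst_path = src_path + ".gz"
    # Copy in chunks so a very large run file is never held
    # in memory all at once.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Usage (the file name is hypothetical):
#   gzip_file("mysys1.run")  # writes mysys1.run.gz
```

Run the checking script on the uncompressed file first, then compress and upload.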
Plan on your uploads taking several hours. Some tips from experienced submitters:
- Run the check_mq.pl program before you try to upload a large
file. NIST will reject your submission if it does not pass the
checking program, so you should try it yourself first.
- Make sure you fill out the submission page information completely and in the correct form. NIST will upload your run before validating the information in the form. If you made a mistake, you will have to upload the data again.