HARD, High Accuracy Retrieval from Documents
Guidelines (TREC 2005)
Change history:
- V4, July 25, 5:00pm. Added rotation information describing
the order in which clarification forms were filled out. Also
described how time information was returned.
- V3b, July 6, 8:50pm. Added a pointer to NIST's message about
submissions.
- V3, July 6, 8:45pm. Added direct pointers to the submission page
and deleted reference to sample clarification forms.
- V2, June 16, 10:00am. Added text to emphasize that one
baseline and one final run per site will be included in the NIST
judging pool.
- V1, June 2, 5:00pm. Initial release.
Please check the official HARD Web site
(http://ciir.cs.umass.edu/research/hard) to ensure that you are looking
at the latest guidelines.
This page is the authoritative source of information on the TREC 2005
HARD track.
Summary of changes from TREC 2004's track
- There is no passage retrieval evaluation as part of the track
this year. The format of submitted runs will take that into
account.
- The corpus will be the full AQUAINT collection. HARD 2003 used
part of AQUAINT plus additional documents. HARD 2004 used a
collection of news from 2003 collated especially for HARD.
- The topics will be selected from existing TREC topics. They
will not include metadata, with the exception of an "expertise" field.
- There will be no notion of "hard relevance" and "soft
relevance". Documents will be judged for relevance.
- Clarification forms can be much more complex this year.
- The official evaluation measure may be different from last year's,
but that is not yet resolved.
- The submission format for runs does not include passage offset
and length. Instead, the format is the "classic" TREC submission
format.
Corpus and topics
The test collection will be the AQUAINT corpus. Instructions for
acquiring this corpus were sent with the "welcome" email sent to
participants by NIST. In a nutshell, you need to fill out the
AQUAINT organization permission form available in the "Forms" part of
the active participants' part of the TREC web site and send it to
NIST. They'll ship you the data, assuming all is well. (You
also need to have filled out the agreement concerning dissemination of
TREC results.)
Topics will be the same topics used by the Robust track. There
will be 50 topics culled from two existing sets of topics: (1) a set of
50 topics designated as difficult and (2) a set of 25-50 topics that have
had low precision scores in the Robust track. The topics will be
culled to ensure that they are likely to have some but not too many
relevant documents in the AQUAINT corpus. There will be no effort
to ensure that the topics vary along any dimension. Note that this is in
contrast to earlier HARD tracks where topics had varying subject,
geographical focus, and so on.
WARNING. These queries have appeared in TREC before and already
have relevance judgments for TREC corpora. They do not have any
relevance judgments for the AQUAINT corpus. You should do your
best to avoid training your HARD system on those topics. In
particular, you should not build a system that explicitly uses those
earlier topics and their relevance judgments to improve your results in
HARD. If you choose to do that anyway, you must mark your
submission as a supervised run when you submit it.
To provide some connection to past HARD tracks, the assessors will
indicate their level of experience with the topics they are
judging. To do that, they answer the following question:
Record your familiarity with the topic
using a number between 1 and 7 inclusive, where 1 stands for "I don't
even understand what the topic means" and 7 stands for "I am a
world-class expert on this subject".
The topic format may unfortunately be a mix of formats since we are
using old topics and since NIST prefers to keep them in that format.
For examples of the TREC topic format, see the TREC 2004 Robust test
set on the TREC English test questions file list. (No, they are not
XML-compliant. A backward step from last year. Sorry.)
Site codes
Each site will be assigned a 4-character code that must be used as part
of all submissions. The purpose of the assigned site code is to
ensure that the submission and evaluation process can match up baseline
runs, clarification forms, and final runs. When codes are assigned,
participants will be notified by email.
Baseline run
NIST will be able to include one baseline run (and one final run) per
site in the judging pool. Be careful that you select your
preferences properly when you submit your baseline run.
Format of submissions
Submissions must be provided in a single file that contains a ranked
list of no more than 1000 documents for each of the 50 topics in the
HARD test set run against the AQUAINT document collection. The
columns on each line are separated by white space. The width of the
columns is not important, but it is important that you have exactly six
columns per line with at least one space between the columns. The
format is,
topic-num Q0 document-id rank score run-tag
where:
- topic-num is the topic number.
- Q0 is the query number within the topic. This field is currently
unused but must be provided and must have the value "Q0" (the letter Q
followed by the number zero).
- document-id is the official document number of the retrieved
document and is the number found in the "docno" field of the document.
- rank is the rank at which the document was retrieved for this
topic, where the document most likely to be relevant has rank 1.
- score is the score of that document. The score must be in
descending (non-increasing) order. The evaluation routines score
systems based on the scores, not the ranks. If you want the precise
ranking you submit to be evaluated, the score field must reflect that
ranking (i.e., have no ties).
- run-tag is a unique identifier for your group and for the method
used. This can be any value, provided it is 12 or fewer characters and
does not contain whitespace or a colon (:), and provided that all of
your submitted runs this year have different run tags. You are
encouraged to change the tag from year to year to help with cross-year
comparisons should they ever happen. You may want to use your site
code as part of the run tag, but that is not required.
Here is an example showing a few lines from a hypothetical submission
(this is not a HARD submission example):
630 Q0 NYT19990430.0001 1 4238 prise1
630 Q0 APW20000805.0004 2 4223 prise1
630 Q0 XIE19971213.0003 3 4207 prise1
630 Q0 NYT19980830.0021 4 4194 prise1
630 Q0 APW19981105.0054 5 4189 prise1
Each topic must have at least one document retrieved for it.
Provided you have at least one document, you may return fewer than 1,000
documents for a topic. However, note that the standard evaluation
measures used in TREC count empty ranks as not relevant. You
cannot hurt your score and you could conceivably improve it for those
measures, by returning 1,000 documents per topic. (Measures such
as R-precision only consider the top R documents, but we may explore
other measures as part of the track.)
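The column layout and the no-ties advice above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function name format_run and the run tag "demo1" are made up for the example, and the scores it writes are synthetic (derived from rank) rather than your system's real scores, which is one simple way to guarantee a strictly decreasing, tie-free score column.

```python
def format_run(topic, ranked_docnos, run_tag, max_docs=1000):
    """Return run-file lines for one topic from a best-first list of docnos.

    Hypothetical helper: the six columns follow the format described in
    these guidelines, but the score is synthetic, not a retrieval score.
    """
    docs = ranked_docnos[:max_docs]  # no more than 1000 documents per topic
    lines = []
    for rank, docno in enumerate(docs, start=1):
        # Strictly decreasing with rank, so score order equals rank order.
        score = len(docs) - rank + 1
        lines.append(f"{topic} Q0 {docno} {rank} {score} {run_tag}")
    return lines


if __name__ == "__main__":
    for line in format_run(630, ["NYT19990430.0001", "APW20000805.0004"], "demo1"):
        print(line)
```

If you prefer to keep your system's own scores, the same loop works as long as you break ties yourself before writing the file.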
As was the case last year, it is unlikely that sites will be able to
submit more than 10 runs across baseline and final runs. Please
plan accordingly.
NIST has a routine that checks for common errors in the result files
(e.g., duplicate document numbers for the same topic, invalid document
numbers, wrong format, multiple tags within a run). That routine
will be made available to participants to check their runs for errors
prior to submitting them. Submitting runs is an automatic process
done through a web form, and runs that contain errors cannot be
processed.
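Until you have the NIST routine in hand, a rough local check along the following lines may catch the most obvious problems. It is a sketch written for this page, not the NIST checker: the name check_run is hypothetical, and it does not verify that document numbers exist in AQUAINT or that all 50 topics are present.

```python
import re
from collections import defaultdict

MAX_DOCS_PER_TOPIC = 1000
# Run tag: 12 or fewer characters, no whitespace, no colon.
RUN_TAG_RE = re.compile(r"^[^\s:]{1,12}$")


def check_run(lines):
    """Return a list of error messages for a run given as an iterable of lines."""
    errors = []
    seen_docs = defaultdict(set)    # topic -> document ids already listed
    line_count = defaultdict(int)   # topic -> number of well-formed lines
    last_score = {}                 # topic -> score on the previous line
    run_tags = set()

    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(fields)}")
            continue
        topic, q0, docno, rank, score, tag = fields
        if q0 != "Q0":
            errors.append(f"line {n}: second column must be 'Q0'")
        if not RUN_TAG_RE.match(tag):
            errors.append(f"line {n}: bad run tag {tag!r}")
        run_tags.add(tag)
        try:
            score_value = float(score)
        except ValueError:
            errors.append(f"line {n}: score {score!r} is not a number")
            continue
        if docno in seen_docs[topic]:
            errors.append(f"line {n}: duplicate document {docno} for topic {topic}")
        seen_docs[topic].add(docno)
        line_count[topic] += 1
        if line_count[topic] == MAX_DOCS_PER_TOPIC + 1:
            errors.append(f"topic {topic}: more than {MAX_DOCS_PER_TOPIC} documents")
        if topic in last_score and score_value > last_score[topic]:
            errors.append(f"line {n}: score increases within topic {topic}")
        last_score[topic] = score_value

    if len(run_tags) > 1:
        errors.append(f"multiple run tags in one file: {sorted(run_tags)}")
    return errors
```

A typical use would be check_run(open("myrun.txt")), where "myrun.txt" stands for your own run file; an empty list means none of these particular checks failed.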
Where and how to submit
To submit a baseline run, go to the TREC submissions page for
HARD (or get there from the general
submission page). Be sure you have selected "baseline run" as
your submission type. You can upload a single run's submission at
a time by specifying a file and providing the following information:
- Was this an entirely automatic run or a manual run?
- Did you use the title, description, and/or narrative fields for
this run?
- To what extent did you use earlier relevance judgments on the
topics?
- A short description of the run that can be used by the track
coordinator to understand what happened and describe it for the track
report. These will not be looked at until judgments have been
returned.
- What is the preference in terms of judging of this run? One
baseline run and one final run from each site will be included in the
pool. Preferences beyond one are asked on the very unlikely
chance that more runs can be included.
The final run instructions are similar but the questions are slightly
different.
NOTE: Please see NIST's message about submissions to avoid common
mistakes.
It is available as text or email.
Clarification forms
Format of and restrictions on forms
The clarification forms will be filled out by the topic assessors at
NIST. They will be using the following platform:
- Red Hat Enterprise Linux version "3 workstation"
- 20-inch LCD monitor with 1600x1200 resolution, true color
(millions of colors)
- Firefox Web browser, v1.0.3
- Disconnected from all networks of any sort
You may submit almost any type of clarification form that you like,
including JavaScript, Java, images, or the like. The following
restrictions apply:
- The forms will be running on a computer that is disconnected from
all networks, so you must provide all necessary information as part of
the form. If it requires multiple files, they must all be within
the same directory structure. You cannot assume that your other
clarification forms will be on the same computer.
- It will not be possible to invoke any cgi-bin scripts
- It will not be possible to write to disk
It is your responsibility to ensure that your form will run properly in
the described environment. Unfortunately, we have no mechanism to
validate clarification forms, so you may want to avoid complex forms
that might fail because of an unanticipated configuration glitch.
Your clarification form must include the following items:
- <form action="/cgi-bin/clarification_submit.pl" method="post">
This indicates the script where the output will be generated. You
are welcome to use this cgi URL during development of your form, since
all it does is output the selected information.
- <input type="hidden" name="site" value="XXXXn">
Here, "XXXX" is a 4-letter code designating your site and "n" is a run
number. The site codes will be provided in the lead-up to the
baseline submission. The run numbers should reflect your priority
order. That is, XXXX1 will be processed then XXXX2 and so on.
If you only have one set of forms, please use 1. For
example, the first submission from NIST would be NIST1.
- <input type="hidden" name="topicid" value="000">
Indicates the topic number. It should be a 3-digit code, padded with
zeros as needed. So 001 rather than 01 or 1.
- <input type="submit" name="send" value="submit">
This is the submit button that should appear somewhere on your page.
In addition, you are strongly encouraged to include somewhere on the
page the topic number (e.g., "001") and the title of the topic.
The purpose of including this is to provide a sanity check that the
annotators are, indeed, answering the correct questions.
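As a rough illustration of the required items, the following Python sketch generates a minimal index.html for one topic. The function name build_form, the page layout, and the sample checkbox are assumptions made for the example; only the form action, the three required input elements, and the zero-padded topic number come from the requirements above.

```python
import html

# Minimal page skeleton. Everything except the <form> tag and the three
# required <input> elements is illustrative.
FORM_TEMPLATE = """<html>
<head><title>Topic {topic_id}: {title}</title></head>
<body>
<h1>Topic {topic_id}: {title}</h1>
<form action="/cgi-bin/clarification_submit.pl" method="post">
<input type="hidden" name="site" value="{site}">
<input type="hidden" name="topicid" value="{topic_id}">
{body}
<input type="submit" name="send" value="submit">
</form>
</body>
</html>
"""


def build_form(site_code, run_number, topic_number, title, body_html):
    """Return the text of a clarification form for one topic."""
    if len(site_code) != 4:
        raise ValueError("site code must be exactly 4 characters")
    return FORM_TEMPLATE.format(
        site=f"{site_code}{run_number}",      # e.g. NIST1
        topic_id=f"{topic_number:03d}",       # e.g. 043, never 43
        title=html.escape(title),             # shown as a sanity check
        body=body_html,                       # your own questions go here
    )


if __name__ == "__main__":
    question = '<input type="checkbox" name="kw1" value="yes"> example keyword'
    print(build_form("NIST", 1, 43, "Hypothetical topic title", question))
```

The body is inserted as-is, so any JavaScript or extra inputs you add there are your responsibility to test in the environment described above.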
Two clarification forms from each site will be filled out. If
participation and time permit, NIST will consider filling out
additional forms from each site, but it is unlikely that will
happen. The naming convention on forms will make it clear the
order in which you want the forms to be filled out. For example,
XXXX1 then XXXX2 then, if time permits, XXXX3.
For each submission, put all of your clarification forms in a single
directory (folder) with the name indicated (e.g., NIST1). Each
clarification form inside that directory should also be a directory
with the name of the submission and the topic number (e.g., NIST1_043
for topic 43 of the NIST1 submission). Note that the topic number
must be 0-filled to three digits.
Inside that directory, the main clarification form should be called
index.html. It may access any files from within your directory
hierarchy, using relative pathnames. For example, "logo.gif" would
refer to the file NIST1/NIST1_043/logo.gif within the directory
structure, and "../logo.gif" would refer to NIST1/logo.gif. Do not
refer to any files outside of your directory structure. Do not refer
to files with absolute path names in the URL since (1) the absolute
names are not known and (2) there is no access to files on the network.
Where and how to submit
Create a tar file that contains all of the forms for a run and
optionally gzip the file. The file should be named XXXXn.tar (or
XXXXn.tar.gz) where XXXX is your site code and n is the submission number (the
order in which the clarification forms should be included). The
tar file should contain exactly 50 directories with names as described
above.
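One way to produce such a file is sketched below using Python's standard tarfile module. The function name build_tarball is hypothetical, and the sketch assumes the layout described earlier, with a top-level XXXXn directory containing one XXXXn_NNN directory per topic, each holding an index.html. It writes only the index.html files, so any supporting files (images, scripts) would need to be added in the same way.

```python
import io
import tarfile


def build_tarball(submission, forms):
    """Return the bytes of a gzipped tar file for one set of forms.

    submission: name such as "NIST1" (site code plus submission number).
    forms: dict mapping topic number (int) to the text of its index.html.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for topic_number in sorted(forms):
            data = forms[topic_number].encode("utf-8")
            # Topic number is zero-filled to three digits, e.g. NIST1_043.
            name = f"{submission}/{submission}_{topic_number:03d}/index.html"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    blob = build_tarball("NIST1", {43: "<html></html>"})
    with open("NIST1.tar.gz", "wb") as out:
        out.write(blob)
```

Before uploading, it is worth listing the archive (for example with tar tzf) and confirming that all 50 topic directories are present.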
To submit your clarification forms, go to the TREC submissions page for
HARD (or get there from the general
submission page). Be sure to select "clarification form" as
the submission type. That will cause the following questions to
appear:
- Did you use clustering to generate this form?
- Did you use text summarization, either extractive or generative?
- Did you use document-level feedback? That is, did you ask
the user to judge an entire document for relevance, even if you did so
using a title, passage, or keywords from the document?
- Did you ask the user to judge selected passages of text,
independent of the documents they came from?
- Did you ask the user to judge keywords for relevance, independent
of the documents they came from?
- If you used any techniques not listed above, briefly list them at
the bullet-list level of detail.
- Did you use any sources of information beyond the query and
AQUAINT corpus and, if so, what were they?
This information will be used primarily by the track coordinator in
generating a report on the track and will not be looked at until after
all final runs have been submitted (at the earliest).
NOTE: Please see NIST's message about submissions to avoid common
mistakes.
It is available as text
or email.
How forms will be used
The assessors will spend no more than
three minutes per form no matter how complex your form is.
The three minutes includes time
needed to load the form (from local disk since there is no network
access), initialize it, and do any rendering, so unusually complex or
large forms will be implicitly penalized. At the end of three
minutes, if the assessor has not pressed the "submit" button, the form
will be timed out and forcibly submitted (anything entered up to that
point should be saved). If your form somehow actively prevents
the submission from happening at the end of three minutes, the
form will be rejected, no further forms from that submission will be
processed, and you will receive no clarification responses from that
submission. Note that this implies you should not have
entry-validation code that prevents the submit button from being
pressed. A validation phase that asks
the assessor to re-edit or "submit anyway" is acceptable, since it does
not force the annotator to spend more than three minutes.
NIST recorded the time spent on each form. That information was
returned in a separate file along
with all of the clarification form responses. Assessors
were never permitted more than 180 seconds per form, but some of the
reported times were greater than 180 because of the time it took for
the system to "shut down" a form if the time limit expired.
Clarification forms will be presented to annotators in an order chosen
to minimize the chance that one form will adversely (or positively)
impact the use of another form. Here is the rotation
that was used for the submitted clarification forms (graciously
generated by UNC in very little time). The rows of the table
correspond to topics and the columns to clarification forms from
sites. For example, the table indicates that NCAR's primary
clarification form (NCAR1) will be the 28th considered for topic 1, the
29th for topic 2, ..., the 1st for topic 8, and so on. Similarly,
for topic 1, the assessor first did INDI1's form, then that for CASP1,
then UIUC1's, followed by MEIJ1's, and so on.
Format of returned information
The results will be returned as one tar file per submission. The
file XXXXn_responses.tar will contain exactly 50 files with names of
the format XXXXn_000 where XXXX is the site code, n is the
submission number, and 000 represents the topic number. You will
receive one tar file per submission that was answered by the assessors.
The resulting tarball will be emailed to the contact person in the
submission once all clarification forms are processed.
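When matching returned files back to topics, a small name parser may help. The sketch below assumes names of exactly the form XXXXn_000 (a 4-character site code, a numeric submission number, an underscore, and a 3-digit topic number). The function name parse_form_name is hypothetical, and nothing here depends on the contents of the response files, whose format is not described on this page.

```python
import re

# Assumed pattern: 4-character site code, submission number, "_", 3-digit topic.
NAME_RE = re.compile(r"^(?P<site>.{4})(?P<run>\d+)_(?P<topic>\d{3})$")


def parse_form_name(name):
    """Split a name such as "NIST1_043" into (site code, submission, topic)."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected name: {name!r}")
    return match["site"], int(match["run"]), int(match["topic"])


if __name__ == "__main__":
    print(parse_form_name("NIST1_043"))
```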
Final runs
After you have received the clarification form responses, you should
re-run the queries with that extra information. Your final runs
will be new ranked lists of documents that are, we hope, better than
the baseline runs.
As was the case last year, it is unlikely that sites will be able to
submit more than 10 runs across baseline and final runs. Please
plan accordingly.
NIST will be able to include one final run (and one baseline run) per
site in the judging pool. Be careful that you select your
preferences properly when you submit your final run.
Where and how to submit
Use the same format for these runs as you used for the baseline runs.
To submit a run, go to the TREC submissions page for
HARD (or get there from the general
submission page). Be sure to select "final run" as the
submission type to get the following questions:
- Which of your baseline runs is an appropriate baseline?
That is, in considering percent improvement from a baseline, what is
the starting point? It is possible for the answer to this question
to be "none".
- Which of your clarification forms was used to generate this
final run? It is plausible that a final run could ignore
clarification forms, in which case the answer could be "none". It is
also plausible that a final run could integrate information from
multiple clarification forms, in which case the answer will be "none".
- Other than the clarification form's being answered, was this an
entirely automatic run or a manual run?
- Did you use the title, description, and/or narrative fields for
this run?
- To what extent did you use earlier relevance judgments on the
topics?
- A short description of the run that can be used by the track
coordinator to understand what happened and describe it for the track
report. These will not be looked at until judgments have been
returned.
- What is the preference in terms of judging of this run? One
baseline run and one final run from each site will be included in the
pool.
The baseline run instructions are similar but the questions are
slightly different.
NOTE: Please see NIST's message about submissions to avoid common
mistakes.
It is available as text
or email.
Training topics and corpus
The following data collections from TREC 2003 and 2004 are available
for training. All of this data is being made available to HARD
track participants by courtesy of the Linguistic Data Consortium.
Because this is a courtesy on their part, please do not ask for copies
of corpora that you already have. The document collections will
be shipped on DVD. The topics, relevance judgments, and
clarification forms will be made available by ftp.
The corpora will be provided for use only in this evaluation with the
expectation that they will be destroyed at the completion of the track
(i.e., after your final papers are written). You may be able to
arrange to keep the collections longer: the LDC is likely to release
these as
collections to its members, and you may be able to arrange longer-term
rights to use the data in other ways.
If you are participating in this year's track, you were asked to
indicate that information in early May. The sites indicating
active interest were provided to the LDC and they contacted those sites
directly to make the necessary arrangements.
TREC 2004.
- The corpus was a set of news from 2003.
- There were 49 topics with several metadata fields.
TREC 2003.
- The corpus was a set of 372,219 documents totaling 1.7 GB from the
1999 portion of the AQUAINT corpus, along with some US government
documents from the same year (congressional record and federal
register).
- The topics were somewhat like standard TREC topics, but included
lots of searcher and query metadata.
Evaluation
System output will be evaluated by R-precision (precision at R
documents retrieved, where R is the number of known relevant documents
in the collection). This will be used as the "official"
measure for the track.
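For a single topic, R-precision as defined above can be computed as in the following sketch. The function name r_precision and the choice to score a topic with no known relevant documents as 0.0 are assumptions made for the example. The official numbers come from the NIST evaluation software, not from this code.

```python
def r_precision(ranked_docnos, relevant_docnos):
    """Precision at rank R, where R is the number of known relevant documents.

    ranked_docnos: document ids in the order the system ranked them.
    relevant_docnos: ids of the documents judged relevant for the topic.
    """
    relevant = set(relevant_docnos)
    r = len(relevant)
    if r == 0:
        return 0.0  # assumption: no known relevant documents scores zero
    # Ranks past the end of a short list count as not relevant.
    hits = sum(1 for docno in ranked_docnos[:r] if docno in relevant)
    return hits / r


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]
    print(r_precision(ranking, {"d1", "d3", "d9"}))
```

In the usage example R is 3 and two of the top three documents are relevant, so the result is 2/3. A track-level figure would presumably average this value over the 50 topics.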
We would also like to explore improvements due to clarification forms
between the baseline and the final runs. To do that, final run
submissions will be asked to indicate which baseline submission (if
any) corresponds, and which clarification form (if any)
corresponds. Track summary information will show the gains over
the baseline. (The gain is not used as an "official"
measure because it can too easily be gamed.)
We also hope to explore "gains per unit time" by considering the amount
of improvement provided by clarification forms divided by the amount of
time spent on the clarification form. This information will be
explored only for final runs that have a corresponding clarification
form.
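The arithmetic behind "gains per unit time" might look like the sketch below. The guidelines do not say whether improvement is measured as an absolute or a relative difference, so this example assumes an absolute difference in the evaluation score (for instance R-precision) and time measured in seconds. The function names are hypothetical.

```python
def gain_over_baseline(baseline_score, final_score):
    """Absolute improvement of the final run over its baseline (assumed form)."""
    return final_score - baseline_score


def gain_per_second(baseline_score, final_score, seconds_on_form):
    """Improvement divided by the time the assessor spent on the form."""
    if seconds_on_form <= 0:
        raise ValueError("time spent on the form must be positive")
    return gain_over_baseline(baseline_score, final_score) / seconds_on_form


if __name__ == "__main__":
    # Made-up numbers: baseline 0.25, final 0.40, 150 seconds on the form.
    print(gain_per_second(0.25, 0.40, 150))
```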
Timeline
Tentative schedule of events:
- Now, training information and test document collection available
- May 31, track guidelines established
- June 15, test topics available
- July 7, baseline runs from systems due
- July 7, clarification forms must be submitted to NIST
- July 25, clarification form responses sent to systems from NIST
- August 8, final runs due
- Not later than October 1, 2005, sites receive relevance judgments
("qrels") and evaluation results
- Late October, conference notebook papers due
- November 15-18, TREC conference at NIST in Gaithersburg, Maryland
(for TREC participants only)