HARD, High Accuracy Retrieval from Documents
Guidelines (TREC 2005)


Change history:
Please check the official HARD Web site (http://ciir.cs.umass.edu/research/hard) to ensure that you are looking at the latest guidelines. 

This page is the authoritative source of information on the TREC 2005 HARD track.


Summary of changes from TREC 2004's track

  1. There is no passage retrieval evaluation as part of the track this year.  The format of submitted runs will take that into account.
  2. The corpus will be the full AQUAINT collection.  HARD 2003 used part of AQUAINT plus additional documents.  HARD 2004 used a collection of news from 2003 collated especially for HARD.
  3. The topics will be selected from existing TREC topics.  They will not include metadata, with the exception of an "expertise" field.
  4. There will be no notion of "hard relevance" and "soft relevance".  Documents will be judged for relevance.
  5. Clarification forms can be much more complex this year.
  6. The official evaluation measure may differ from last year's, but that has not yet been resolved.
  7. The submission format for runs does not include passage offset and length.  Instead, the format is the "classic" TREC submission format.


Corpus and topics

The test collection will be the AQUAINT corpus.  Instructions for acquiring this corpus were included in the "welcome" email sent to participants by NIST.  In a nutshell, you need to fill out the AQUAINT organization permission form available in the "Forms" part of the active participants' part of the TREC web site and send it to NIST.  They'll ship you the data, assuming all is well.  (You also need to have filled out the agreement concerning dissemination of TREC results.)


Topics will be the same topics used by the Robust track.  There will be 50 topics culled from two existing sets of topics: (1) a set of 50 topics designated as difficult and (2) a set of 25-50 topics that have had low precision scores in the robust track.  The topics will be culled to ensure that they are likely to have some but not too many relevant documents in the AQUAINT corpus.  There will be no effort to ensure that the topics vary along any dimension.  Note that this is in contrast to earlier HARD tracks where topics had varying subject, geographical focus, and so on.

WARNING.  These queries have appeared in TREC before and already have relevance judgments for TREC corpora.  They do not have any relevance judgments for the AQUAINT corpus.  You should do your best to avoid training your HARD system on those topics.  In particular, you should not build a system that explicitly uses those earlier topics and their relevance judgments to improve your results in HARD.  If you choose to do that anyway, you must mark your submission as a supervised run when you submit it.

To provide some connection to past HARD tracks, the assessors will indicate their level of experience with the topics they are judging.  To do that, they answer the following question:

Record your familiarity with the topic using a number between 1 and 7 inclusive, where 1 stands for "I don't even understand what the topic means" and 7 stands for "I am a world-class expert on this subject".

The topic format may unfortunately be a mix of formats since we are using old topics and since NIST prefers to keep them in that format.  For examples of the TREC topic format, see the TREC 2004 Robust test set on the TREC English test questions file list.  (No, they are not XML-compliant.  A backward step from last year.  Sorry.)


Site codes

Each site will be assigned a 4-character code that must be used as part of all submissions.  The purpose of the assigned site code is to ensure that the submission and evaluation process can match up baseline runs, clarification forms, and final runs.  When codes are assigned, participants will be notified by email.


Baseline run

NIST will be able to include one baseline run (and one final run) per site in the judging pool.  Be careful to select your judging preferences properly when you submit your baseline run.

Format of submissions

Submissions must be provided in a single file that contains a ranked list of no more than 1000 documents for each of the 50 topics in the HARD test set run against the AQUAINT document collection.  The columns of each line are separated by white space.  The width of the columns is not important, but it is important that you have exactly six columns per line with at least one space between the columns.  The format is,

topic-num  Q0  document-id  rank  score  run-tag

where:
Here is an example showing a few lines from a hypothetical submission (this is not a HARD submission example):

630 Q0 NYT19990430.0001  1 4238 prise1
630 Q0 APW20000805.0004  2 4223 prise1
630 Q0 XIE19971213.0003  3 4207 prise1
630 Q0 NYT19980830.0021  4 4194 prise1
630 Q0 APW19981105.0054  5 4289 prise1

Each topic must have at least one document retrieved for it.  Provided you have at least one document, you may return fewer than 1,000 documents for a topic.  However, note that the standard evaluation measures used in TREC count empty ranks as not relevant.  For those measures, you cannot hurt your score, and you could conceivably improve it, by returning 1,000 documents per topic.  (Measures such as R-precision only consider the top R documents, but we may explore other measures as part of the track.)
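
As an illustration of the six-column format, here is a minimal Python sketch that turns scored documents for one topic into run lines.  The function name format_run and its parameters are hypothetical and not part of any NIST tool; the sketch assumes ranks start at 1, as in the example above, and that higher scores are better.

```python
def format_run(topic_num, scored_docs, run_tag, max_docs=1000):
    """Return classic TREC run lines for one topic, best score first.

    scored_docs is a list of (document_id, score) pairs.
    """
    if not scored_docs:
        # The guidelines require at least one document per topic.
        raise ValueError(f"topic {topic_num}: at least one document is required")
    # Sort by descending score and keep no more than max_docs documents.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:max_docs]
    return [
        f"{topic_num} Q0 {doc_id} {rank} {score} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


lines = format_run(
    630,
    [("APW20000805.0004", 4223), ("NYT19990430.0001", 4238)],
    "prise1",
)
```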

As was the case last year, it is unlikely that sites will be able to submit more than 10 runs across baseline and final runs.  Please plan accordingly.

NIST creates a routine that checks for common errors in the result files (e.g., duplicate document numbers for the same topic, invalid document numbers, wrong format, multiple tags within a run).  That routine will be made available to participants to check their runs for errors prior to submitting them.  Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.
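
Until the NIST routine is available, a rough local check can catch some of the same errors.  The sketch below is not the NIST routine; check_run is a hypothetical name, and it only covers column count, duplicate documents within a topic, the 1000-document limit, and multiple run tags.  It does not verify that document numbers exist in AQUAINT, which would require a list of valid identifiers.

```python
from collections import defaultdict


def check_run(lines, max_docs=1000):
    """Return a list of error messages for a run given as an iterable of lines."""
    errors = []
    seen = defaultdict(set)  # topic -> document ids already listed
    tags = set()
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, _q0, doc_id, _rank, _score, tag = cols
        if doc_id in seen[topic]:
            errors.append(f"line {n}: duplicate document {doc_id} for topic {topic}")
        seen[topic].add(doc_id)
        tags.add(tag)
    for topic, docs in seen.items():
        if len(docs) > max_docs:
            errors.append(f"topic {topic}: more than {max_docs} documents")
    if len(tags) > 1:
        errors.append(f"multiple run tags: {sorted(tags)}")
    return errors


good = [
    "630 Q0 NYT19990430.0001 1 4238 prise1",
    "630 Q0 APW20000805.0004 2 4223 prise1",
]
bad = good + ["630 Q0 NYT19990430.0001 3 4207 other", "630 Q0 short"]
```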

Where and how to submit

To submit a baseline run, go to the TREC submissions page for HARD (or get there from the general submission page).  Be sure you have selected "baseline run" as your submission type.  You can upload a single run's submission at a time by specifying a file and providing the following information:
  1. Was this an entirely automatic run or a manual run?
  2. Did you use the title, description, and/or narrative fields for this run?
  3. To what extent did you use earlier relevance judgments on the topics?
  4. A short description of the run that can be used by the track coordinator to understand what happened and describe it for the track report.  These will not be looked at until judgments have been returned.
  5. What is the preference in terms of judging of this run?  One baseline run and one final run from each site will be included in the pool.  Preferences beyond one are asked on the very unlikely chance that more runs can be included.
The final run instructions are similar but the questions are slightly different.

NOTE: Please see NIST's message about submissions to avoid common mistakes.  It is available as text or email.

Clarification forms

Format of and restrictions on forms

The clarification forms will be filled out by the topic assessors at NIST.  They will be using the following platform:
You may submit almost any type of clarification form that you like, including JavaScript, Java, images, or the like.  The following restrictions apply:
It is your responsibility to ensure that it will run properly on the described environment.  Unfortunately, we have no mechanism to validate clarification forms, so you may want to avoid complex forms that might fail because of an unanticipated configuration glitch.

Your clarification form must include the following items:
In addition, you are strongly encouraged to include somewhere on the page the topic number (e.g., "001") and the title of the topic.  The purpose of including this is to provide a sanity check that the annotators are, indeed, answering the correct questions. 

Two clarification forms from each site will be filled out.  If participation and time permit, NIST will consider filling out additional forms from each site, but it is unlikely that will happen.  The naming convention on forms will make it clear the order in which you want the forms to be filled out.  For example, XXXX1 then XXXX2 then, if time permits, XXXX3.

For each submission, put all of your clarification forms in a single directory (folder) with the name indicated (e.g., NIST1).  Each clarification form inside that directory should also be a directory with the name of the submission and the topic number (e.g., NIST1_043 for topic 43 of the NIST1 submission).  Note that the topic number must be 0-filled to three digits.

Inside that directory, the main clarification form should be called index.html.  It may access any files from within your directory hierarchy, using relative pathnames.  For example, "logo.gif" would refer to the file NIST1/NIST1_043/logo.gif within the directory structure, and "../logo.gif" would refer to NIST1/logo.gif.  Do not refer to any files outside of your directory structure.  Do not refer to files with absolute path names in the URL since (1) the absolute names are not known and (2) there is no access to files on the network.
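
The naming rules above can be sketched in Python as follows.  The helper names form_dir and create_form are hypothetical, chosen only for illustration; the sketch shows the zero-filled topic number and the placement of index.html.

```python
from pathlib import Path


def form_dir(submission, topic_num):
    """Relative path of the directory holding one topic's clarification form."""
    # The topic number is zero-filled to three digits, e.g. 43 -> "043".
    return Path(submission) / f"{submission}_{topic_num:03d}"


def create_form(root, submission, topic_num, html):
    """Write html as index.html in the directory for the given topic."""
    target = Path(root) / form_dir(submission, topic_num)
    target.mkdir(parents=True, exist_ok=True)
    (target / "index.html").write_text(html, encoding="utf-8")
    return target
```

For example, form_dir("NIST1", 43) gives the relative path NIST1/NIST1_043.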

Where and how to submit

Create a tar file that contains all of the forms for a run and optionally gzip the file.  The file should be named XXXXn.tar (or XXXXn.tar.gz) where XXXX is your site code and n is the submission number (the order in which the clarification forms should be included).  The tar file should contain exactly 50 directories with names as described above. 
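
One way to build such an archive is with Python's standard tarfile module.  This is a sketch only: pack_forms is a hypothetical helper, the two-topic demonstration is for brevity (a real submission covers all 50 test topics), and it assumes the archive keeps the top-level XXXXn directory, which the guidelines do not state explicitly.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_forms(root, site_code, n, topics):
    """Pack <root>/<XXXXn>/ into <root>/<XXXXn>.tar.gz after checking its contents."""
    submission = f"{site_code}{n}"
    src = Path(root) / submission
    expected = {f"{submission}_{t:03d}" for t in topics}
    found = {p.name for p in src.iterdir() if p.is_dir()}
    if found != expected:
        raise ValueError(f"unexpected or missing directories: {sorted(found ^ expected)}")
    archive = Path(root) / f"{submission}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Assumption: keep the top-level directory name inside the archive.
        tar.add(src, arcname=submission)
    return archive


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    topics = [43, 44]  # hypothetical topic numbers for the demonstration
    for t in topics:
        d = Path(tmp) / "NIST1" / f"NIST1_{t:03d}"
        d.mkdir(parents=True)
        (d / "index.html").write_text("<html></html>", encoding="utf-8")
    archive = pack_forms(tmp, "NIST", 1, topics)
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())
```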

To submit your clarification forms, go to the TREC submissions page for HARD (or get there from the general submission page).  Be sure to select "clarification form" as the submission type.  That will cause the following questions to appear: 
  1. Did you use clustering to generate this form?
  2. Did you use text summarization, either extractive or generative?
  3. Did you use document-level feedback?  That is, did you ask the user to judge an entire document for relevance, even if you did so using a title, passage, or keywords from the document?
  4. Did you ask the user to judge selected passages of text, independent of the documents they came from?
  5. Did you ask the user to judge keywords for relevance, independent of the documents they came from?
  6. If you used any techniques not listed above, briefly list them at the bullet-list level of detail.
  7. Did you use any sources of information beyond the query and AQUAINT corpus and, if so, what were they?
This information will be used primarily by the track coordinator in generating a report on the track and will not be looked at until after all final runs have been submitted (at the earliest).

NOTE: Please see NIST's message about submissions to avoid common mistakes.  It is available as text or email.

How forms will be used

The assessors will spend no more than three minutes per form no matter how complex your form is.  The three minutes includes time needed to load the form (from local disk since there is no network access), initialize it, and do any rendering, so unusually complex or large forms will be implicitly penalized.  At the end of three minutes, if the assessor has not pressed the "submit" button, the form will be timed out and forcibly submitted (anything entered up to that point should be saved).  If your form somehow actively prevents the submission from happening at the end of three minutes, the form will be rejected, no further forms from that submission will be processed, and you will receive no clarification responses from that submission.  Note that this implies you should not have entry-validation code that prevents the submit button from being pressed.  A validation phase that asks the assessor to re-edit or "submit anyway" is acceptable, since it does not force the annotator to spend more than three minutes.

NIST recorded the time spent on the form returned for each form.  That information was returned in a separate file along with all of the clarification form responses.   Assessors were never permitted more than 180 seconds per form, but some of the reported times were greater than 180 because of the time it took for the system to "shut down" a form if the time limit expired.

Clarification forms will be presented to annotators in an order chosen to minimize the chance that one form will adversely (or positively) impact the use of another form.  Here is the rotation that was used for the submitted clarification forms (graciously generated by UNC in very little time).  The rows of the table correspond to topics and the columns to clarification forms from sites.  For example, the table indicates that NCAR's primary clarification form (NCAR1) will be the 28th considered for topic 1, the 29th for topic 2, ..., the 1st for topic 8, and so on.  Similarly, for topic 1, the assessor first did INDI1's form, then that for CASP1, then UIUC1's, followed by MEIJ1's, and so on.

Format of returned information

The results will be returned as one tar file per submission.  The file XXXXn_responses.tar will contain exactly 50 files with names of the format XXXXn_000 where XXXX is the site code, n is the submission number, and 000 represents the topic number.  You will receive one tar file per submission that was answered by the assessors.
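
When processing the returned files, the naming pattern can be split back into its parts.  The following is a sketch with a hypothetical helper name; it assumes site codes consist of four letters or digits and says nothing about the contents of the response files, which are not described here.

```python
import re

# XXXXn_000: four-character site code, submission number, three-digit topic.
RESPONSE_NAME = re.compile(r"^(?P<site>[A-Za-z0-9]{4})(?P<n>\d+)_(?P<topic>\d{3})$")


def parse_response_name(name):
    """Return (site_code, submission_number, topic_number) for a response file name."""
    match = RESPONSE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a response file name: {name!r}")
    return match["site"], int(match["n"]), int(match["topic"])
```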

The resulting tarball will be emailed to the contact person in the submission once all clarification forms are processed.


Final runs

After you have received the clarification form responses, you should re-run the queries with that extra information.  Your final runs will be new ranked lists of documents that are, we hope, better than the baseline runs.

As was the case last year, it is unlikely that sites will be able to submit more than 10 runs across baseline and final runs.  Please plan accordingly.

NIST will be able to include one final run (and one baseline run) per site in the judging pool.  Be careful to select your judging preferences properly when you submit your final run.

Where and how to submit

Use the same format for these runs as you used for the baseline runs.

To submit a run, go to the TREC submissions page for HARD (or get there from the general submission page).  Be sure to select "final run" as the submission type to get the following questions:
  1. Which of your baseline runs is an appropriate baseline?  That is, in considering percent improvement from a baseline, what is the starting point?   It is possible for the answer to this question to be "none".
  2. Which of your clarification forms was used to generate this final run?  It is plausible that a final run could ignore clarification forms, in which case the answer could be "none".  It is also plausible that a final run could integrate information from multiple clarification forms, in which case the answer will be "none".
  3. Other than the clarification form's being answered, was this an entirely automatic run or a manual run?
  4. Did you use the title, description, and/or narrative fields for this run?
  5. To what extent did you use earlier relevance judgments on the topics?
  6. A short description of the run that can be used by the track coordinator to understand what happened and describe it for the track report.  These will not be looked at until judgments have been returned.
  7. What is the preference in terms of judging of this run?  One baseline run and one final run from each site will be included in the pool.
The baseline run instructions are similar but the questions are slightly different.

NOTE: Please see NIST's message about submissions to avoid common mistakes.  It is available as text or email.


Training topics and corpus

The following data collections from TREC 2003 and 2004 are available for training.  All of this data is being made available to HARD track participants by courtesy of the Linguistic Data Consortium.  Because this is a courtesy on their part, please do not ask for copies of corpora that you already have.  The document collections will be shipped on DVD.  The topics, relevance judgments, and clarification forms will be made available by ftp.

The corpora will be provided for use only in this evaluation with the expectation that they will be destroyed at the completion of the track (i.e., after your final papers are written).  You may be able to arrange to keep the collections longer: the LDC is likely to release these as collections to its members, and you may be able to arrange longer-term rights to use the data in other ways.

If you are participating in this year's track, you were asked to indicate that information in early May.  The sites indicating active interest were provided to the LDC and they contacted those sites directly to make arrangements necessary.

TREC 2004.
TREC 2003.

Evaluation

System output will be evaluated by R-precision (precision at R documents retrieved, where R is the number of known relevant documents in the collection).   This will be used as the "official" measure for the track.
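
For reference, R-precision for a single topic can be computed as in the following sketch.  This is an illustration, not the official evaluation code; the function name is hypothetical, and returning 0.0 for a topic with no known relevant documents is an assumption, since the measure is undefined in that case.

```python
def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Precision at rank R, where R is the number of known relevant documents."""
    relevant = set(relevant_doc_ids)
    r = len(relevant)
    if r == 0:
        return 0.0  # assumption: the measure is undefined when R is 0
    # Only the top R retrieved documents count; ranks left empty because
    # fewer than R documents were returned count as not relevant.
    hits = sum(1 for doc_id in ranked_doc_ids[:r] if doc_id in relevant)
    return hits / r
```

With three known relevant documents, a ranking whose top three contain two of them scores 2/3, regardless of what appears below rank 3.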

We would also like to explore improvements due to clarification form between the baseline and the final runs.  To do that, final run submissions will be asked to indicate which baseline submission (if any) corresponds, and which clarification form (if any) corresponds.  Track summary information will show the gains over the baseline.   (The gain is not used as an "official" measure because it can too easily be gamed.)

We also hope to explore "gains per unit time" by considering the amount of improvement provided by clarification forms divided by the amount of time spent on the clarification form.  This information will be explored only for final runs that have a corresponding clarification form.
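
The exact definition of "gain per unit time" has not been fixed.  The sketch below shows one plausible reading under stated assumptions: the improvement is the absolute difference in a per-topic score such as R-precision, and time is measured in seconds spent on the clarification form.  All function names are hypothetical.

```python
def absolute_gain(baseline_score, final_score):
    """Improvement of the final run over its baseline."""
    return final_score - baseline_score


def relative_gain(baseline_score, final_score):
    """Improvement as a fraction of the baseline score."""
    if baseline_score == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (final_score - baseline_score) / baseline_score


def gain_per_second(baseline_score, final_score, seconds_on_form):
    """Absolute gain divided by the time spent on the clarification form."""
    if seconds_on_form <= 0:
        raise ValueError("time spent must be positive")
    return absolute_gain(baseline_score, final_score) / seconds_on_form
```

For example, a topic that moves from 0.20 to 0.26 after a form that took 120 seconds shows a gain of about 0.0005 per second under this reading.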


Timeline

Tentative schedule of events: