HARD, High Accuracy Retrieval from Documents
TREC 2003 track guidelines

Last revised August 17, 2003


This page will include or point to all of the information necessary to participate in the HARD track of TREC 2003.  Please address questions of clarification to the HARD mailing list, hard@ciir.cs.umass.edu.  You may prefer to check the mailing list archives first to ensure your question has not already been answered.

The creation of the HARD corpus and its evaluation are sponsored by DARPA's TIDES program.
Topic development, interaction, and annotation will be handled by the Linguistic Data Consortium (LDC).

Issues/Concerns

  1. Finalize the passage-level evaluation

Changes to document

Corpus

The evaluation corpus will be a combination of newswire text from the 1999 portion of the AQUAINT corpus and U.S. government documents.  The following table provides details on the make-up of the corpus (it will be updated as the information is gathered).  All figures cover 1999 only, because only documents from that year should be included (ignore other years' data in the AQUAINT corpus).


Source   Documents   Size    Months (1999)
NYT      137,806     750Mb   Jan-Dec
APW       77,876     245Mb   Jan-Nov
XIE      104,698     310Mb   Jan-Dec
CR        16,609     147Mb   Jan-Dec
FR        35,230     330Mb   Jan-Dec
Totals   372,219     1.7Gb
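
Because only 1999 documents are in scope, one simple way to drop other years is to filter on the document identifier. The following is a minimal sketch, not part of the track tooling. It assumes that AQUAINT identifiers consist of a source prefix followed by an eight-digit date, a dot, and a sequence number (for example NYT19990101.0001); that pattern and the name in_hard_corpus_year are assumptions for illustration, so check them against the identifiers on your disks. The CR and FR documents are not covered by this filter.

```python
import re

# Assumed AQUAINT DOCNO shape: source prefix, yyyymmdd date, dot, sequence number.
DOCNO_PATTERN = re.compile(r"^(NYT|APW|XIE)(\d{4})(\d{2})(\d{2})\.(\d+)$")


def in_hard_corpus_year(docno: str, year: int = 1999) -> bool:
    """Return True if an AQUAINT-style identifier is dated in the given year."""
    match = DOCNO_PATTERN.match(docno.strip())
    if match is None:
        # Not a newswire identifier under the assumed pattern.
        return False
    return int(match.group(2)) == year
```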

The New York Times (NYT), Associated Press Worldstream (APW), and Xinhua English (XIE) articles are all available on the AQUAINT disks.  Those disks are available free of charge to all TREC participants.  If you do not already have the disks (they have been used in other TREC tracks), follow the instructions in Ellen Voorhees' message welcoming you to TREC to get them.  (You have to fill out a data use form.)  The AQUAINT disks are also available directly from the LDC (catalogue number LDC2002T31) to sites that purchased the 2002 membership year, or to anyone for US$180.  NOTE: The documents in the AQUAINT corpus are also available in the LDC's "gigaword" corpus, but that version should not be used because it is formatted differently and formatting is critical for the passage experiments.

The Congressional Record (CR) and Federal Register (FR) data set has been gathered by the LDC.  Particularly lengthy documents from either source were not included because they cause serious annotation problems.  This set of data will be provided free-of-charge to all participants in the HARD track.  If you still need the data, please send a message to the LDC's membership coordinator (ldc@ldc.upenn.edu) requesting a copy of the HARD GovDocs Corpus, catalogue number LDC2003E15.  Once they have a signed user agreement from you, the data will be shipped out on CD.

Topics

Topics will follow the basic TREC style, but will be more richly annotated with metadata that describes the searcher and the context of the query.  The format of a topic will be:
<top>
<num> Number: HARD-nnn
<title> Web-style description of topic
<desc> Description: Sentence-length description of topic
<narr> Narrative: Paragraph-length description of topic, intended primarily to help future relevance assessors
<hard> item=label, value=value
<hard> item=label, value=value
<hard> item=label, value=value
...
</top>  
The following metadata items will be created during topic creation:
  1. item=PURPOSE represents why the user is searching for the information.
  2. item=GENRE represents the type of material the searcher is interested in.
  3. item=FAMILIARITY  represents how familiar the searcher is with the topic.  Presumably a user who is fully aware of the details of a topic would not be interested in background material, for example.
  4. item=GRANULARITY captures the amount of text that the searcher expects in a response.
  5. item=RELATED-TEXT.  This item includes sample relevant text.  It may be repeated if there are multiple sample texts to be included.
During topic creation, the LDC will make an effort to have topics vary across each of the indicated metadata items. This Web page shows what the LDC's topic creation form looks like, in case that is of interest.
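
The topic layout above can be read with a small parser. The following is a minimal sketch that assumes the released topic files follow the template literally: unclosed field tags inside a <top> block, and <hard> lines of the form item=label, value=value. The function name parse_topics, the sample topic, and its metadata values are hypothetical and serve only as illustration.

```python
import re
from collections import defaultdict

TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
FIELD_RE = re.compile(r"<(num|title|desc|narr|hard)>")
HARD_RE = re.compile(r"item=\s*(\S+?)\s*,\s*value=\s*(.*)")
PREFIX_RE = re.compile(r"^(Number|Description|Narrative):\s*")


def parse_topics(text):
    """Parse <top> blocks into dictionaries; 'hard' maps item -> list of values."""
    topics = []
    for block in TOP_RE.findall(text):
        # Splitting on a capturing group yields [preamble, tag, body, tag, body, ...].
        parts = FIELD_RE.split(block)
        topic = {"hard": defaultdict(list)}
        for tag, body in zip(parts[1::2], parts[2::2]):
            body = " ".join(body.split())  # collapse line breaks and extra spaces
            if tag == "hard":
                match = HARD_RE.match(body)
                if match:
                    # Items such as RELATED-TEXT may repeat, so keep a list.
                    topic["hard"][match.group(1)].append(match.group(2))
            else:
                topic[tag] = PREFIX_RE.sub("", body)
        topics.append(topic)
    return topics


# Hypothetical topic used only to exercise the parser.
SAMPLE = """<top>
<num> Number: HARD-001
<title> Sample title
<desc> Description: One sentence.
<narr> Narrative: A short paragraph.
<hard> item=PURPOSE, value=Background
<hard> item=RELATED-TEXT, value=first sample
<hard> item=RELATED-TEXT, value=second sample
</top>"""

if __name__ == "__main__":
    print(parse_topics(SAMPLE)[0]["hard"]["RELATED-TEXT"])
```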

When the LDC created the actual evaluation topics, they also gathered additional metadata beyond what was required by the HARD track.  That information will be provided along with the HARD metadata after the baseline runs and may be used in any way a site likes.  The items collected were:
No additional information is currently available to explain the possible values (where not obvious).

Relevance judgments

Documents will be annotated with one of the following:
In addition, if the GRANULARITY value is not DOCUMENT, then each judgment will come with information that specifies which portion of the documents is relevant.  The HARD track will use the same approach used by the question answering track.  Systems will be required to provide the byte offset and length of the selected passage.  The offset will be from the "<" in the "<DOC>" tag of the original document (an offset of zero would mean include the "<" character).  The length will indicate the number of bytes that are included.  If a document contains multiple relevant passages, the document will be listed multiple times.
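
The offset convention described above can be illustrated with a few lines of code. This is a minimal sketch, not a replacement for the NIST script: it assumes the file is already uncompressed and in memory, that each document is delimited by <DOC> and </DOC>, and that its identifier sits in a <DOCNO> element. The function name extract_passage and the bounds check against the closing tag are assumptions.

```python
import re

DOC_RE = re.compile(rb"<DOC>.*?</DOC>", re.DOTALL)
DOCNO_RE = re.compile(rb"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def extract_passage(file_bytes: bytes, docno: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting `offset` bytes after the '<' of <DOC>."""
    for doc in DOC_RE.finditer(file_bytes):
        found = DOCNO_RE.search(doc.group(0))
        if found is None or found.group(1).decode("ascii", "replace") != docno:
            continue
        # Offset zero is the '<' character of the <DOC> tag itself.
        start = doc.start() + offset
        end = start + length
        if offset < 0 or length <= 0 or end > doc.end():
            raise ValueError("passage falls outside the document")
        return file_bytes[start:end]
    raise KeyError(docno)
```

Working in bytes rather than decoded characters matters here, since the judgments are expressed as byte offsets and lengths.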

NIST has provided a perl script that extracts passages in the format that is expected for judgments.  In the active participants portion of the TREC Web site, see the "examples" link in the QA (not HARD!) section of the tracks page.  It also includes examples of what this will look like.  You should ensure that your passage identifiers are consistent with that information, or the assessors may judge passages that you did not intend.  If you wish to use the script yourself, please note that you must change the definition of the root directory for each portion of the AQUAINT collection (the %root_dir hash).  The script assumes the directory structure underneath each root is identical to how it appears on the CDs, except that each file is gunzipped.

Training data

The LDC will provide 10 training topics.  The topics will incorporate a selection of metadata values and will come with relevance judgments.  Because of another large project at the LDC, there is some possibility that the judgments will be incomplete or completed slowly.  The topics in both baseline format and with complete metadata are available here.  Relevance judgments for those topics are available here (no passage-level judgments were done, though the LDC hopes to provide a few by the end of July: roughly half of the evaluation topics will have passage judgments done).

In addition, the LDC will provide some mechanism to allow sites to validate their clarification forms.  It will work something like this:
  1. A site sends one clarification form for one of the 10 training topics to the LDC.
  2. The LDC ensures that the form satisfies the constraints listed below.
  3. The LDC fills out the form with random stuff.  (We hope that the LDC will be able to fill out the forms properly, but it may not be able to provide that level of training.)
  4. The resulting metadata fields are sent back to the site along with any comments about constraint satisfaction.

Results format (including passage identification)

Results will be returned in standard TREC format, extended to support passage-level submissions, since it is possible that the searcher's preferred response is the best passage (or sentence or phrase) of relevant documents.  Results will include the top 1000 documents (or top 1000 passages) for each topic, one line per document/passage per topic.  Since there are 50 topics, the results file will include up to 50,000 lines.  You are required to submit something for each topic, but you are not required to submit 1000 items in all cases.  Each line will have the format:

topic-id Q0 docno rank score tag psg-offset psg-length
where:
Ellen Voorhees at NIST has generously provided a perl script that you can use to ensure that the passage offset and length you are giving are correct.  The script, extract_passages, is specific to the HARD track: it only allows documents from the HARD corpus to be specified.  The script is posted in the HARD track section of the "for active participants only" TREC Web pages.  It is also available here.

The script takes a triple of <document-id, offset, length> as arguments and returns the text that constitutes that passage. To use the script, you will need to change the definition of the root directory (%root_dir) and make sure that the document ext is set up as described in the header comment.

To verify that the offset and length you have correspond to what NIST has (and therefore to what you will be judged against), you can download and run this shell script.  It is a C-shell script that you can either make executable or run as "csh test_extract.csh".  It assumes that you have set up the extract_passages.pl script as described above.  This is not an exhaustive test, but it should catch any major mistakes.
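
The line format and the size limits described in this section can be sketched as a small formatter and checker. This is a minimal sketch under stated assumptions: every line carries all eight fields, the second field is the literal Q0, and the limit is 1000 lines per topic. The function names, the four-decimal score formatting, and the example topic identifier and run tag are hypothetical. What to put in psg-offset and psg-length for whole-document results is not covered here.

```python
from collections import Counter

MAX_PER_TOPIC = 1000  # from the guidelines: top 1000 documents or passages per topic


def format_result_line(topic_id, docno, rank, score, tag, psg_offset, psg_length):
    """Build one line: topic-id Q0 docno rank score tag psg-offset psg-length."""
    return f"{topic_id} Q0 {docno} {rank} {score:.4f} {tag} {psg_offset} {psg_length}"


def check_results(lines, topic_ids):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    per_topic = Counter()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 8:
            problems.append(f"line {number}: expected 8 fields, found {len(fields)}")
            continue
        if fields[1] != "Q0":
            problems.append(f"line {number}: second field should be Q0")
        per_topic[fields[0]] += 1
    for topic_id in topic_ids:
        if per_topic[topic_id] == 0:
            # Something must be submitted for every topic.
            problems.append(f"topic {topic_id}: no results submitted")
        elif per_topic[topic_id] > MAX_PER_TOPIC:
            problems.append(f"topic {topic_id}: more than {MAX_PER_TOPIC} results")
    return problems
```

For example, check_results(lines, topic_ids) could be run over a results file just before submission, with topic_ids taken from the distributed topics.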

Step 0. Topic dispersal (July 15)

The initial topics that are sent out to sites will look precisely like classic TREC topics--i.e., title, description, and possibly narrative only.  The HARD metadata will not be provided at this point.

The initial topics are available here.

Step 1. Baseline retrieval (due July 31)

Using the initial topics, sites will make their best effort to provide an effective result.  For most people this will be a ranked list of documents that are believed to respond to the query.  However, a site may choose to "guess" at answers to the metadata and return something other than a ranked list.  For example, if it was "obvious" that only a phrase is the correct response, a site might choose to make that assumption.

Results must be in the format described in the "Results format" section above.

Evaluation is described in more detail below.  However, note that evaluation is based on what the searcher actually wanted.  If the goal was a phrase and you provided 1000 documents (or the other way around), your score will be lower.

Sites may submit as many baseline runs as they like.  One run will be designated the primary run and its top-ranked documents will be guaranteed to be judged.  Others may be included depending on track participation and available time.

Types of baseline runs.

Step 2a. Clarification form (due July 31)

The purpose of this step is to allow sites to get a small amount of additional information from the searcher.  This will be done by providing a small Web page as a form with clarification questions/checkboxes/etc for the searcher to fill in.  The LDC is arranging so that the actual assessor will fill out the information requested.

To keep this process scalable, the rules described in the clarification form guidelines must be followed. The assessor will spend no more than three (3) minutes filling out the form for a particular topic, meaning up to 150 minutes per site.

Some crude forms that show what this might look like and what the returned results would be are available here.  Be sure to look at the source to see some of the hidden fields.  Additional specifications for these forms are being developed.

NOTE: Information gathered in response to the clarification form will be made public once the track is complete.  For each topic, the following information will be collated across all forms: the site originating the form, the questions/options/etc asked, and the answers.

Sites may submit up to two clarification forms per topic.  Depending on track participation and the difficulty of processing forms, the LDC will attempt to respond to additional forms as possible.

Types of runs using clarification forms.

Step 2b. Query metadata (available August 1)

Another option is to have the HARD metadata provided so that it can be used directly to better process the query.

Sites may do both steps 2a and 2b, and may combine the information from them if they like.

Topic metadata was provided on August 13.   It is available here.  Please note:

Step 3. Final run (due September 7)

Use all of the information acquired so far to develop a better response for the searcher.  Results must be in the format described in the "Results format" section above.

Sites may submit as many final runs as they like.  One run will be designated the primary run and its top-ranked documents will be guaranteed to be judged.  Others may be included depending on track participation and available time.  If you submit multiple runs, make it clear which is your preferred run.

The next section discusses evaluation.  Ultimately, though, the hope is that the results from this step are superior to those of the baseline run in Step 1.

Types of final runs.

Judging of submitted runs

The LDC will judge the top 100 documents from each run judged.  The LDC guarantees that one baseline run and one final run from each site will be judged.  Depending on participation and overlap across runs, the LDC will attempt to add more runs to the judging pool.
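
The effect of overlap across runs on the size of the judging pool can be explored with a short script. This is a minimal sketch that assumes the pool for a topic is the union of the top 100 documents from each judged run; the guidelines do not spell out the LDC's actual procedure, and the function name build_pool and the input layout are hypothetical.

```python
def build_pool(runs, depth=100):
    """Union the top `depth` documents of every run, topic by topic.

    `runs` maps a run name to {topic_id: [docno, ...]} ranked best first.
    Returns {topic_id: set of docnos to judge}.
    """
    pool = {}
    for ranking_by_topic in runs.values():
        for topic_id, ranked in ranking_by_topic.items():
            pool.setdefault(topic_id, set()).update(ranked[:depth])
    return pool
```

The more two runs overlap in their top-ranked documents, the fewer distinct documents the pool holds, which is why additional runs may fit into the judging effort.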

Evaluation

First, this is an exploratory track: explore.  Try out your own evaluation measures to see what you can figure out.  Toward that end, we may need to release relevance judgments before TREC.

NIST has tentatively agreed to directly support the evaluation. However, there are some concerns about the measures as they currently stand.  James Allan (UMass) is working with Ellen Voorhees (NIST) to pin this down.

We will calculate four measures on every submitted run (including the baseline run).  The first two measures will be purely at the document level.  That is, they will ignore the GRANULARITY metadata item.
  1. SOFT-DOC is the most generous and assumes that both SOFT-REL and HARD-REL documents are relevant.  Evaluation will be by mean average precision.
  2. HARD-DOC is the same, but only HARD-REL documents are considered relevant.  This measure tests whether sites are able to leverage the metadata information.  Evaluation will be by mean average precision.
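
The two document-level measures differ only in which judgment labels count as relevant, which can be sketched as follows. This is a minimal illustration rather than the official scoring code. It assumes that each document appears at most once per ranking and that topics with no relevant documents under the chosen criterion are skipped when averaging; the function names and input layout are hypothetical.

```python
def average_precision(ranked, relevant):
    """Uninterpolated average precision of one ranking against a relevant set."""
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, judgments, relevant_labels):
    """Average AP over topics; `judgments` maps topic -> {docno: label}."""
    scores = []
    for topic_id, labels in judgments.items():
        relevant = {d for d, label in labels.items() if label in relevant_labels}
        if not relevant:
            continue  # assumption: topics with nothing relevant are skipped
        scores.append(average_precision(rankings.get(topic_id, []), relevant))
    return sum(scores) / len(scores) if scores else 0.0


def soft_doc(rankings, judgments):
    return mean_average_precision(rankings, judgments, {"SOFT-REL", "HARD-REL"})


def hard_doc(rankings, judgments):
    return mean_average_precision(rankings, judgments, {"HARD-REL"})
```

With a ranking d1, d2, d3, d4 in which d1 is SOFT-REL and d2 and d4 are HARD-REL, SOFT-DOC averages the precisions 1/1, 2/2, and 3/4, while HARD-DOC averages 1/2 and 2/4, so the same run scores lower under the stricter criterion.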
For the next part of the evaluation, the granularity value will be incorporated.
  1. SOFT-PSG will incorporate the passage-level requirements on those topics that have GRANULARITY other than DOCUMENT.  Evaluation will be by a modified version of mean average precision (see below).
  2. HARD-PSG is the most stringent evaluation, where only indicated passages (where appropriate) of HARD-REL documents are considered relevant.  Evaluation will be by a modified version of mean average precision (see below).
We need to settle on the passage-based measure. Some proposals:

Passage-level measure one (Steve Robertson, May 7th)

The proposal is summarized as:
Note that if only entire documents are relevant and only entire documents can be retrieved, Steve's proposal almost collapses back to "normal" recall and precision.  The recall measure is fine, but precision ends up biased in favor of long relevant documents.  It would be ideal to fix that.

Passage-level measure two (Cheng Zhai, June 3rd)

The proposal extends measure one by claiming that getting any portion of relevant passages is good, but that the system should be penalized for providing extra (non-relevant) material.  By this measure, the system is not penalized for missing portions of the relevant passages.
Note that this measure has the problem that a relevant passage can be "retrieved" more than once, since multiple passages may overlap it, even if the retrieved passages do not overlap.

Again, note that this measure is biased in favor of how the system does on long passages.

Passage-level measure three (James Allan, June 8th)

This measure combines elements of both ideas above:
This proposal needs to deal with what occurs when portions of a relevant passage are retrieved more than once.  As written, PPM-3 would be higher if that happens.  It could (and probably should) be written such that the second occurrence of relevant text hurts precision rather than helping it (i.e., "the total number of words from relevant passages that occur in retrieved passages").

And this measure is also biased in favor of how the system does on long passages.

Passage-level measure four (James Allan, June 8th)

A variation of the above that treats each passage equally in the averaging.  Fundamentally, a system gets partial recall for getting part of the passage, and the total score a passage can get is one.
By summing over the set of relevant passages, we penalize systems for retrieving the same relevant material twice, since they don't get to count it twice in the numerator.

Note that if passages are defined to be entire documents, then RETPSG-GOT() yields the lengths of documents or 0, so the summation is of zeros and ones.  The result is that RPM-4 and PPM-4 collapse back to the classic measures of recall and precision.
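
The recall idea behind measure four, as described in prose above, can be sketched in code. This is only an illustration of that description and not the official definition of RPM-4: it assumes passages are (docno, offset, length) triples measured in bytes, that a relevant passage earns the fraction of its bytes covered by the union of retrieved passages, and that these fractions are averaged over relevant passages. The function names are hypothetical, and the precision side is not sketched here.

```python
def covered_bytes(relevant, retrieved):
    """Bytes of one relevant passage covered by the union of retrieved passages."""
    docno, start, length = relevant
    end = start + length
    # Clip each overlapping retrieved passage to the relevant span.
    spans = sorted(
        (max(start, r_off), min(end, r_off + r_len))
        for r_doc, r_off, r_len in retrieved
        if r_doc == docno and r_off < end and r_off + r_len > start
    )
    covered = 0
    cursor = start  # everything before the cursor has already been counted
    for span_start, span_end in spans:
        span_start = max(span_start, cursor)
        if span_end > span_start:
            covered += span_end - span_start
            cursor = span_end
    return covered


def passage_recall(relevant_passages, retrieved_passages):
    """Average, over relevant passages, of the fraction of each that was retrieved."""
    if not relevant_passages:
        return 0.0
    total = sum(
        covered_bytes(passage, retrieved_passages) / passage[2]
        for passage in relevant_passages
    )
    return total / len(relevant_passages)
```

Because coverage is computed on the union of retrieved spans, retrieving the same relevant bytes twice adds nothing, and when every passage is an entire document each term is zero or one, so the value reduces to ordinary recall.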