HARD, High Accuracy Retrieval from Documents
TREC 2003 track guidelines
Last revised August 17, 2003
This page will include or point to all of the information
necessary to participate in the HARD track of TREC 2003. Please
address questions of clarification to the HARD mailing list, hard@ciir.cs.umass.edu. You
may prefer to check the mailing list archives first
to ensure your question has not already been answered.
The creation of the HARD corpus and its evaluation are
sponsored by DARPA's TIDES program.
Topic development, interaction, and annotation will be
handled by the Linguistic Data
Consortium (LDC).
Issues/Concerns
- Finalize the passage-level evaluation
Changes to document
- August 17, added pointer to topics with metadata
- July 23, updated corpus description to make it clear that only documents
from 1999 are included.
- July 20, added pointer to clarification
form guidelines and made miscellaneous updates as described in an email
message of July 17--mostly centered around how many submissions would be
judged.
- July 18, added test_extracts shell script.
- July 16, listed additional LDC metadata; talked about what makes
a run manual or automatic.
- July 15, linked in training relevance judgments and evaluation
topics
- June 23rd, provided information about submitted results format
(tweaked on June 24th)
- June 22nd, corrected total number of documents in the corpus
- June 14th, added pointer to training topics
- June 11th, corrected count of CR and FR documents.
Corpus
The evaluation corpus will be a combination of newswire text
from the 1999 portion of the AQUAINT corpus and U.S. government documents.
The following table provides details on the make-up of the corpus
(will be updated as the information is gathered). All information
is from only 1999 because only documents from that year should be included
(ignore other years' data in the AQUAINT corpus).
| 1999      | NYT     | APW     | XIE     | CR      | FR      | Totals  |
| Documents | 137,806 | 77,876  | 104,698 | 16,609  | 35,230  | 372,219 |
| Size      | 750Mb   | 245Mb   | 310Mb   | 147Mb   | 330Mb   | 1.7Gb   |
| Months    | Jan-Dec | Jan-Nov | Jan-Dec | Jan-Dec | Jan-Dec |         |
The New York Times (NYT), Associated Press Worldstream (APW),
and Xinhua English (XIE) articles are all available on the AQUAINT
disks. Those disks are available free-of-charge to all TREC participants.
If you do not already have the disks (they have been used in other
TREC tracks), follow the instructions in Ellen Voorhees' message welcoming
you to TREC to get them. (You have to fill out a data use form.)
The AQUAINT disks are also available directly from the LDC (catalogue
number LDC2002T31) if you purchased the 2002 membership year, or to
anyone for US$180. NOTE: The documents in the AQUAINT corpus are also
available in the LDC's "gigaword" corpus, but should not be used because
they are formatted differently and formatting is critical for the passage
experiments.
The Congressional Record (CR) and Federal Register (FR) data
set has been gathered by the LDC. Particularly lengthy documents
from either source were not included because they cause serious annotation
problems. This set of data will be provided free-of-charge to
all participants in the HARD track. If you still need the data,
please send a message to the LDC's membership coordinator (ldc@ldc.upenn.edu)
requesting a copy of the HARD GovDocs Corpus, catalogue number LDC2003E15.
Once they have a signed user agreement from you, the data will be
shipped out on CD.
Topics
Topics will follow the basic TREC style, but will be more
richly annotated with metadata that describes the searcher and the
context of the query. The format of a topic will be:
<top>
<num> Number: HARD-nnn
<title>
Web-style description of topic
<desc> Description:
Sentence-length description of topic
<narr> Narrative:
Paragraph-length description of topic, intended
primarily to help future relevance assessors
<hard> item=label, value=value
<hard> item=label, value=value
<hard> item=label, value=value
...
</top>
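As an illustration only, the metadata lines in the format above could be read with a few lines of Python. This is a minimal sketch, not track software: the regular expression, the function name parse_hard_metadata, and the sample topic text are all assumptions made up for the example, and the distributed topic files may differ in spacing or quoting.

```python
import re

# Hypothetical pattern for one "<hard> item=label, value=value" line.
# The real topic files may use different spacing or quoting.
HARD_LINE = re.compile(
    r'^<hard>\s*item=\s*(?P<item>[A-Z-]+)\s*,\s*value=\s*(?P<value>.+?)\s*$'
)


def parse_hard_metadata(topic_text):
    """Collect <hard> items from one topic.

    Every item maps to a list of values, because an item such as
    RELATED-TEXT may be repeated within a topic.
    """
    metadata = {}
    for line in topic_text.splitlines():
        match = HARD_LINE.match(line.strip())
        if match:
            value = match.group("value").strip('"')
            metadata.setdefault(match.group("item"), []).append(value)
    return metadata


# Made-up topic fragment, used only to exercise the parser.
sample_topic = """<top>
<num> Number: HARD-001
<hard> item=PURPOSE, value=BACKGROUND
<hard> item=GRANULARITY, value=PASSAGE
<hard> item=RELATED-TEXT, value="first sample text"
<hard> item=RELATED-TEXT, value="second sample text"
</top>"""

print(parse_hard_metadata(sample_topic))
```

Keeping a list per item is a design choice for the sketch; it avoids silently dropping repeated RELATED-TEXT entries.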
The following metadata items will be created during topic creation:
- item=PURPOSE represents
why the user is searching for the information.
- value=BACKGROUND indicates
that the searcher wants to know where the topic came from.
- value=DETAILS means the
searcher wants to know the details of the topic.
- value=ANSWER indicates
the user is looking for an answer to a specific question. (This
value is probably tied to some of the GRANULARITY values.)
- value=ANY means that the
user has no specific purpose in mind or, at least, has not specified
one.
- item=GENRE
represents the type of material the searcher is interested in.
- value=OVERVIEW means the
searcher is interested in general news related to the topic.
- value=REACTION indicates
the searcher is looking for news commentary on the topic.
- value=I-REACTION is like
REACTION but is specifically about non-U.S. news commentary.
- value=ADMINISTRATIVE means
the searcher is interested in official US government documents.
- value=ANY indicates that
any genre is acceptable or none was indicated.
- item=FAMILIARITY represents
how familiar the searcher is with the topic. Presumably a user
who is fully aware of the details of a topic would not be interested in
background material, for example.
- value=1, no prior knowledge
- ...
- value=5, know details of
topic
- value=UNKNOWN means that
the user does not know his or her familiarity or has not specified
one.
- item=GRANULARITY captures
the amount of text that the searcher is anticipating in a response.
- value=DOCUMENT means the
searcher is expecting complete documents (one or more).
- value=PASSAGE will be selected
when the searcher expects extracts from documents that are on the paragraph
or multi-paragraph level.
- value=SENTENCE means that
the retrieved units should be roughly at the sentence level.
- value=PHRASE means that
the user is expecting a small number of words (including just one) as a response.
- value=ANY means the user
has no specific granularity in mind or did not specify one.
- item=RELATED-TEXT.
This item includes sample relevant text. It may be repeated
if there are multiple sample texts to be included.
- value="..." identifies
text that is known to be related to the topic being specified. This
provides a kind of pre-query relevance feedback. The intent is that
this text not come from the evaluation corpus.
During topic creation, the LDC will make an effort to have
topics vary across each of the indicated metadata items. This Web page
shows what the LDC's topic creation form looks like in case
that is of interest.
When the LDC created the actual evaluation topics, they also gathered
additional metadata beyond what was required by the HARD track. That
information will be provided along with the HARD metadata after the baseline
runs and may be used in any way a site likes. The items collected
were:
- OCCUPATION
- SPECIAL TRAINING
- SPECIAL INTERESTS, where the annotator can candidly explain why
he or she chose this topic
- LANGUAGES SPOKEN
- AGE
- SEX
No additional information is currently available to explain the possible
values (where not obvious).
Relevance judgments
Documents will be annotated with one of the following:
- NON-RELEVANT means that the document is known not
to be relevant to the topic. (As is common in TREC, a document
without any judgment is assumed to be non-relevant.)
- SOFT-REL means that the document is relevant to the topic
but that it does not satisfy the appropriate metadata. Given
the metadata items listed above, that means it fails to satisfy at least
one of the PURPOSE, GENRE, or FAMILIARITY items (the others are not
document-level items).
- HARD-REL means that the document is relevant and
it satisfies the appropriate metadata.
In addition, if the GRANULARITY value
is not DOCUMENT, then each judgment will
come with information that specifies which portion of the document is relevant.
The HARD track will use the same approach used by the question
answering track. Systems will be required to provide the byte offset
and length of the selected passage. The offset will be from the
"<" in the "<DOC>" tag of the original document (an offset of
zero would mean include the "<" character). The length will
indicate the number of bytes that are included. If a document contains
multiple relevant passages, the document will be listed multiple times.
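The offset convention can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the NIST script: the function name extract_passage and the sample document are hypothetical, and the document is handled as raw bytes because the offset and length are counted in bytes rather than characters.

```python
def extract_passage(doc_bytes: bytes, offset: int, length: int) -> bytes:
    """Return the passage that starts `offset` bytes into the document.

    Offset 0 is the "<" of the opening <DOC> tag, so `doc_bytes` must
    begin exactly at that tag.
    """
    if not doc_bytes.startswith(b"<DOC>"):
        raise ValueError("document must start at its <DOC> tag")
    if offset < 0 or length <= 0 or offset + length > len(doc_bytes):
        raise ValueError("passage falls outside the document")
    return doc_bytes[offset:offset + length]


# Made-up document, used only to show the arithmetic.
doc = (
    b"<DOC>\n<DOCNO> EXAMPLE-0001 </DOCNO>\n<TEXT>\n"
    b"First sentence. Second sentence.\n</TEXT>\n</DOC>\n"
)

start = doc.index(b"Second")
print(extract_passage(doc, start, len(b"Second sentence.")))
print(extract_passage(doc, 0, 5))  # the opening tag itself
```

Reading files in binary mode matters here: decoding to text first can change the byte counts and shift every offset.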
NIST has provided a perl script that extracts passages in the format
that is expected for judgments. In the active participants portion
of the TREC Web site, see the "examples" link in the QA (not HARD!)
section of the tracks page. It also includes examples of what this
will look like. You should ensure that your passage identifiers
are consistent with that information or the assessors may judge passages
that you did not intend. If you wish to use the script yourself,
please note that you must change the definition of the root directory
for each portion of the AQUAINT collection (the %root_dir hash). The
script assumes the directory structure underneath each root is identical
to how it appears on the CDs, except that each file is gunzipped.
Training data
The LDC will provide 10 training topics. The topics will
incorporate a selection of metadata values and will come with relevance
judgments. Because of another large project at the LDC, there is
some possibility that the judgments will be incomplete or completed slowly.
The topics in both baseline format and with complete metadata are available here. Relevance judgments for those
topics are available here (no passage-level
judgments were done, though the LDC hopes to provide a few by the end of
July: roughly half of the evaluation topics will have passage judgments done).
In addition, the LDC will provide some mechanism to allow sites
to validate their clarification forms. It will work something
like this:
- A site sends one clarification form for one of
the 10 training topics to the LDC.
- The LDC ensures that the form satisfies the constraints
listed below.
- The LDC fills out the form with random stuff. (We
are hoping that the LDC will be able to actually fill out the
forms, but may not be able to provide that level of training)
- The resulting metadata fields are sent back to the site
along with any comments about constraint satisfaction.
Results format (including passage identification)
Results will be returned in standard TREC format, extended
to support passage-level submissions, since it is possible that the searcher's
preferred response is the best passage (or sentence or phrase) of
relevant documents. Results will include the top 1000 documents
(or top 1000 passages) for each topic, one line per document/passage
per topic. Since there are 50 topics, the results file will include
up to 50,000 lines. You are required to submit something for each
topic, but you are not required to submit 1000 items in all cases. Each
line will have the format:
topic-id Q0 docno rank score tag psg-offset psg-length
where:
- topic-id represents the topic number from the topic
(e.g., HARD-001)
- "Q0" is a constant provided for historical reasons
- docno represents the document that is being retrieved
(or from which the passage is taken)
- rank is the rank number of the document/passage in the
list. Rank should start with 1 for the document/passage that
the system believes is most likely to be relevant and continue to 1000.
Note that for the evaluation code, score is the critical field,
not rank.
- score is a system-internal score that was assigned to
the document/passages. High values of score are assumed
to be better, so score should generally drop in value as rank
increases. WARNING: The standard TREC evaluation code sorts results
by this field and breaks ties randomly--it does not use the rank
field to break ties. Per Ellen Voorhees: "if the scores are truly
tied, then the evaluation should be able to use any ordering it pleases
of those tied scores.... If [you] want the exact ranking [you] submit evaluated,
then the score must reflect that ranking."
- tag is a unique identifier for this run by your site.
The identifier can be up to 12 letters or digits (no other characters).
It is common (but not required) for sites to use the first several
characters to identify their site and the rest to identify the specific
track and/or run within the site.
- psg-offset indicates the byte-offset in document docno
where the passage starts. A value of zero represents the
"<" in "<DOC>" at the start of the document. A value of negative
one (-1) means that no passage has been selected and the entire document
is being retrieved.
- psg-length represents how many bytes of the document
are included in the passage. A value of negative one (-1) must be
supplied when psg-offset is negative one.
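The field rules above can be checked mechanically before submission. The following Python sketch is one way to do it, offered as an assumption-laden illustration: the function names, the single-space separator, the six-decimal score formatting, and the sample document number are all hypothetical choices, not requirements stated by the track. It also treats a length of -1 with a non-negative offset as an error, which is an extra assumption.

```python
import re

# A tag is up to 12 letters or digits, with no other characters.
TAG_PATTERN = re.compile(r"^[A-Za-z0-9]{1,12}$")


def format_result_line(topic_id, docno, rank, score, tag,
                       psg_offset=-1, psg_length=-1):
    """Build one results line, rejecting values that break the stated rules."""
    if not TAG_PATTERN.match(tag):
        raise ValueError("tag must be 1 to 12 letters or digits")
    if not 1 <= rank <= 1000:
        raise ValueError("rank must be between 1 and 1000")
    if psg_offset < -1 or psg_length < -1:
        raise ValueError("offset and length may not be below -1")
    # -1/-1 means the whole document; mixing -1 with a real value is rejected.
    if (psg_offset == -1) != (psg_length == -1):
        raise ValueError("psg-offset and psg-length must both be -1 together")
    return (f"{topic_id} Q0 {docno} {rank} {score:.6f} {tag} "
            f"{psg_offset} {psg_length}")


def scores_follow_ranks(scores_in_rank_order):
    """True when scores strictly decrease, so sorting by score keeps the ranking."""
    return all(a > b for a, b in
               zip(scores_in_rank_order, scores_in_rank_order[1:]))


# Hypothetical document number and run tag.
print(format_result_line("HARD-001", "NYT19990101.0001", 1, 12.5, "examplerun1"))
print(scores_follow_ranks([3.0, 2.0, 2.0]))  # tied scores may be reordered
```

Requiring strictly decreasing scores is the cautious reading of the warning about tie-breaking: it is the only way to be sure the evaluated order matches the submitted ranks.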
Ellen Voorhees at NIST has generously provided a perl script that
you can use to ensure that the passage offset and length you are giving
is correct. The script, extract_passages, is specific to the HARD
track: it only allows documents from the HARD corpus to be specified. The
script is posted in the HARD track section of the "for active participants
only" TREC Web pages. It is also
available here.
The script takes a triple of <document-id, offset, length>
as arguments and returns the text that constitutes that passage. To use
the script, you will need to change the definition of root directory (%root_dir)
and make sure that the document ext is set up as described in the header
comment.
To verify that the offset and length you have corresponds to what NIST
has (and therefore to what you will be judged against), you can download
and run this shell script. It is a
C-shell script that you can either make executable or run as "csh test_extract.csh".
It assumes that you have set up the extract_passages.pl script as described
above. This is not an exhaustive test, but it should catch any major
mistakes.
Step 0. Topic dispersal (July 15)
The initial topics that are sent out to sites will look
precisely like classic TREC topics--i.e., title, description, and
possibly narrative only. The HARD metadata will not be
provided at this point.
The initial topics are available here.
Step 1. Baseline retrieval (due July 31)
Using the initial topics, sites will make their best effort
to provide an effective result. For most people this will be
a ranked list of documents that are believed to respond to the query.
However, a site may choose to "guess" at answers to the metadata
and return something other than a ranked list. For example, if
it was "obvious" that only a phrase is the correct response, a site might
choose to make that assumption.
Results must be in the format described in the "Results
format" section above.
Evaluation is described in more detail below. However,
note that evaluation is based on what the searcher actually
wanted. If the goal was a phrase and you provided 1000 documents
(or the other way around), your score will be lower.
Sites may submit as many baseline runs as they like. One
run will be designated the primary run and its top-ranked documents will
be guaranteed to be judged. Others may be included depending on
track participation and available time.
Types of baseline runs.
- An automatic baseline run is one that is done entirely
without human intervention. A system reads the topic files, constructs
an appropriate query, runs the query against the evaluation corpus, and
generates the result.
- A manual baseline run is one that permits human involvement
at any point in the process. That usually means that the human assists
in the process of converting the topics into queries, though it could take
other forms.
Step 2a. Clarification form (due July 31)
The purpose of this step is to allow sites to get a small
amount of additional information from the searcher. This will
be done by providing a small Web page as a form with clarification questions/checkboxes/etc
for the searcher to fill in. The LDC is arranging for the
actual assessor to fill out the information requested.
In order to make this a scalable process, the rules described
in the clarification form guidelines must
be followed.
The assessor will spend
no more than three (3) minutes filling out the form for a particular
topic, meaning up to 150 minutes per site.
Some crude forms that show what this might look like and
what the returned results would be are available here.
Be sure to look at the source to see some of the hidden fields.
Additional specifications for these forms are being developed.
NOTE: Information gathered in response to the clarification
form will be made public once the track is complete. For each
topic, the following information will be collated across all forms: the
site originating the form, the questions/options/etc asked, and the answers.
Sites may submit up to two clarification forms per topic. Depending
on track participation and the difficulty of processing forms, the LDC
will attempt to respond to additional forms as possible.
Types of runs using clarification forms.
- An automatic run is one that is done entirely without on-site
human intervention. A system reads the topic files and constructs
a clarification form. The process of going from topic to form is automatic.
Note that the form itself is filled out by a person at the LDC, but
that automatic runs are otherwise done by machine.
- A manual run is one that permits human involvement at any
point in the process. In this context that might mean that a human
would help generate the form.
Step 2b. Query metadata (available August 1)
Another option is to have the HARD metadata provided so
that it can be used directly to better process the query.
Sites may do both steps 2a and 2b, and may combine
the information from them if they like.
Topic metadata was provided on August 13. It is available here. Please
note:
- Topic 148 in the file should be ignored; it was included by accident
- The FAMILIARITY value for topic 177 was left blank and should have
a value of UNKNOWN
Step 3. Final run (due September 7)
Use all of the information acquired so far to develop a
better response for the searcher. Results must be in the
format described in the "Results format" section above.
Sites may submit as many final runs as they like. One run
will be designated the primary run and its top-ranked documents will be
guaranteed to be judged. Others may be included depending on track
participation and available time. If you submit multiple runs,
make it clear which is your preferred run.
The next section discusses evaluation. Ultimately,
though, the hope is that the results from this step are superior
to those of the baseline run in Step 1.
Types of final runs.
- An automatic final run is one that is done entirely without
on-site human intervention. A system reads the topic files, metadata,
and possible responses to clarification forms, constructs an appropriate
query, runs the query against the evaluation corpus, and generates the
result.
- A manual final run is one that permits human involvement
at any point in the process. That usually means that the human assists
in the process of converting the topics into queries, in this case probably
after the clarification form is returned.
- A run that is manual at any step along the way is manual.
Judging of submitted runs
The LDC will judge the top 100 documents from each judged run. The
LDC guarantees that one baseline run and one final run from each site will
be judged. Depending on participation and overlap across runs, the
LDC will attempt to add more runs to the judging pool.
Evaluation
First, this is an exploratory track: explore. Try out your own
evaluation measures to see what you can figure out. Toward that
end, we may need to release relevance judgments before TREC.
NIST has tentatively agreed to directly support the evaluation.
However, there are some concerns about the measures as they currently
stand. James Allan (UMass) is working with Ellen Voorhees (NIST)
to pin this down.
We will calculate four measures on every submitted run (including
the baseline run). The first two measures will be purely at the
document level. That is, they will ignore the GRANULARITY metadata item.
- SOFT-DOC is the most generous and assumes that both SOFT-REL
and HARD-REL documents are relevant. Evaluation will be by mean
average precision.
- HARD-DOC is the same, but only HARD-REL documents are
considered relevant. This measure tests whether sites are able
to leverage the metadata information. Evaluation will be by mean
average precision.
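The difference between the two document-level measures comes down to which judgments count as relevant. The Python sketch below illustrates that for a single topic, assuming ordinary uninterpolated average precision and a ranking with no repeated documents; the function names and the three-document example are made up, and this is not the official evaluation code. Mean average precision would be the mean of these values over all topics.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """Uninterpolated average precision for one topic."""
    if not relevant_docnos:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_docnos)


def soft_and_hard_ap(ranked_docnos, judgments):
    """Return (SOFT-DOC, HARD-DOC) average precision for one topic."""
    soft = {d for d, j in judgments.items() if j in ("SOFT-REL", "HARD-REL")}
    hard = {d for d, j in judgments.items() if j == "HARD-REL"}
    return (average_precision(ranked_docnos, soft),
            average_precision(ranked_docnos, hard))


# Made-up judgments and ranking for one topic.
judgments = {"d1": "HARD-REL", "d2": "SOFT-REL", "d3": "NON-RELEVANT"}
ranking = ["d2", "d3", "d1"]

soft_ap, hard_ap = soft_and_hard_ap(ranking, judgments)
print(round(soft_ap, 4), round(hard_ap, 4))  # 0.8333 0.3333
```

In the example the run looks strong under SOFT-DOC but weak under HARD-DOC, because the only document that satisfies the metadata is ranked last.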
For the next part of the evaluation, the granularity value will
be incorporated.
- SOFT-PSG will incorporate the passage-level requirements
on those topics that have GRANULARITY other
than DOCUMENT. Evaluation will be
by a modified version of mean average precision (see below).
- HARD-PSG is the most stringent evaluation, where only
indicated passages (where appropriate) of HARD-REL documents are considered
relevant. Evaluation will be by a modified version of mean average
precision (see below).
We need to settle on the passage-based measure. Some proposals:
Passage-level measure one (Steve Robertson, May 7th)
The proposal is summarized as:
- A relevant passage is "returned" if the start of
the passage is retrieved.
- Calculate a recall passage measure, RPM-1 to be the proportion
of relevant passages returned (number of relevant passages whose starts
are returned divided by the total number of relevant passages).
- Calculate a precision passage measure, PPM-1, to be the
proportion of words in all of the system-returned passages that are
contained in relevant passages (number of words in the overlap of the relevant
and system-returned passages divided by the number of words
in the system-returned passages). This is a measure of the overlap
of the two sets of passages.
- Use these measures to create recall-precision graphs and
ideally a mean average passage precision (which Steve claims is tricky).
Note that if only entire documents are relevant and only entire
documents can be retrieved, Steve's proposal almost collapses
back to "normal" recall and precision. The recall measure is
fine, but precision ends up biased in favor of long relevant documents.
It would be ideal to fix that.
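A small Python sketch may help make measure one concrete. It rests on assumptions that the proposal does not spell out: passages are given as (docno, start, end) spans of word positions with an exclusive end, and each word position is counted once even if two retrieved passages cover it. The function names and the example spans are hypothetical.

```python
def word_positions(passages):
    """Expand (docno, start, end) spans into a set of (docno, position) pairs."""
    return {(doc, pos) for doc, start, end in passages
            for pos in range(start, end)}


def measure_one(relevant, retrieved):
    """Return (RPM-1, PPM-1) for one topic."""
    ret_words = word_positions(retrieved)
    rel_words = word_positions(relevant)
    # A relevant passage counts as returned when its first word is retrieved.
    starts_found = sum(1 for doc, start, _end in relevant
                       if (doc, start) in ret_words)
    rpm1 = starts_found / len(relevant) if relevant else 0.0
    # Precision is the share of retrieved words that lie in relevant passages.
    ppm1 = len(ret_words & rel_words) / len(ret_words) if ret_words else 0.0
    return rpm1, ppm1


# Made-up spans: the second retrieved passage misses the start of its target.
relevant = [("d1", 10, 20), ("d1", 40, 50)]
retrieved = [("d1", 5, 15), ("d1", 45, 55)]

print(measure_one(relevant, retrieved))  # (0.5, 0.5)
```

The second retrieved passage overlaps a relevant passage by five words, which helps precision, but earns no recall credit because the start of that relevant passage was not retrieved.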
Passage-level measure two (Cheng Zhai, June 3rd)
The proposal extends measure one by claiming that getting any portion
of relevant passages is good, but that the system should be penalized
for providing extra (non-relevant) material. By this measure, the
system is not penalized for missing portions of the relevant passages.
- A relevant passage is "returned" if any portion of
it is retrieved.
- Calculate a recall passage measure, RPM-2, to be the proportion
of relevant passages returned. This is identical to RPM-1 in construction,
but depends upon the different idea of a successful retrieval of a relevant
passage.
- Calculate an "information gain/effort" ratio (normalized
utility?) as follows. Let RETPSG be the set of retrieved passages
and let RELPSG be the set of relevant passages. Let RELRETPSG be
the subset of RETPSG where the passage overlaps some passage in
RELPSG. Let MISSREL be the portions of passages in RELPSG that do
not overlap with any passage in RETPSG.
- NU-2 = (total words in RELRETPSG) / (total words in RETPSG
+ total words in MISSREL)
- If RELRETPSG is empty, then NU-2 = 0
- Use the RPM-2 and NU-2 measures to create a recall-normalized
utility graph. Can also create a mean average precision type of
measure.
Note that this measure has the problem that a relevant passage
can be "retrieved" more than once, since multiple passages may overlap
it, even if the retrieved passages do not overlap.
Again, note that this measure is biased in favor of how the system
does on long passages.
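Measure two can be sketched in Python along the same lines. This is a literal reading of the definitions above rather than an agreed implementation: RELRETPSG is taken to be whole retrieved passages that touch any relevant passage, retrieved passages are assumed not to overlap one another, and passages are assumed to be (docno, start, end) spans of word positions with an exclusive end. The function names and example spans are hypothetical.

```python
def overlaps(a, b):
    """True when two (docno, start, end) spans share at least one word."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def span_length(passage):
    return passage[2] - passage[1]


def word_positions(passages):
    return {(doc, pos) for doc, start, end in passages
            for pos in range(start, end)}


def measure_two(relevant, retrieved):
    """Return (RPM-2, NU-2) for one topic."""
    # A relevant passage counts as returned if any portion of it is retrieved.
    returned = [r for r in relevant if any(overlaps(r, p) for p in retrieved)]
    rpm2 = len(returned) / len(relevant) if relevant else 0.0

    # RELRETPSG: retrieved passages that overlap some relevant passage.
    rel_ret = [p for p in retrieved if any(overlaps(p, r) for r in relevant)]
    if not rel_ret:
        return rpm2, 0.0

    # MISSREL: relevant words not covered by any retrieved passage.
    missed = len(word_positions(relevant) - word_positions(retrieved))
    total_retrieved = sum(span_length(p) for p in retrieved)
    nu2 = sum(span_length(p) for p in rel_ret) / (total_retrieved + missed)
    return rpm2, nu2


# Made-up spans: one useful passage and one from an unrelated document.
relevant = [("d1", 10, 20), ("d1", 40, 50)]
retrieved = [("d1", 5, 15), ("d2", 0, 10)]

rpm2, nu2 = measure_two(relevant, retrieved)
print(rpm2, round(nu2, 4))  # 0.5 0.2857
```

In the example NU-2 is 10 / (20 + 15): ten words of retrieved passages that touch relevant text, over twenty retrieved words plus fifteen relevant words that were missed.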
Passage-level measure three (James Allan, June 8th)
A combination of ideas from both measures above:
- Let RPM-3 be the proportion of relevant text that is retrieved
by the passages (i.e., the number of retrieved words that overlap some
relevant passage divided by the number of words in all relevant passages).
- Let PPM-3 be the proportion of retrieved words that occur
in a relevant passage (i.e., the total number of retrieved words that
overlap some relevant passage divided by the number of words
in all retrieved passages).
- Use the RPM-3 and PPM-3 measures to create a recall-precision
graph and to create mean average precision type measures.
This proposal needs to deal with what occurs when portions of a
relevant passage are retrieved more than once. As written, PPM-3
would be higher if that happens. It could (and probably should) be
written such that the second occurrence of relevant text hurts precision
rather than helping it (i.e., "the total number of words from relevant
passages that occur in retrieved passages").
And this measure is also biased in favor of how the system does
on long passages.
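The following Python sketch shows one way measure three might be computed. It adopts the revision suggested above, so relevant words are counted once in the numerator while every retrieved word, including repeats, counts in the precision denominator. The span representation, with (docno, start, end) word positions and an exclusive end, the assumption that relevant passages do not overlap each other, and all names are choices made for this illustration only.

```python
def word_positions(passages):
    """Expand (docno, start, end) spans into a set of (docno, position) pairs."""
    return {(doc, pos) for doc, start, end in passages
            for pos in range(start, end)}


def measure_three(relevant, retrieved):
    """Return (RPM-3, PPM-3) for one topic."""
    rel_words = word_positions(relevant)
    # Distinct relevant words that appear in at least one retrieved passage.
    hit = len(rel_words & word_positions(retrieved))
    # Every retrieved word counts here, so repeats lower precision.
    retrieved_total = sum(end - start for _doc, start, end in retrieved)
    rpm3 = hit / len(rel_words) if rel_words else 0.0
    ppm3 = hit / retrieved_total if retrieved_total else 0.0
    return rpm3, ppm3


# Made-up spans: the same five relevant words retrieved once, then twice.
relevant = [("d1", 10, 20)]
once = [("d1", 10, 15)]
twice = [("d1", 10, 15), ("d1", 10, 15)]

print(measure_three(relevant, once))   # (0.5, 1.0)
print(measure_three(relevant, twice))  # (0.5, 0.5)
```

Retrieving the same text a second time leaves recall unchanged and halves precision, which is the behavior the suggested revision asks for.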
Passage-level measure four (James Allan, June 8th)
A variation of the above that treats each passage equally in the
averaging. Fundamentally, a system gets partial recall for getting
part of the passage, and the total score a passage can get is one.
- Let RELPSG be the set of relevant passages
- Let RETPSG be the set of retrieved passages
- Let RETPSG-GOT(x) be the number of words of passage "x" that
were retrieved by some passage in RETPSG.
- Let LEN(R) be the number of words in passage R.
- Let RPM-4 = (SUM over R in RELPSG of RETPSG-GOT(R) / LEN(R)) / |RELPSG|
- Let PPM-4 = (SUM over R in RELPSG of RETPSG-GOT(R) / LEN(R)) / |RETPSG|
By summing over the set of relevant passages, we penalize systems
for retrieving the same relevant material twice, since they don't get
to count it twice in the numerator.
Note that if passages are defined to be entire documents, then
RETPSG-GOT() yields the lengths of documents or 0, so the summation
is of zeros and ones. The result is that RPM-4 and PPM-4 collapse
back to the classic measures of recall and precision.
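Measure four is straightforward to sketch in Python. As with any illustration of a proposal still under discussion, this is only one reading: passages are assumed to be (docno, start, end) spans of word positions with an exclusive end, and the function name and example spans are hypothetical.

```python
def measure_four(relevant, retrieved):
    """Return (RPM-4, PPM-4) for one topic."""
    ret_words = {(doc, pos) for doc, start, end in retrieved
                 for pos in range(start, end)}
    credit = 0.0
    for doc, start, end in relevant:
        # RETPSG-GOT(R): words of this relevant passage that were retrieved.
        got = sum(1 for pos in range(start, end) if (doc, pos) in ret_words)
        # Each relevant passage contributes at most 1, whatever its length.
        credit += got / (end - start)
    rpm4 = credit / len(relevant) if relevant else 0.0
    ppm4 = credit / len(retrieved) if retrieved else 0.0
    return rpm4, ppm4


# Made-up spans: half of one passage, all of another, plus one miss.
relevant = [("d1", 0, 10), ("d2", 0, 4)]
retrieved = [("d1", 0, 5), ("d2", 0, 4), ("d3", 0, 8)]

print(measure_four(relevant, retrieved))  # (0.75, 0.5)
```

The short passage in d2 earns as much credit as a fully retrieved long passage would, which is how this measure avoids favoring long passages.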