HARD, High Accuracy Retrieval from Documents
TREC 2003 clarification form guidelines

This document describes guidelines for creating, submitting, and getting responses to clarification forms.

General rules

To make this a scalable process, the following rules must be honored.  Failure to do so means that your clarification form (CF) will be either misread or ignored:
  1. The CF must display correctly on Netscape V4.78 running on Solaris 2.5.1
  2. The CF cannot be larger than can be displayed on a 16-inch monitor (an earlier draft indicated incorrectly that a 17-inch monitor was the minimum)
  3. The screen real estate you have available is 1152 x 900
  4. The CF must be an HTML Web page.  No JavaScript, no Java, no Flash: nothing but HTML.
  5. The page may not refer to external images: it must be self-contained
  6. The following types of data entry will be permitted (others are possible, but check in advance):
The assessor will spend no more than three (3) minutes filling out the form for a particular topic, meaning up to 150 minutes per site.
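
Rules 4 and 5 above can be checked mechanically before submission. The following is a minimal sketch using only the Python standard library; the list of forbidden tags and the function name check_form are assumptions made for illustration, not part of the track guidelines, and passing this check does not guarantee that a form displays correctly on the target browser.

```python
from html.parser import HTMLParser

# Tags that imply scripting, applets, or plug-ins.
# This list is an assumption for illustration and is not exhaustive.
FORBIDDEN_TAGS = {"script", "applet", "object", "embed"}


class FormChecker(HTMLParser):
    """Collects likely violations of rule 4 (HTML only) and rule 5 (no images)."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in FORBIDDEN_TAGS:
            self.problems.append(f"forbidden tag <{tag}>")
        if tag == "img":
            src = dict(attrs).get("src")
            self.problems.append(f"image reference: {src}")


def check_form(html_text):
    """Return a list of human-readable problems found in the form."""
    checker = FormChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems
```

An empty list means no problems were detected by this particular check; the form should still be viewed at 1152 x 900 to confirm that it fits.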

Some crude forms that show what this might look like and what the returned results would be are available here.  Be sure to look at the source to see some of the hidden fields.

NOTE: Information gathered in response to the clarification form will be made public once the track is complete.  For each topic, the following information will be collated across all forms: the site originating the form, the questions, options, etc. asked, and the answers.

Fields in clarification forms

Your form must include the following items:
In addition, please include somewhere on the page the topic number (e.g., "001") and the title of the topic.  The purpose of including this is to provide a sanity check that the annotators are, indeed, answering the correct questions.

Note that the annotators will have the topic description, narrative, and their metadata values available to them for reference.  You do not need to put any of that information on the clarification form (except the title and topic number, as just mentioned).

Put the forms in separate files, one per topic, with the filename XXXXn_000.html where XXXX is your site code, n is the run number, and 000 is the topic number.  Here are example clarification forms for training topics 002 and 008.
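
As an illustration of the naming convention, the sketch below builds a form filename from its parts. The function name form_filename and the validation it performs are assumptions for illustration; the guidelines specify only the XXXXn_000.html pattern.

```python
def form_filename(site_code, run_number, topic_number):
    """Build a clarification form filename of the form XXXXn_000.html."""
    # Site codes are four letters (see the site code list).
    if len(site_code) != 4 or not site_code.isalpha():
        raise ValueError(f"site code must be four letters: {site_code!r}")
    # The topic number is zero-padded to three digits.
    return f"{site_code.upper()}{run_number}_{topic_number:03d}.html"
```

For example, form_filename("UMAS", 1, 2) gives "UMAS1_002.html".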

Submitting clarification forms

Two clarification forms from each site will be filled out.  If participation and time permit, the LDC will fill out additional forms from each site.  The naming convention on forms makes clear the order in which you want the forms to be filled out (e.g., XXXX1 then XXXX2 then--if time--XXXX3 and so on).

Create a tar file that contains all of the forms for a run and optionally gzip the file.  The file should be named XXXXn.tar where XXXX is your site code and n is the run number.  The file should contain exactly 50 files named XXXXn_000.html.  Here is an example file that contains only two forms.
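
The packaging step can be sketched with Python's standard tarfile module. This is one possible approach, assuming all forms for a run sit in a single directory and follow the XXXXn_000.html naming convention; the function name build_run_archive and its parameters are hypothetical.

```python
import tarfile
from pathlib import Path


def build_run_archive(form_dir, site_code, run_number, expected_count=50):
    """Pack all forms for one run into XXXXn.tar and return the archive path."""
    prefix = f"{site_code}{run_number}"
    # Match only files that follow the XXXXn_000.html pattern.
    forms = sorted(Path(form_dir).glob(f"{prefix}_[0-9][0-9][0-9].html"))
    if len(forms) != expected_count:
        raise ValueError(f"expected {expected_count} forms, found {len(forms)}")
    archive_path = Path(form_dir) / f"{prefix}.tar"
    with tarfile.open(archive_path, "w") as tar:
        for form in forms:
            # Store bare filenames so the archive has no directory prefix.
            tar.add(form, arcname=form.name)
    return archive_path
```

Opening the archive with mode "w:gz" instead of "w" would produce a gzip-compressed file, which the guidelines allow as an option.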

Submit your forms to ftp://ftp.ldc.upenn.edu/pub/ldc/csr_incoming. You will not see any files listed, but you will have permission to upload a file from your local machine to this ftp directory.
If you cannot upload the file with a browser, you can use old-fashioned ftp:
NOTE: After you have uploaded the file, you will not be able to see that it is there.  The incoming directory is a write-only directory that does not allow you to read its contents.  Do not be surprised that it is not readable.
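
For sites that prefer to script the transfer, here is a minimal sketch using Python's standard ftplib. It assumes the server accepts anonymous login, which the guidelines do not state; the function names are hypothetical. Because the incoming directory is write-only, the sketch does not try to list the directory afterwards.

```python
from ftplib import FTP
from pathlib import Path
from urllib.parse import urlparse

UPLOAD_URL = "ftp://ftp.ldc.upenn.edu/pub/ldc/csr_incoming"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory)."""
    parts = urlparse(url)
    return parts.hostname, parts.path


def upload_archive(archive_path, url=UPLOAD_URL):
    """Upload one archive in binary mode to the incoming directory."""
    host, directory = split_ftp_url(url)
    archive = Path(archive_path)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login: an assumption, not stated in the guidelines
        ftp.cwd(directory)
        with archive.open("rb") as handle:
            # Binary transfer, so the tar file is not altered in transit.
            ftp.storbinary(f"STOR {archive.name}", handle)
```

A call such as upload_archive("UMAS1.tar") would attempt the transfer; the confirmation email described below is still required.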

When done, you must send an email message to Meghan Glenn (mlglenn@ldc.upenn.edu) and Stephanie Strassel (strassel@ldc.upenn.edu) so that LDC knows that the transfer has happened.

If you have any problems (e.g., because of firewall restrictions on your site), please check with your local networking support people, if possible, and contact the LDC only if you are unable to resolve the problem locally.

Getting results back

You'll get back a tar file that looks like this.  The archive contains files named as XXXXn_topicid.seq where XXXX is the ID for the site, n is the run number, topicid is the three-digit topic number, and seq is a suffix intended to ensure that doubly-filled forms do not lose information.  The seq value will almost always be 001, though the example tar file includes a repeated value to illustrate.
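
When processing the returned archive, the filenames can be split back into their parts. The sketch below assumes the final three-digit suffix is the seq value and that site codes are four uppercase letters; the pattern and the function name parse_result_name are illustrative, not an official specification.

```python
import re

# site code (4 letters), run number, topic number (3 digits), seq (3 digits)
RESULT_NAME = re.compile(r"^([A-Z]{4})(\d+)_(\d{3})\.(\d{3})$")


def parse_result_name(name):
    """Split a returned filename into site, run, topic, and seq."""
    match = RESULT_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected result filename: {name!r}")
    site, run, topic, seq = match.groups()
    # Topic and seq stay as strings to preserve their leading zeros.
    return {"site": site, "run": int(run), "topic": topic, "seq": seq}
```

Grouping parsed names by (site, run, topic) makes it easy to spot topics with more than one seq value, i.e., forms that were filled out twice.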

Site codes (4 letters)

   CA SU San Marcos (Rocio Guillen): CASU
   Clairvoyance (David Evans): CLAI
   Glasgow: GLAS
   Illinois Urbana-Champaign (ChengXiang Zhai): ILUC
   IIT Bombay (Ganesh Ramakrishnan): IITB
   Microsoft Research Cambridge (Steve Robertson): MRCA
   Queens College SUNY (KL Kwok): QCSU
   Rutgers (Nick Belkin): RUTG
   Open Systems and Chinese Acad of Sciences (Zeng Wu): OPEN
   Sabir (Chris Buckley): SABI
   Tsinghua Univ, Computer Science IR Group (ShaoPing Ma): TUCS
   Tsinghua Univ, Knowledge Engineering Group (Bin Wang): TUKE
   Univ Buffalo, CEDAR (Rohini Srihari): UBUF
   Univ of Helsinki (Antoine Doucet): UHEL
   Univ Maryland (Daqing He): UMAR
   UMass Amherst (James Allan): UMAS
   UNC Chapel Hill/Arctic: UNCC
   Univ Waterloo (Olga Vechtomova): UWAT