UNIVERSITY OF MASSACHUSETTS AMHERST 
DEPARTMENT OF COMPUTER SCIENCE
CENTER FOR INTELLIGENT INFORMATION RETRIEVAL (CIIR)
Document written by Alvaro Bolivar [alvarob@cs.umass.edu]

CIIR TESTING COLLECTION FOR THE TREC NOVELTY TRACK 2002

In this package you will find the files and scripts necessary to build the 
testing collection used by the CIIR in the TREC 2002 Novelty track.  The
package gathers a set of scripts that must be run in order to reproduce the
sentence segmentation of a sample of relevant documents for each topic in a 
subset of topics from the TREC ad-hoc retrieval track (topics 300-450).  
Additionally, it contains a set of relevance judgments produced by a group of 
undergraduate students, otherwise unaffiliated with our research, which are 
needed to reproduce the results we reported at the conference.  This document 
explains the package's contents, the format in which they are distributed, 
and the right way to use them.

Follow these steps...

1.  RELEVANT DOCUMENTS FOR EACH TOPIC.
Our collection is made of a subset of documents from TREC volumes 4 and 5
(Federal Register 1994, LA Times 1989-90, and FBIS 1996), and a subset of
topics from the TREC-7 and TREC-8 ad-hoc retrieval track.  

The file "all_new_topics.res" is the output of a traditional retrieval engine 
in the form of a ranked list; it contains the top 25 known-relevant 
documents (based on TREC judgments) for each of the 98 TREC topics that are 
part of our collection.  Each line in the file has the following format:
<TOPICID> Q0 <DOCID> <RANK> <SCORE> Exp

Note that not all topics have 25 documents, since some have fewer relevant 
documents; that is, only relevant documents were taken into consideration.
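For readers who want to process the ranked list programmatically, a minimal
Python sketch of a line parser might look like the following.  This helper is
not part of the package, and the example line (including the DOCID) is
fabricated purely to match the stated format:

```python
def parse_res_line(line):
    """Parse one line of a ranked list in the format:
    <TOPICID> Q0 <DOCID> <RANK> <SCORE> Exp
    The fields are assumed to be whitespace-separated."""
    topic_id, _q0, doc_id, rank, score, run_tag = line.split()
    return {
        "topic": topic_id,
        "doc": doc_id,
        "rank": int(rank),
        "score": float(score),
        "tag": run_tag,
    }

# Hypothetical example line in the stated format:
rec = parse_res_line("301 Q0 FBIS3-10082 1 12.345 Exp")
```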

2.  FROM TREC FORMAT TO SEGMENTATOR INPUT FORMAT

Before performing the sentence segmentation for the selected documents, it is 
necessary to put these documents in the proper format for the segmentation 
script.  Here, we made use of the gawk utility from UNIX.  Before running the 
script, make sure that the path to the gawk binary is set correctly.  The 
following command line will read all the files of a standard TREC 
distribution and extract the documents in the ranked list into the subdirectory 
"temp".  Each document will be saved in its own file, named according to the 
following convention:
<DTD_MARKER>_<TOPICID>_<RANK>.dat

> trecdocs2topics.awk $TRECVOL4A5/*.dat

The above command line assumes that you have the documents for TREC 
collection volumes 4 and 5 under the directory pointed to by the environment 
variable $TRECVOL4A5.  The script will automatically generate a file named 
"warning.txt", which reports possible problems in the extraction process.  
In particular, it will let you know whether all documents were found. 
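As a hypothetical illustration of the naming convention above, a small Python
helper (not part of the package) could recover the parts of an extracted
filename; it assumes the last two underscore-separated fields before ".dat"
are always the topic ID and the rank:

```python
import os

def parse_extracted_name(filename):
    """Split a filename of the form <DTD_MARKER>_<TOPICID>_<RANK>.dat
    back into its three parts.  rsplit is used so that a DTD marker
    containing underscores would still be handled correctly."""
    stem, _ext = os.path.splitext(filename)
    dtd_marker, topic_id, rank = stem.rsplit("_", 2)
    return dtd_marker, topic_id, int(rank)
```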

3.  SEGMENTATION

Once you have extracted the documents, you can proceed with the segmentation.

3.1  Abbreviations and personal nouns databases.

The segmentation script makes use of an abbreviations database and a personal 
nouns database.  The contents of these databases are provided in plain text 
(*.list files); because the indexed databases are system dependent, a Perl 
script must be run to build them.  To accomplish this, simply run the 
following script:

> indexdbs.pl 

The output will be the following DB index files (sizes shown for a server 
running SunOS 5.8):
 4096 abbrevs.dir
 2048 abbrevs.pag
 4096 pnouns.dir
16384 pnouns.pag
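For illustration only, the indexing step is roughly analogous to the following
Python sketch using the standard `dbm` module.  The package itself uses the
Perl script indexdbs.pl, and the on-disk format (e.g. the .dir/.pag pair
above) depends on which DBM library your system provides, which is exactly why
the databases must be rebuilt locally:

```python
import dbm

def index_word_list(list_path, db_path):
    """Read one term per line from a plain-text *.list file and store each
    term as a key in a DBM database, so that membership tests become
    constant-time lookups instead of scans of the text file."""
    with dbm.open(db_path, "c") as db:   # "c" creates the database if needed
        with open(list_path) as fh:
            for line in fh:
                term = line.strip()
                if term:
                    db[term.encode()] = b"1"
```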

3.2  nsgmls

The segmentation script also makes use of the SGML parser "nsgmls".  This is an 
open-source parser written by James Clark.  For more information and access 
to source/binaries, go to [http://www.jclark.com/sp/].  Some UNIX systems may 
already have it installed by default. 

3.3  Segmentation Script

The segmentation script is a modified version of a Perl script originally 
written by folks at NIST.  Before running the script, make sure that the paths 
at the beginning of the script are set to the correct values.  Then, execute:

> cd temp
> ../tagSentsForNovelty.pl t *.dat

The script will generate an output file for each input file.  Once the script 
is executed, you can use, merge, or organize the segmented documents 
(temp/*.S files) as you consider appropriate.
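The heart of any such segmenter is deciding whether a period actually ends a
sentence.  The following toy Python sketch (not the NIST script itself, and
with a tiny hard-coded set standing in for the abbreviations database) shows
the role that database plays in the decision:

```python
# Toy stand-in for the abbreviations database built in step 3.1.
ABBREVS = {"dr.", "mr.", "u.s.", "etc."}

def segment(text):
    """Split text into sentences at tokens ending in . ! or ?,
    except when the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVS:
            sentences.append(" ".join(current))
            current = []
    if current:                      # flush any trailing fragment
        sentences.append(" ".join(current))
    return sentences
```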

4.  RELEVANCE JUDGMENTS

Relevance files were produced for 48 of the 98 selected topics.  They are 
included under the "relevance" directory.  For further details about the 
format and specification of the files, and about the track in general, go to 
[http://trec.nist.gov/].

All of the above scripts and files are provided as a service to the research 
community with no warranty whatsoever.  Please report any bugs to 
[alvarob@cs.umass.edu].
