Baseline:
Index a partial TREC Terabyte collection (GX000-GX009) using Indri on chadstone.
See Results
BaseLine
Table of results
AllResults
Experiments:
Preliminary stuff:
Get Indri from www.lemurproject.org
Extract the archive, run configure, run make
Find the GOV2 collection data we sent them
Uncompress the data, and put it in a directory
(I'll call this directory $CORPUS)
It should match /usr/ind1/tmp3/indri/collections/gov2/gov2-corpus
(specifically, it should have subdirectories GX000 to GX272, and be
approximately 426GB of data)
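A quick sanity check of the layout described above could look like the sketch below. The helper name check_corpus is made up for illustration; the expected count of 273 follows from GX000 to GX272, and the sketch assumes a find that supports -mindepth and -maxdepth (GNU and BSD find do).

```shell
# check_corpus is a hypothetical helper, not part of Indri.
# It counts the GX### subdirectories directly under the corpus directory
# and succeeds only if the count matches the expected number.
check_corpus() {
    corpus=$1
    expected=${2:-273}   # GX000..GX272 is 273 directories
    # Match only directory names of the form GX followed by three digits.
    found=$(find "$corpus" -mindepth 1 -maxdepth 1 -type d \
                -name 'GX[0-9][0-9][0-9]' | wc -l | tr -d ' ')
    echo "found $found GX directories (expected $expected)"
    [ "$found" -eq "$expected" ]
}

# Usage (du can take a long time on the full collection):
#   check_corpus "$CORPUS" && du -sh "$CORPUS"
```

The size reported by du should be in the neighbourhood of the 426GB mentioned above; the exact figure depends on the filesystem.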
Building a GOV2 index:
Let $INDEX be the name of the index you want to make. (Indri will make
the directory $INDEX and put all its files inside)
indri/buildindex/buildindex -index=$INDEX -corpus.path=$CORPUS -corpus.class=trecweb -memory=2g
This command will print status to the screen. The job takes
approximately 35 hours on our Indri machines.
Building a smaller index:
Since the GOV2 index will take so long to build (and so much space to
store), it isn't likely to be the right testbed for index building
tests. I'd suggest using the first few directories of GOV2 for this
purpose. Make a parameter file like index-paramfile (bottom of this
email) and run this index job:
indri/buildindex/buildindex -index=$INDEX index-paramfile -memory=2g
Running a query job:
Get the query file from Don (50,000 queries).
The command is:
indri/runquery/runquery don-queryfile -index=$INDEX -count=20
Running the full query log takes 20 hours on our hardware.
Running a smaller query set:
Again, you may want to trim the query log down to something smaller so
you can run some simultaneous query tests. It's just an XML file, so it
shouldn't be hard to trim. Queries take about 1 second each, so
2000 or 3000 queries should be enough.
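One way to do the trimming is sketched below. It rests on assumptions the email does not confirm: that the query file has a <parameters> root holding one <query> element per query, and that no line contains more than one </query> closing tag. The helper name trim_queries and the output file name are made up for illustration.

```shell
# trim_queries is a hypothetical helper, not part of Indri.
# It copies the input until N </query> closing tags have been seen,
# then closes the (assumed) <parameters> root element and stops.
trim_queries() {
    n=$1        # number of queries to keep (1 or more)
    infile=$2   # full query file, e.g. don-queryfile
    awk -v n="$n" '
        { print }                         # copy every line through unchanged
        /<\/query>/ { seen++ }            # count completed query elements
        seen + 0 >= n + 0 {               # enough queries: close the root and stop
            print "</parameters>"
            exit
        }
    ' "$infile"
}

# Usage: trim_queries 2000 don-queryfile > don-queryfile-2000
```

If the real file puts several queries on one line, or uses a different root element, the pattern and the closing tag would need adjusting.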
cat <<EOF > index-paramfile
<parameters>
<corpus>
<path>$CORPUS/GX000</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX001</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX002</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX003</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX004</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX005</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX006</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX007</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX008</path>
<class>trecweb</class>
</corpus>
<corpus>
<path>$CORPUS/GX009</path>
<class>trecweb</class>
</corpus>
</parameters>
EOF
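The parameter file above repeats the same block ten times, so it can also be generated with a loop, which makes it easy to try a different number of directories. This is only a sketch: make_paramfile is a made-up helper name, and the output is meant to have the same content as the hand-written file.

```shell
# make_paramfile is a hypothetical helper, not part of Indri.
# It prints a parameter file listing GX000..GX<last> as trecweb corpora.
make_paramfile() {
    corpus=$1   # the uncompressed GOV2 directory ($CORPUS above)
    last=$2     # number of the last directory to include, e.g. 9 for GX009
    echo "<parameters>"
    i=0
    while [ "$i" -le "$last" ]; do
        # %03d zero-pads the directory number, so 0 becomes GX000.
        printf '<corpus>\n<path>%s/GX%03d</path>\n<class>trecweb</class>\n</corpus>\n' \
            "$corpus" "$i"
        i=$((i + 1))
    done
    echo "</parameters>"
}

# Usage: make_paramfile "$CORPUS" 9 > index-paramfile
```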
--
AndreGauthier - 20 Jul 2005