Main.UtExperiments r1.2 - 27 Jul 2005 - 19:11 - JamesAllan

Baseline: index part of the TREC Terabyte collection (GX000-GX009) using Indri on chadstone.

See Results BaseLine

Table of results AllResults

Experiments:

Preliminary stuff:
Get Indri from www.lemurproject.org.
Extract the archive, run configure, then run make.

Find the GOV2 collection data we sent them.
Uncompress the data and put it in a directory (I'll call this directory $CORPUS).
It should match /usr/ind1/tmp3/indri/collections/gov2/gov2-corpus (specifically, it should have subdirectories GX000 to GX272 and be approximately 426GB of data).
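A quick way to check the layout described here is to loop over the expected subdirectory names and report any that are absent. This is only a sketch: check_corpus is a hypothetical helper name (not part of Indri), and it checks directory names only, not their contents or total size.

```shell
# Sketch: report which of GX000..GX272 are missing under a corpus directory.
# check_corpus is a hypothetical helper name, not an Indri tool.
check_corpus() {
    corpus_dir=$1
    missing=0
    i=0
    while [ "$i" -le 272 ]; do
        # Build the zero-padded subdirectory name, e.g. GX007
        sub=$(printf 'GX%03d' "$i")
        if [ ! -d "$corpus_dir/$sub" ]; then
            echo "missing: $sub"
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    echo "$missing of 273 subdirectories missing"
    # Succeed only when every subdirectory is present
    [ "$missing" -eq 0 ]
}
```

It could be run as check_corpus "$CORPUS" && du -sh "$CORPUS", comparing the reported size against the approximate 426GB figure by eye.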

Building a GOV2 index:
Let $INDEX be the name of the index you want to make. (Indri will make the directory $INDEX and put all its files inside.)
    indri/buildindex/buildindex -index=$INDEX -corpus.path=$CORPUS -corpus.class=trecweb -memory=2g
This command will print status to the screen. The job takes approximately 35 hours on our Indri machines.

Building a smaller index:
Since the GOV2 index will take so long to build (and so much space to store), it isn't likely to be the right testbed for index building tests. I'd suggest using the first few directories of GOV2 for this purpose. Make a parameter file like index-paramfile (bottom of this email) and run this index job:
    indri/buildindex/buildindex -index=$INDEX index-paramfile -memory=2g

Running a query job:
Get the query file from Don (50000 queries). The command is:
    indri/runquery/runquery don-queryfile -index=$INDEX -count=20
Running the full query log takes 20 hours on our hardware.

Running a smaller query set:
Again, you may want to trim down the query log to something smaller so you can run some simultaneous query tests. It's just an XML file, so it shouldn't be hard to trim down. Queries take about 1 second each (20 hours for 50000 queries works out to roughly 1.4 seconds per query), so 2000 or 3000 queries should be enough.
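One way the trimming might be done is sketched here. It assumes, without having seen Don's file, that each query sits on a single line containing a <query> tag and that all other lines (such as the enclosing root element) should be kept as they are. trim_queries is a hypothetical helper name; if the real file spreads a query over several lines, this approach would need adjusting.

```shell
# Sketch: keep only the first N queries of an XML query file.
# ASSUMPTION: one query per line, each such line containing "<query>".
# trim_queries is a hypothetical helper name.
trim_queries() {
    # $1 = input query file, $2 = number of queries to keep
    awk -v n="$2" '
        /<query>/ { seen++; if (seen > n) next }   # drop queries past the first n
        { print }                                  # keep everything else unchanged
    ' "$1"
}
```

For example, trim_queries don-queryfile 2000 > small-queryfile would write a 2000-query file that can be passed to runquery in place of don-queryfile.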


cat <<EOF > index-paramfile
<parameters>
   <corpus>
      <path>$CORPUS/GX000</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX001</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX002</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX003</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX004</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX005</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX006</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX007</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX008</path>
      <class>trecweb</class>
   </corpus>
   <corpus>
      <path>$CORPUS/GX009</path>
      <class>trecweb</class>
   </corpus>
</parameters>
EOF
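Because the ten corpus stanzas differ only in the directory number, the same file could also be generated with a loop, which makes it easier to vary how many GOV2 directories go into a test index. The sketch below copies the XML layout of the parameter file above; make_paramfile is a hypothetical helper name, not an Indri tool.

```shell
# Sketch: print an Indri parameter file covering GX000..GX<last>.
# make_paramfile is a hypothetical helper; the layout mirrors index-paramfile.
make_paramfile() {
    corpus_dir=$1   # corpus root, e.g. "$CORPUS"
    last=$2         # number of the last subdirectory to include, e.g. 9
    echo "<parameters>"
    i=0
    while [ "$i" -le "$last" ]; do
        printf '   <corpus>\n'
        printf '      <path>%s/GX%03d</path>\n' "$corpus_dir" "$i"
        printf '      <class>trecweb</class>\n'
        printf '   </corpus>\n'
        i=$((i + 1))
    done
    echo "</parameters>"
}
```

Running make_paramfile "$CORPUS" 9 > index-paramfile should produce a file equivalent to the one written by the here-document.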

-- AndreGauthier - 20 Jul 2005

Copyright © 1999-2008 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.