BrandonsScript (r1.1 - 08 Oct 2003 - BrandonGoldsworthy)

Description

Extracts the first n document IDs from Lemur's results file, then pulls those n documents out of a notags document collection and writes them to a separate text file so they can be browsed easily.

It takes less than a minute to extract 30 documents from the full notags collection (on my 1.2 GHz machine) and uses only about 4 MB of RAM, since it does everything on the fly. Overall, that is much faster than waiting for Notepad to search for a document number in a 100 MB file loaded into RAM.
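
The on-the-fly idea can be sketched roughly as below. This is only an illustration, not the code in getResDocs.py. The layout of the notags collection isn't described on this page, so the sketch assumes (hypothetically) that every document starts with a header line of the form `DOCNO <id>`; the function name `extract_docs` and that marker are made up for the example.

```python
def extract_docs(lines, wanted_ids, marker="DOCNO "):
    """Yield (doc_id, text) for each wanted document, reading line by line.

    ASSUMPTION: a document starts at a line beginning with `marker`
    followed by its id. The real collection format may differ.
    """
    wanted = set(wanted_ids)
    current_id = None   # id of the document being read (None before the first header)
    buffer = []         # lines of the current document, kept only if it is wanted

    for line in lines:
        if line.startswith(marker):
            # A new document begins: emit the previous one if it was wanted.
            if current_id in wanted:
                yield current_id, "".join(buffer)
            current_id = line[len(marker):].strip()
            buffer = []
        elif current_id in wanted:
            buffer.append(line)

    # Emit the last document in the stream, if it was wanted.
    if current_id in wanted:
        yield current_id, "".join(buffer)
```

Passing an open file object as `lines` means only one document's text is held in memory at a time, because file objects hand out lines lazily.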

Unfortunately, it doesn't work on the tagged collection, because Python's XML parser expects well-formed XML and not all of the crawled web pages are XML/XHTML.
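
A small sketch of the problem, using the standard library's `xml.etree.ElementTree` as a stand-in (which parser the script actually uses isn't stated here). The helper name `is_well_formed_xml` and both sample pages are made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(text):
    """Return True if `text` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# XHTML-style page: every tag is closed, so an XML parser accepts it.
xhtml_page = "<html><body><p>Closed properly.<br/></p></body></html>"

# Ordinary crawled HTML: <br> and <p> are never closed, so the parser fails.
html_page = "<html><body><p>Typical crawled HTML<br></body></html>"
```

Here `is_well_formed_xml(xhtml_page)` succeeds and `is_well_formed_xml(html_page)` does not, which is the situation the tagged collection runs into.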

Regarding Lucene

I don't know what Lucene's results output looks like, but as long as it's in columnar form, delimited by spaces, this should work for Lucene's output too, with very little modification.
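
For a columnar, space-delimited file, pulling out the first n document IDs might look something like the sketch below. It is not the code in getResDocs.py. The function name `first_doc_ids` is hypothetical, and the column position is an assumption: the default of 2 fits a TREC-style line (`query Q0 docid rank score run`), so set `column` to match whatever the actual results file looks like.

```python
def first_doc_ids(lines, count, column=2):
    """Return the first `count` document ids from space-delimited result lines.

    ASSUMPTION: the id sits in field `column` (0-based). Adjust this for
    the real layout of the results file.
    """
    ids = []
    for line in lines:
        if len(ids) >= count:
            break                      # stop reading once enough ids are collected
        fields = line.split()          # split on any run of whitespace
        if len(fields) > column:       # skip blank or short lines
            ids.append(fields[column])
    return ids
```

Adapting this to another tool's output would mostly mean changing `column`, for example `first_doc_ids(lines, 30, column=1)` for a `query docid score` layout.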

Usage

All parameters can be specified on the command line (number of documents to retrieve, results file, document file, and the output file).

getResDocs.py [count=N] [docsFile=filename] [resultsFile=filename] [outfile=filename]

  • All parameters are optional.
  • Default values will be used for missing parameters.

Example: python getResDocs.py count=60 outfile=results.txt

  • Extracts the first 60 results using the default results and docs files and writes the output to results.txt.

Defaults:

  • outfileName = "outfile.txt"
  • resultsfileName = "lemur_res\sofar_topics.res"
  • docsfileName = "govcrawl-notags.docs"
  • defaultResultCount = 30

These are easy enough to change, though.
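
Handling `key=value` arguments with fallbacks to the defaults could be sketched as follows. This is an illustration under assumptions, not the script's actual code: the names `DEFAULTS` and `parse_args` are hypothetical, and rejecting unknown keys is a choice made for the example.

```python
# Default values taken from the Defaults list; keys follow the usage line.
DEFAULTS = {
    "count": 30,
    "docsFile": "govcrawl-notags.docs",
    "resultsFile": "lemur_res\\sofar_topics.res",
    "outfile": "outfile.txt",
}

def parse_args(argv, defaults=DEFAULTS):
    """Turn ["count=60", "outfile=results.txt"] into a dict of options."""
    options = dict(defaults)                 # start from a copy of the defaults
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in options:    # ASSUMPTION: unknown keys are an error
            raise SystemExit("unrecognized argument: " + arg)
        # count is the only numeric parameter; everything else is a file name.
        options[key] = int(value) if key == "count" else value
    return options

# The example command line from the Usage section:
options = parse_args(["count=60", "outfile=results.txt"])
```

In a real script the list would come from `sys.argv[1:]`; any parameter left off the command line keeps its default.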

Download

http://goldy.wmboheroes.com/gigo/getResDocs.py

I'm interested in seeing any changes/improvements. Please share them here.

-- BrandonGoldsworthy - 08 Oct 2003

Revision r1.1 - 08 Oct 2003 - 20:04 - BrandonGoldsworthy