CS646.BrandonsScript r1.1 - 08 Oct 2003 - 20:04 - BrandonGoldsworthy


Description

Reads the first n document IDs from Lemur's results file, then extracts those n documents from a notags document collection and writes them to a separate text file so they can be browsed easily.

It takes less than a minute to extract 30 documents from the full notags collection (on my 1.2 GHz machine) and uses only about 4 MB of RAM, since it does everything on the fly. Overall, that is much faster than waiting for Notepad to search for a document number in a 100 MB file loaded into RAM.
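The on-the-fly approach can be sketched roughly as below. This is not the script's actual code, just a minimal illustration of streaming a large collection line by line so that only one document is held in memory at a time. It assumes (hypothetically) that each document in the collection carries its id on a line of the form `<DOCNO> id </DOCNO>` and ends with a `</DOC>` line; the real notags file may be laid out differently, and the function name `extract_docs` is made up for this example.

```python
import io

def extract_docs(collection, wanted_ids, out):
    """Stream `collection` line by line, copying wanted documents to `out`.

    Assumed layout (hypothetical): the id sits on a '<DOCNO> id </DOCNO>'
    line and each document ends with a '</DOC>' line.
    Returns the number of documents written.
    """
    wanted = set(wanted_ids)
    found = 0
    buffer = []     # lines of the document currently being read
    keep = False    # True once the current document's id is in `wanted`
    for line in collection:
        buffer.append(line)
        stripped = line.strip()
        if stripped.startswith("<DOCNO>") and stripped.endswith("</DOCNO>"):
            doc_id = stripped[len("<DOCNO>"):-len("</DOCNO>")].strip()
            keep = doc_id in wanted
        elif stripped == "</DOC>":
            if keep:
                out.writelines(buffer)
                found += 1
            buffer = []     # drop the finished document before reading the next
            keep = False
            if found == len(wanted):
                break       # stop early once every wanted document is written
    return found

# Small in-memory stand-in for the collection file.
sample = (
    "<DOC>\n<DOCNO> G01 </DOCNO>\nfirst text\n</DOC>\n"
    "<DOC>\n<DOCNO> G02 </DOCNO>\nsecond text\n</DOC>\n"
)
out = io.StringIO()
written = extract_docs(io.StringIO(sample), ["G02"], out)
```

With real files, `collection` and `out` would be file objects from `open(...)`. Because the loop never keeps more than one document in `buffer`, memory use stays small no matter how large the collection is.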

Unfortunately, it doesn't work on the tagged collection, because Python's XML parser expects a well-formed XML page and not all of the crawled web pages are XML/XHTML.
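To illustrate the problem, here is a small sketch using `xml.etree.ElementTree` from the current standard library, which is not necessarily the parser the script itself uses. A strict XML parser accepts XHTML-style markup but rejects ordinary HTML with unclosed tags. The helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

xhtml_ok = is_well_formed("<p>one<br/>two</p>")   # self-closed tag, valid XML
html_bad = is_well_formed("<p>one<br>two</p>")    # unclosed <br>, not valid XML
```

Any crawled page that looks like the second case would make a strict XML parser stop with an error.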

Regarding Lucene

I don't know what Lucene's results output looks like, but as long as it's in columnar form, delimited by spaces, this should work for Lucene's output too, with very little modification.
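A minimal sketch of why columnar, space-delimited output is easy to adapt to: only the column index holding the document id needs to change. The function name `first_doc_ids` and the default column (index 2, as in a TREC-style line `query Q0 docid rank score run`) are assumptions for illustration, not something taken from the script or from either tool's documentation.

```python
def first_doc_ids(lines, count, column=2):
    """Return up to `count` document ids from whitespace-delimited lines.

    `column` is the zero-based index of the id field (assumed here;
    adjust it to match the actual results format).
    """
    ids = []
    for line in lines:
        if len(ids) >= count:
            break
        fields = line.split()
        if len(fields) <= column:
            continue    # skip blank or too-short lines
        ids.append(fields[column])
    return ids

results = [
    "1 Q0 G01 1 0.9 run",
    "",
    "1 Q0 G02 2 0.8 run",
    "1 Q0 G03 3 0.7 run",
]
top_two = first_doc_ids(results, 2)
```

In practice `lines` can be an open results file, so the results are read lazily and reading stops as soon as `count` ids have been collected.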

Usage

All parameters can be specified on the command line (number of documents to retrieve, results file, document file, and the output file).

getResDocs.py [count=N] [docsFile=filename] [resultsFile=filename] [outfile=filename]

  • All parameters are optional.
  • Default values will be used for missing parameters.

Example: python getResDocs.py count=60 outfile=results.txt

  • Extracts the first 60 results using the default results and docs files, and writes the output to the file results.txt.

Defaults:

  • outfileName = "outfile.txt"
  • resultsfileName = "lemur_res\sofar_topics.res"
  • docsfileName = "govcrawl-notags.docs"
  • defaultResultCount = 30

These are easy enough to change, though.
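For reference, `key=value` arguments with fallback defaults can be handled along the lines below. This is only a sketch of the idea, not the script's own parsing code: the `DEFAULTS` dictionary and `parse_args` function are hypothetical names, with keys taken from the usage line and values from the defaults listed above.

```python
# Keys follow the usage line; values follow the documented defaults.
DEFAULTS = {
    "count": "30",
    "docsFile": "govcrawl-notags.docs",
    "resultsFile": r"lemur_res\sofar_topics.res",
    "outfile": "outfile.txt",
}

def parse_args(argv):
    """Turn ['count=60', 'outfile=x.txt'] into an options dict."""
    options = dict(DEFAULTS)    # start from the defaults, then override
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in options:
            raise ValueError("unrecognized argument: " + arg)
        options[key] = value
    options["count"] = int(options["count"])
    return options

# Equivalent of: python getResDocs.py count=60 outfile=results.txt
opts = parse_args(["count=60", "outfile=results.txt"])
```

In a real script the list passed in would be `sys.argv[1:]`. Any parameter left off the command line keeps its default value.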

Download

http://goldy.wmboheroes.com/gigo/getResDocs.py

I'm interested in seeing any changes/improvements. Please share them here.

-- BrandonGoldsworthy - 08 Oct 2003
