README
Henry Feild
Most Recent Update: 08-Dec-2008
All Updates: 08-Dec-2008 (changed "DOCID" to "DOCNO" in the RUNNING
section; added a CONTACT section)
Creation Date: 21-Nov-2008
SETTING UP
Move this directory to wherever you want it 'installed'. On Swarm, this should
be done in your /work1/ directory, as this directory contains the default
stage-storage directory (the TupleFlow 'tmp' directory). This can get very large
at times, and therefore it is unwise to have it in your home directory, which
should never contain more than 1GB of data.
Once the 'extractNGramCounts' directory is moved to where you would like it to
reside, run the setup script:
$> ./setup.sh
This will initialize the extractNGramCounts.sh script and link to the Google
N-Gram files (which are in Bob's directory on Swarm). It will also give you
a line that should be added to your .bashrc file. If you do not add this line to
your .bashrc file, TupleFlow probably will not run.
RUNNING
To use the NGram count extractor once you have run ./setup.sh, you first need a
file that contains the ngrams you want counted. The file should be in the
format:
blah
a b
a c
w q e
d a
b i w
...
The doc business is not used, but it is necessary for the parsing to take place
correctly. There should be one ngram per line of the file, with the terms being
whitespace-separated.
Let's say that your ngrams are stored in the file 'my_ngrams.txt' and you would
like the counts to go into a file called 'my_ngram_counts.txt'. Then run:
$> ./extractNGramCounts.sh my_ngrams.txt my_ngram_counts.txt
The counts for your ngrams will now appear in the file 'my_ngram_counts.txt'.
When the above script is executed, the 1- through 5-gram Google NGram files are
parsed. If you know that all of the ngrams in 'my_ngrams.txt' are 1- through
3-grams, you can tell the script to only parse the 1- through 3-gram Google
NGram files:
$> ./extractNGramCounts.sh -n 3 my_ngrams.txt my_ngram_counts.txt
The argument to '-n' can be 1, 2, 3, 4, or 5 (there would be more options if
the Google NGram files went any higher :) ).
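If you are not sure how long the longest ngram in your file is, something like
the following sketch may help you pick a value for '-n'. The helper name
'max_ngram_order' is made up for this example and is not part of
extractNGramCounts. It assumes that terms are whitespace-separated, one ngram
per line, and that any non-ngram lines in the file (the doc business) contain
fewer terms than your longest ngram, so they do not change the result.

```shell
# Hypothetical helper (not shipped with extractNGramCounts): print the largest
# number of whitespace-separated terms found on any single line of a file.
max_ngram_order() {
    # NF is awk's count of whitespace-separated fields on the current line.
    # "max + 0" forces a numeric 0 to be printed for an empty file.
    awk 'NF > max { max = NF } END { print max + 0 }' "$1"
}

# Possible use, assuming the printed value is between 1 and 5:
#   n=$(max_ngram_order my_ngrams.txt)
#   ./extractNGramCounts.sh -n "$n" my_ngrams.txt my_ngram_counts.txt
```

For the example file shown earlier in this section, the longest lines have
three terms, so '-n 3' would be the natural choice.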
Both of the above steps will run TupleFlow over the cluster, meaning the job is
distributed. In the event that you want to run everything on the local computer,
you can do:
$> ./extractNGramCounts.sh -l my_ngrams.txt my_ngram_counts.txt
This will take longer in most cases (if you are processing only unigrams, this
is probably quicker due to the small amount of overhead and small amount of
parsing).
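As a rough illustration of the unigram case, the sketch below checks whether a
file holds nothing longer than a unigram, so you can decide whether '-l' is
worth trying. The helper name 'is_unigram_only' is made up for this example and
is not part of extractNGramCounts. It assumes whitespace-separated terms, one
ngram per line, and that any non-ngram lines in the file are a single term.

```shell
# Hypothetical helper (not shipped with extractNGramCounts): succeed (exit
# status 0) only if no line of the file has more than one term.
is_unigram_only() {
    # awk stops with status 1 at the first line that has two or more
    # whitespace-separated fields; otherwise it finishes with status 0.
    awk 'NF > 1 { exit 1 }' "$1"
}

# Possible use:
#   if is_unigram_only my_ngrams.txt; then
#       ./extractNGramCounts.sh -l my_ngrams.txt my_ngram_counts.txt
#   else
#       ./extractNGramCounts.sh my_ngrams.txt my_ngram_counts.txt
#   fi
```

Whether the local run is actually faster will depend on the machine and the
cluster load, so treat this as a starting point and not a rule.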
QUESTIONS / COMMENTS
Please send questions and comments to hfeild cs umass edu.
Thanks!