README
Henry Feild

Most Recent Update: 08-Dec-2008
All Updates:
    08-Dec-2008 (changed "DOCID" to "DOCNO" in the RUNNING section;
                 added a CONTACT section)
Creation Date: 21-Nov-2008


SETTING UP

Move this directory to wherever you want it 'installed'. On Swarm, this
should be done in your /work1/ directory, because this directory contains the
default stage-storage directory (the TupleFlow 'tmp' directory). That
directory can get very large at times, so it is unwise to keep it in your
home directory, which should never contain more than 1GB of data.

Once the 'extractNGramCounts' directory is where you would like it to reside,
run the setup script:

    $> ./setup.sh

This will initialize the extractNGramCounts.sh script and link to the Google
N-Gram files (which are in Bob's directory on Swarm). It will also give you a
line that should be added to your .bashrc file. If you do not add this line
to your .bashrc file, TupleFlow probably will not run.


RUNNING

To use the NGram count extractor once you have run ./setup.sh, you first need
a file that contains the ngrams you want counted. The file should be in this
format:

    blah a b a c w q e d a b i w ...

The doc business is not used, but is necessary for the parsing to take place
correctly. There should be one ngram per line of the file, with the terms
separated by white space.

Let's say that your ngrams are stored in the file 'my_ngrams.txt' and you
would like the counts to go into a file called 'my_ngram_counts.txt':

    $> ./extractNGramCounts.sh my_ngrams.txt my_ngram_counts.txt

The counts for your ngrams will now appear in the file 'my_ngram_counts.txt'.
When the above script is executed, the 1- through 5-gram Google NGram files
are parsed.
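
Because only the 1- through 5-gram files are parsed, a line with more than
five terms presumably cannot receive a count. Before starting a long run it
may be worth checking the input for such lines. The following is only a
sketch: 'long_ngram_lines' is a hypothetical helper that is not part of this
directory, and it assumes that the surrounding doc lines never contain more
than five white-space-separated fields themselves.

```shell
# Hypothetical helper, not shipped with extractNGramCounts: print every line
# of the given file that has more than five white-space-separated terms,
# prefixed with its line number. Prints nothing if every line is short enough.
long_ngram_lines() {
    awk 'NF > 5 { print NR ": " $0 }' "$1"
}

# Possible use before running the extractor:
#   long_ngram_lines my_ngrams.txt
```

If the helper prints nothing, every line holds at most five terms.
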
If you know that all of the ngrams in 'my_ngrams.txt' are 1- through 3-grams,
you can tell the script to parse only the 1- through 3-gram Google NGram
files:

    $> ./extractNGramCounts.sh -n 3 my_ngrams.txt my_ngram_counts.txt

The argument to '-n' can be 1, 2, 3, 4, or 5 (there would be more options if
the Google NGram files went any higher :) ).

Both of the above commands run TupleFlow over the cluster, meaning the job is
distributed. If you want to run everything on the local computer instead, you
can do:

    $> ./extractNGramCounts.sh -l my_ngrams.txt my_ngram_counts.txt

This will take longer in most cases. The exception is unigrams, where a local
run is probably quicker because of the small amount of overhead and the small
amount of parsing.


QUESTIONS / COMMENTS

Please send questions and comments to hfeild cs umass edu. Thanks!