Hosted at galagosearch.org.
Hosted at lemurproject.org.
This is a set of three Java classes: RetrievalEvaluator, SetRetrievalEvaluator, and SetRetrievalComparator. They compute the most common information retrieval metrics, such as average precision, precision, recall, geometric MAP, and BPREF.
Included is a jar file that can be used just like trec_eval, with similar output.
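As a rough illustration of two of the metrics named above, here is a minimal Python sketch of average precision and precision at a cutoff. It is not taken from the Java classes; the function names and the toy ranking are assumptions made for this example.

```python
def average_precision(ranking, relevant):
    """Average precision of a ranked list of document ids.

    `ranking` is ordered best-first; `relevant` holds the ids judged relevant.
    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)


def precision_at(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    relevant = set(relevant)
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Toy data, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
ap = average_precision(ranking, relevant)   # (1/1 + 2/3) / 3
p2 = precision_at(ranking, relevant, 2)     # 1 relevant in the top 2
```

Mean average precision is then the arithmetic mean of this value over all queries, and geometric MAP is the geometric mean.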
15 November 2006: Updated: Now with NDCG.
30 November 2006: Updated: Now with statistical tests.
26 January 2007: Updated: Now with randomized tests (with good convergence tests) plus NDCG@15.
ireval.tar.gz (Requires JDK 1.5 or later)
A simple utility that uses an arbitrary amount of physical memory on a system (by using mmap and mlock calls). This program is useful for testing the performance of programs under low memory conditions.
Note that you may need root access to lock memory pages.
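The same idea can be sketched in Python by calling mlock through ctypes on an anonymous mmap region. This is an assumed illustration for Linux and Mac OS X, not a port of memlock.c; the function name lock_memory is made up for this example, and the lock is expected to fail without sufficient privileges or a high enough locked-memory limit.

```python
import ctypes
import mmap


def lock_memory(num_bytes):
    """Map num_bytes of anonymous memory and try to pin it in RAM.

    Returns (mapping, locked). `locked` is False when mlock is refused,
    for example because the process lacks the privileges to lock pages.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int

    mapping = mmap.mmap(-1, num_bytes)        # anonymous, zero-filled region
    view = ctypes.c_char.from_buffer(mapping)
    try:
        result = libc.mlock(ctypes.addressof(view), num_bytes)
    finally:
        del view                              # release the buffer export
    return mapping, result == 0


# Try to pin a single page; closing the mapping also releases the lock.
mapping, locked = lock_memory(mmap.PAGESIZE)
mapping.close()
```

Keeping the returned mapping open holds the memory; a program under test then has that much less physical memory available.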
memlock.c (Tested on Linux and Mac OS X)
A web page/application for counting down the minutes until the new year. A satellite map of the world shows a green line that approximates the spot on the globe where it is currently midnight. On New Year's Eve, it will be 2007 to the right of this line and 2006 to the left. Clocks on the bottom of the screen indicate the time until midnight at cities around the globe (three of the cities represent the next three time zones that will cross midnight).
You can click here to go to a version that is optimized for 20" widescreen displays (1680x1050). You can also download a version and tweak it to fit your display (I have some scaled versions of the satellite photo in there).
This requires a recent version of Firefox or Safari to display properly (Internet Explorer will not work). For Safari, I recommend Saft to get a true browser full screen mode.
This Python library transforms simple keyword queries into more complex Indri formulations.
This program:
from xform import *
query = simple_parse( "indri query language" )
print query
print "---"
print query.transform( DependenceModel() )
prints this output (although not as nicely formatted):
#combine(indri query language)
---
#weight(0.800000 #combine(indri query language)
0.100000 #combine(#1(indri query language)
#1(indri query)
#1(query language))
0.100000 #combine(#uw12(indri query language)
#uw8(indri query)
#uw8(indri language)
#uw8(query language)))
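The structure of that output can be reproduced with a short standalone sketch. This is not the code in xform.py: the function name is hypothetical, and the rules used here (exact-phrase #1 operators over contiguous term runs, unordered windows of four times the number of terms, and the 0.8/0.1/0.1 weights) are inferred from the sample output above.

```python
from itertools import combinations


def dependence_model(terms, weights=(0.8, 0.1, 0.1)):
    """Build an Indri #weight query string from a list of terms."""
    n = len(terms)
    if n < 2:
        # No term pairs to model; fall back to a plain #combine.
        return "#combine(%s)" % " ".join(terms)

    ordered = []
    unordered = []
    for size in range(n, 1, -1):
        # Exact phrases over each contiguous run of `size` terms.
        for start in range(n - size + 1):
            ordered.append("#1(%s)" % " ".join(terms[start:start + size]))
        # Unordered windows over every subset of `size` terms.
        for subset in combinations(terms, size):
            unordered.append("#uw%d(%s)" % (4 * size, " ".join(subset)))

    return "#weight(%f #combine(%s) %f #combine(%s) %f #combine(%s))" % (
        weights[0], " ".join(terms),
        weights[1], " ".join(ordered),
        weights[2], " ".join(unordered),
    )


result = dependence_model(["indri", "query", "language"])
```

For the three-term query this yields the same operators as the output shown above, on a single line.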
xform.py
This little Python script gives you a summary of the hosts that are accessing a web page by parsing Apache access logs.
I have edited the sample below somewhat so that it only includes search engine bots in order to protect the privacy of people who access my page.
% tail -10000 access_log | grep strohman | python ~/hostify.py
googlebot.com
crawl-66-249-65-46.googlebot.com
inktomisearch.com
fj5008.inktomisearch.com
lj612196.inktomisearch.com
md301002.inktomisearch.com
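A grouping like the one above can be sketched as follows. This is an assumed reconstruction, not hostify.py itself: it takes the first field of each log line to be a hostname already (the real script may resolve IP addresses) and treats the last two labels as the domain. The sample log lines are invented around hostnames from the output above.

```python
from collections import defaultdict


def group_hosts(log_lines):
    """Map each domain (last two hostname labels) to the hosts seen under it."""
    groups = defaultdict(set)
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        domain = ".".join(host.split(".")[-2:])
        groups[domain].add(host)
    return groups


def format_summary(groups):
    """Render domains in sorted order, each followed by its indented hosts."""
    out = []
    for domain in sorted(groups):
        out.append(domain)
        for host in sorted(groups[domain]):
            out.append("    " + host)
    return "\n".join(out)


# Invented sample lines; only the leading hostname field matters here.
sample = [
    'crawl-66-249-65-46.googlebot.com - - "GET / HTTP/1.1" 200',
    'fj5008.inktomisearch.com - - "GET / HTTP/1.1" 200',
    'lj612196.inktomisearch.com - - "GET / HTTP/1.1" 200',
]
summary = format_summary(group_hosts(sample))
```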
hostify.py
This Python library is meant to allow large parameter-sweep submissions to a Grid Engine cluster. Each individual cluster job is represented by a Job object, and these objects can have dependencies on other jobs. Once a set of job objects has been created, the sge.build_submission function dispatches these jobs for execution on the cluster. The library automatically redirects stdout and stderr of the submitted jobs.
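To make the dependency idea concrete, here is a minimal sketch of jobs that depend on other jobs, ordered so that prerequisites come first and turned into qsub command lines. It does not use or reproduce the sge.py API: the Job class, submission_order, qsub_command, and the logs directory are all hypothetical. Only the qsub options -N, -o, -e, and -hold_jid are real Grid Engine flags.

```python
class Job:
    """A cluster job: a script to run plus the jobs it must wait for."""

    def __init__(self, name, script, depends_on=()):
        self.name = name
        self.script = script
        self.depends_on = list(depends_on)


def submission_order(jobs):
    """Return jobs in an order where every dependency precedes its dependents."""
    ordered = []
    state = {}

    def visit(job):
        if state.get(job.name) == "done":
            return
        if state.get(job.name) == "visiting":
            raise ValueError("dependency cycle at " + job.name)
        state[job.name] = "visiting"
        for dep in job.depends_on:
            visit(dep)
        state[job.name] = "done"
        ordered.append(job)

    for job in jobs:
        visit(job)
    return ordered


def qsub_command(job, log_dir="logs"):
    """Build (but do not run) a qsub command with redirected stdout/stderr."""
    cmd = ["qsub", "-N", job.name,
           "-o", "%s/%s.out" % (log_dir, job.name),
           "-e", "%s/%s.err" % (log_dir, job.name)]
    if job.depends_on:
        # Hold this job until the named jobs have finished.
        cmd += ["-hold_jid", ",".join(dep.name for dep in job.depends_on)]
    cmd.append(job.script)
    return cmd


# A tiny parameter sweep: two runs that both wait on one indexing job.
index = Job("index", "index.sh")
runs = [Job("run_mu%d" % mu, "run_mu%d.sh" % mu, [index]) for mu in (500, 1500)]
commands = [qsub_command(job) for job in submission_order(runs)]
```

The commands are only built as argument lists here; passing each one to subprocess would submit it.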
sge.py (Requires Python 2.x, Sun Grid Engine 6.0)
This Python library lets you work with multi-column text files (e.g. TREC judgments, ranking files, and trec_eval output) in ways similar to what you'd expect from a relational database. Supports join, project, permute, and aggregate (similar to SQL GROUP BY) along with a function for reading text files in multi-column format.
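The relational operations mentioned above can be sketched on plain lists of rows. These helpers are illustrative assumptions and do not reflect the actual function names or signatures in fielded.py; the judgment and ranking lines are invented data in the usual TREC column layouts.

```python
from collections import defaultdict


def read_columns(lines):
    """Split whitespace-separated lines into rows, skipping blank lines."""
    return [line.split() for line in lines if line.strip()]


def project(rows, columns):
    """Keep only the given column indexes, in the given order."""
    return [[row[c] for c in columns] for row in rows]


def join(left, right, left_key, right_key):
    """Inner join: concatenate each left row with every matching right row."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


def aggregate(rows, key, value, func):
    """Group rows by column `key` and reduce column `value` with `func`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Invented judgments: topic, iteration, document, relevance.
qrels = read_columns(["301 0 D1 1", "301 0 D2 0", "302 0 D3 1", "302 0 D4 1"])
# Invented ranking: topic, Q0, document, rank, score, run name.
ranking = read_columns(["301 Q0 D1 1 9.5 run"])

relevant_per_topic = aggregate(qrels, 0, 3, lambda vals: sum(int(v) for v in vals))
judged = project(join(ranking, qrels, 2, 2), [0, 2, 9])  # topic, document, relevance
```

Joining on the document column alone is enough for this toy data; real TREC files would normally need a combined topic and document key.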
fielded.py (Requires Python 2.x)
This Python script will fetch the BibTeX for any paper in the ACM Digital Library and automatically add it to your BibTeX file. It's used like this:
% python acm_fetch.py \
"Learning relational probability trees"
@inproceedings{956830,
author = {Jennifer Neville and David Jensen
and Lisa Friedland and Michael Hay},
...
}
13 February 2007: Updated: Now downloads the papers and stores a path to each paper (and its MD5 hash) in the BibTeX file. It also lets you choose among multiple candidate papers when the search results are ambiguous, and supports the Yahoo! API.
A small patch to Eric Brill's rule-based tagger (available here) to get it running on OS X. This also includes a small bug fix (lex.c: numspaces). Once you download the original tagger source and my patch, apply it and build like this:
uncompress RBT1_14.tar.Z
tar xvf RBT1_14.tar
cd RULE_BASED_TAGGER_V1.14
patch -p1 <../brill.patch
make
Note that I did not attempt to change many of the fixed-size buffers in the code; the tagger assumes that each line of text is shorter than 5000 characters and that each word is shorter than 256 letters. If you go over these bounds, the tagger will behave unpredictably. I found it easier to get around these limits by preprocessing than by changing the tagger code.
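One possible form of that preprocessing is sketched below. It is an assumption, not the preprocessing actually used here: the function name is hypothetical, over-long words are simply truncated (which changes what the tagger sees), and long lines are re-wrapped at word boundaries.

```python
MAX_LINE = 5000   # tagger assumes lines shorter than this
MAX_WORD = 256    # tagger assumes words shorter than this


def fit_to_limits(line, max_line=MAX_LINE, max_word=MAX_WORD):
    """Split one input line into pieces that stay under both limits."""
    pieces = []
    current = []
    length = 0  # length of " ".join(current)
    for word in line.split():
        word = word[:max_word - 1]            # truncate over-long tokens
        extra = len(word) + (1 if current else 0)
        if current and length + extra >= max_line:
            pieces.append(" ".join(current))  # flush before the limit is hit
            current = []
            length = 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        pieces.append(" ".join(current))
    return pieces


# Small limits make the behavior easy to see.
pieces = fit_to_limits("aaaaaa bb cc dd ee", max_line=10, max_word=4)
```

Each returned piece can then be written to the tagger's input file as its own line.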
brill.patch
This script, when run as root on a fresh Fedora Core install, sets up the machine to be used as an automatic build test system. The script adds a special user to the machine called "indritest" that downloads a special script on login, executes it, deletes all local files, then shuts down the machine. The script also adds a new boot option called "indritest", which is the new default, that automatically logs in as user indritest.
This is primarily useful for making virtual machines using virtualization software like VMware, Parallels, or Xen. If you want to use this for your own purposes, be sure to change the URL variable in the script so your machine downloads your script, not mine.
indritestsetup.sh (Requires a fresh Fedora Core Linux install and root access)
Scholar is an OS X application for viewing research papers in PDF format. It automatically detects multicolumn page layouts and disassembles them so the columns are displayed in order. The resulting effect is a page layout like an online newspaper or blog, which makes the paper easier to read on screen.
I highly recommend using Zazen to dim your other applications while using Scholar.
Scholar (Requires Mac OS X)