James Allan asked a few of the senior graduate students in the lab to write or present a bit about how we do research, and this is my contribution. Some of the material on this page will be specific to the CIIR, including things like machine names and collection locations. Other things are more general ideas that might apply to other information retrieval researchers.
Write every day. Do some kind of writing every day. Some people keep a research notebook with nice, neat, focused research writing. I use a program called MacJournal (recommended by Fernando), and I write just about anything I'm thinking about. I find that writing about ideas forces me to think about them more clearly and critically, and it can serve as a thought accelerator. In addition, writing in a journal is valuable writing practice. I am now in the habit of writing any time I feel "stuck", which sometimes happens many times a day. Writing helps me get back on track.
Work on presentation skills. I mean presentation broadly, as in any way that you present your ideas, whether by e-mail, in conversation with labmates, in posters, or in research talks. Conveying your ideas clearly is a critical part of being a researcher. Your ideas will not make an impact on the world until other people understand them. As a consequence, notice that the best-known researchers are some of the best communicators.
Learn about research. There are lots of excellent resources about how to do research. Your advisor is a great resource. David Jensen teaches a course on Research Methods that I highly recommend. Even if you are not taking the class now, consider visiting the website and reading through some of the resources and papers. Reading this material is time well spent.
The information retrieval problem: Suppose a user submits a query Q. Find the best possible ranking of a collection of documents C so that the most relevant documents are first, and the least relevant documents are last.
Most research in the CIIR focuses on solving this problem. More formally, most of our research proposes some function f(Q, D) that, given a query and a document, produces a "score". Our hypothesis is that ordering the documents in C by the score f(Q, D) gives a better ranking than ordering them by some other function g(Q, D). An important part of what we do is testing this hypothesis.
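To make the idea of ranking by f(Q, D) concrete, here is a minimal Python sketch. The scoring function (a simple count of query-term occurrences), the function names, and the three-document collection are all hypothetical illustrations, not anything used in the lab.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def overlap_score(query: str, document: str) -> float:
    """Toy f(Q, D): count how often the query terms occur in the document."""
    term_counts = Counter(document.lower().split())
    return float(sum(term_counts[term] for term in query.lower().split()))


def rank(query: str, collection: Dict[str, str],
         score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Order the documents in C by descending score f(Q, D)."""
    scored = [(doc_id, score(query, text)) for doc_id, text in collection.items()]
    # Highest score first; ties are broken by document id so the order is stable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document collection.
collection = {
    "d1": "information retrieval ranks documents",
    "d2": "cooking with tomatoes",
    "d3": "retrieval of information about information retrieval",
}
ranking = rank("information retrieval", collection, overlap_score)
print(ranking)
```

Comparing f against g then amounts to calling rank twice with different score functions and evaluating both rankings against the same relevance judgments.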
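We test this kind of hypothesis using an evaluation corpus, which consists of a collection of documents, a set of queries, and relevance judgments that say which documents are relevant to each query.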
The relevance judgments are typically binary, meaning that a document is either relevant for a query or it is not. Multi-level relevance judgments (highly relevant, possibly relevant, not relevant) are possible but not currently common.
Typically the documents, queries, and relevance judgments come from TREC. The TREC documents come in a standard format that you can use directly with search tools like Lemur, Indri and Galago. The queries come in topic files, which you will probably need to convert into some form that your search tools understand. In general, you will first create an index from the document collection, then run a program that processes your set of queries using your new technique and produces a ranked list of results. You then use a program like trec_eval or ireval to evaluate your results. We typically measure effectiveness using Mean Average Precision or Precision@10, but more recent experiments have used NDCG as well.
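For reference, the two measures named above can be written in a few lines of Python. This is only a sketch to show what is being computed, assuming binary judgments and using hypothetical function names and data; for real experiments use trec_eval or ireval.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top k results that are relevant (always divides by k)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average of average_precision over all judged queries."""
    return sum(average_precision(runs.get(query, []), relevant)
               for query, relevant in qrels.items()) / len(qrels)


# Hypothetical ranked lists and judgments for two queries.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c", "e"}, "q2": {"y"}}
print(precision_at_k(runs["q1"], qrels["q1"], k=2))
print(average_precision(runs["q1"], qrels["q1"]))
print(mean_average_precision(runs, qrels))
```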
In the future, most data collections will be found on sydney.cs.umass.edu in /work2/COLLECTIONS. The old data location was ~irdata on dandenong.cs.umass.edu, so if you can't find what you need on sydney, try dandenong:~irdata. As a last resort, try connecting to indri3.cs.umass.edu, then digging through Don Metzler's directory: /usr/ind3/tmp1/indri/metzler. Don keeps an enormous amount of great TREC data in there, but unfortunately it's not easy to find.
Some indexes are already built and stored in sydney:/work2/INDEXES. Look here first to see if your index has already been built, especially if you want to work on a large document collection.
Lemur, Indri and Galago are the most commonly used search tools in the lab, although there are other open source search toolkits out there (including Lucene, Terrier, Wumpus and Zettair).
David Fisher is the UMass maintainer of Lemur and Indri and is more than qualified to answer your questions on these toolkits. If you are using Indri, please see my Indri Tips page before asking David a question.
The "change the search engine" approach may have been less attractive in the past because Lemur is written in C++, and students didn't particularly like C++. Galago is written in Java, so if you're more comfortable with Java it might be a good place to start implementing.
Don Metzler has packaged all of the parameter files we used with Indri for the TREC 2005 Terabyte track. This is an excellent resource for new Indri users who want to see how to do the "change the queries" approach to research, and a good introduction to the Indri search engine.
Our lab is fortunate enough to have some impressive computational resources. We have a lot of Sun servers, including a 16-processor machine called fitzroy. We also have a number of individual Linux servers, including indri1 through indri6, and some others that are dedicated to particular projects (like kilcoy and burnie). The bulk of student research happens on our computational cluster, sydney.
sydney is not one machine, but approximately 30 machines connected together, for a total of over 70 usable processors. You connect to the sydney head node using ssh. Once you've connected, you can type qsub my-research-work.sh, and your script my-research-work.sh will be run on one of the available processors as soon as possible. Often the system will have some idle processors, so your program will start right away. At other times, when all the processors are busy, your job might wait a while before it is executed. The qsub command will complete immediately in either case.
#$ -S /bin/bash
If you want to use a binary program, it's easiest to just call it from a shell script. There is a binary mode for qsub (-b y), but I find it doesn't work sometimes. If you want to do something more complicated, you can use array jobs. Typing this:
% qsub -t 1-10 myscript.sh
will cause myscript.sh to run on 10 different processors. However, the environment variable SGE_TASK_ID will be set differently on each processor. The first copy of the script will run with $SGE_TASK_ID == 1, while another will have $SGE_TASK_ID == 2, continuing on to 10. This is particularly useful when you have to search a large space of parameters.
I have a Python library called sge that may help you schedule more complicated Grid Engine jobs. This library is especially good at handling job dependencies (when you want script B to run only after all of the copies of script A have run).
Using qsub is a rather crude way to handle the parallelism of the cluster. In the future I think more students will move toward programs that are inherently parallel. The Galago toolkit is meant to work this way. If you build text processing tasks on top of Galago, they will run in parallel automatically on sydney.
Another good example is Pig, which is like a very flexible query language for big datasets. I have ported Pig so that it runs on Galago, which means that anything you write in the Pig language can automatically take advantage of the cluster resources.
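For the parameter-search use of array jobs, one common pattern is to let each task turn its task index into one point of a parameter grid. The Python sketch below assumes the index arrives in the SGE_TASK_ID environment variable, which is how Grid Engine exposes it to array tasks. The grid itself (parameters named mu and fb_docs) and the function name are hypothetical. The shell script submitted with qsub -t 1-10 would simply call this program.

```python
import itertools
import os
from typing import Dict, Sequence


def parameters_for_task(task_id: int, grid: Dict[str, Sequence]) -> Dict[str, object]:
    """Map a 1-based array task id to one combination of parameter values."""
    names = sorted(grid)  # fixed order, so every task sees the same enumeration
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 1 <= task_id <= len(combos):
        raise ValueError(f"task id {task_id} is outside 1..{len(combos)}")
    return dict(zip(names, combos[task_id - 1]))


# Hypothetical grid with 2 x 5 = 10 points, matching "qsub -t 1-10".
GRID = {"mu": [500, 1000, 1500, 2000, 2500], "fb_docs": [10, 20]}

# Outside an array job the variable is missing or not a number; fall back to task 1.
raw_task_id = os.environ.get("SGE_TASK_ID", "1")
task_id = int(raw_task_id) if raw_task_id.isdigit() else 1
print(parameters_for_task(task_id, GRID))
```

Because every task enumerates the grid in the same sorted order, no coordination between the tasks is needed.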
To monitor the progress of your work, you can use the qstat command. Typing qstat will show you all jobs that are currently running or queued. qstat -f shows you that same information, but also shows which jobs are running on which machines (and which machines have idle processors). qstat -ext shows you more information than you could ever want to know.
Here is some example output from qstat:
job-ID  prior   name       user   state submit/start at     queue
------------------------------------------------------------------------------------------
573991  0.51432 v.r0.pbqcl selfso r     09/06/2007 15:52:40 all.q@compute-0-0.local
572057  0.50830 LDA3_1400  yixing r     08/27/2007 11:30:29 all.q@compute-0-1.local
573992  0.51432 v.r0.sumpb selfso r     09/06/2007 15:53:25 all.q@compute-0-10.local
573764  0.50830 LDA5_1200  yixing r     09/05/2007 11:21:56 all.q@compute-0-11.local
573781  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-12.local
573782  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-12.local
573803  0.51432 v.10g.thr2 selfso r     09/05/2007 15:38:50 all.q@compute-0-13.local
573978  0.51432 v.10g.pbqc selfso r     09/06/2007 10:52:02 all.q@compute-0-14.local
573736  0.50830 LDA6_600   yixing r     09/04/2007 20:33:41 all.q@compute-0-15.local
573787  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-16.local
573785  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-17.local
573784  0.51500 run_mdc.ba ronb   dr    09/05/2007 12:30:33 all.q@compute-0-18.local
573788  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-19.local
573786  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-2.local
573737  0.50830 LDA6_800   yixing r     09/04/2007 20:33:41 all.q@compute-0-20.local
573780  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-21.local
573783  0.51500 run_mdc.ba ronb   r     09/05/2007 12:30:33 all.q@compute-0-21.local
572065  0.50830 LDA4_1400  yixing r     08/27/2007 11:30:29 all.q@compute-0-22.local
573738  0.50830 LDA6_1000  yixing r     09/04/2007 20:33:41 all.q@compute-0-23.local
573763  0.50830 LDA3_1200  yixing r     09/05/2007 11:21:26 all.q@compute-0-24.local
572074  0.50830 LDA5_1600  yixing r     08/27/2007 11:31:29 all.q@compute-0-25.local
573993  0.51432 v.10g.sump selfso r     09/06/2007 16:37:25 all.q@compute-0-26.local
573801  0.51432 v.10g.rm.t selfso r     09/05/2007 15:26:50 all.q@compute-0-28.local
573994  0.51432 v.go.sumpb selfso r     09/06/2007 16:38:10 all.q@compute-0-29.local
573814  0.51432 v.go.thr2. selfso r     09/05/2007 18:19:20 all.q@compute-0-3.local
573756  0.50830 QLOGIN     yixing r     09/05/2007 00:46:12 all.q@compute-0-30.local
573532  0.51432 v.r0.thr2. selfso r     09/04/2007 12:55:49 all.q@compute-0-31.local
573979  0.51432 v.go.pbqcl selfso r     09/06/2007 10:53:47 all.q@compute-0-31.local
573982  0.50830 QLOGIN     yixing r     09/06/2007 13:29:10 all.q@compute-0-32.local
573813  0.51432 v.go.rm.tr selfso r     09/05/2007 18:19:05 all.q@compute-0-4.local
572073  0.50830 LDA5_1400  yixing r     08/27/2007 11:31:29 all.q@compute-0-5.local
572058  0.50830 LDA3_1600  yixing r     08/27/2007 11:30:29 all.q@compute-0-6.local
572066  0.50830 LDA4_1600  yixing r     08/27/2007 11:30:29 all.q@compute-0-7.local
572050  0.50830 LDA2_1600  yixing r     08/27/2007 11:30:29 all.q@compute-0-8.local
572049  0.50830 LDA2_1400  yixing r     08/27/2007 11:30:29 all.q@compute-0-9.local
You can also monitor your jobs on the web. Unfortunately, this only works from machines in the Computer Science department that are not on Kaosnet. Type sydney.cs.umass.edu into the address bar on your browser to see the cluster main page. Ganglia will show you graphs of how the cluster is being used, and Cluster Top shows all of the processes running on the cluster. There is also a Job Queue page that is similar to the output of qstat.
I find the output of qstat hard to read quickly, so I wrote my own tool called tstat, which has output that looks like this:
--- Wait Times ---
0 minutes | 42 | ******************************************
10 minutes | 0 |
30 minutes | 0 |
1 hour | 0 |
3 hours | 0 |
6 hours | 4 | ****
1 day | 2 | **
Longer | 28 | ****************************
--- Running ---
ronb | 9 | *********
selfso | 11 | ***********
yixing | 13 | *************
--- NAS Load ---
nas-1-3 0.00
nas-1-2 0.07
nas-1-0 3.20
nas-1-1 0.00
The top section, under "Wait Times", is a histogram telling you about each running process in the system. The "0 minutes" line tells us that there are 42 processors idle right now. 4 processes have been running for between 3 and 6 hours, 2 processes have been running for between 6 hours and 1 day, and 28 processes have been running for more than 1 day.
The "Running" section tells us that 9 of the processes belong to ronb, 11 belong to selfso, and 13 belong to yixing.
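A per-user summary like the "Running" section can be derived from plain qstat output. The Python sketch below is not the tstat source, just one hypothetical way to do it. It assumes the column layout of the qstat example shown earlier (job id, priority, name, user, state, then the rest) and that job names contain no spaces.

```python
from collections import Counter
from typing import Iterable


def jobs_per_user(qstat_lines: Iterable[str]) -> Counter:
    """Count running jobs per user from lines of plain qstat output."""
    counts: Counter = Counter()
    for line in qstat_lines:
        fields = line.split()
        # Job rows start with a numeric job id; this skips the header and the rule.
        if len(fields) >= 5 and fields[0].isdigit():
            user, state = fields[3], fields[4]
            if "r" in state:  # "r" is running, "dr" is running but marked for deletion
                counts[user] += 1
    return counts


def render(counts: Counter) -> str:
    """Draw one bar of asterisks per user, in alphabetical order."""
    return "\n".join(f"{user} | {counts[user]} | {'*' * counts[user]}"
                     for user in sorted(counts))


# A few rows copied from the qstat example.
SAMPLE = [
    "job-ID prior name user state submit/start at queue",
    "------------------------------------------------",
    "573991 0.51432 v.r0.pbqcl selfso r 09/06/2007 15:52:40 all.q@compute-0-0.local",
    "572057 0.50830 LDA3_1400 yixing r 08/27/2007 11:30:29 all.q@compute-0-1.local",
    "573781 0.51500 run_mdc.ba ronb r 09/05/2007 12:30:33 all.q@compute-0-12.local",
    "573784 0.51500 run_mdc.ba ronb dr 09/05/2007 12:30:33 all.q@compute-0-18.local",
]
print(render(jobs_per_user(SAMPLE)))
```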
The last section monitors the load average on the four disk servers. At the moment, nas-1-0 usually does most of the work. If the load on nas-1-0 gets high, it can cause the cluster to run very slowly.
The job scheduler on sydney is set up to be as fair as possible to everyone. I'll try to explain the basics of how it works here.
If Alice is the only person using the cluster, she can use all 70+ processors herself. If Bob comes along and starts submitting jobs, sydney will try to give Alice and Bob 35 processors each. However, sydney may try to give Bob 40 or 45 processors for a little while to compensate for the fact that Alice has been getting the whole benefit of the cluster until recently. Sometimes this works well, and sometimes not so well.
The most important thing to notice is that the job queue is not first in, first out (FIFO). Your own jobs are FIFO, but having a truly FIFO job queue is not fair to other users. Therefore, sydney tries to mix job execution so that everyone gets a chance. Because of this, don't feel timid about queuing up a week's worth of work on sydney. The scheduler usually figures out how to fairly allocate CPU time to everyone.
Unfortunately, sydney's scheduler has trouble with very long running jobs. Imagine if Alice queued her week's worth of work onto sydney and started using all 70 processors. Bob comes along and wants to use some processors too, but sydney will not stop a program from running on a processor until it completes. Therefore, Bob has to wait until some of Alice's jobs finish. Bob is effectively locked out of the cluster until Alice's current jobs complete.
Therefore, to be nice to your fellow students, try to make sure that each individual process you submit to sydney takes less than 2 hours to run. You can submit as many processes as you want (some students have been known to queue up thousands of jobs at a time) as long as each one takes less than 2 hours. We realize that some kinds of programs simply take longer than 2 hours to run. If that's the case, please only run a few of those jobs at once and make sure to leave some free processors for others.
Since you have over 70 processors at your disposal, it's good to think about how you can split up your work into different processing tasks. A job that would take 100 hours on your laptop might take 1 hour on sydney, but only if you can break it up into 200 independent pieces. Many of the algorithms we use in the lab can be split this way, though.
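One simple way to break work into independent pieces is to split the list of inputs (queries, documents, parameter settings) into roughly equal chunks and give each chunk to its own job. The Python sketch below is a generic illustration with a hypothetical function name. The right number of pieces depends on how long each item takes, and should be chosen so that every piece finishes well within the 2 hour guideline.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_pieces(items: Sequence[T], pieces: int) -> List[List[T]]:
    """Split items into `pieces` contiguous chunks whose sizes differ by at most one."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    base, extra = divmod(len(items), pieces)
    chunks: List[List[T]] = []
    start = 0
    for index in range(pieces):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if index < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


# Hypothetical example: 10 query ids spread over 3 jobs.
query_ids = list(range(10))
print(split_into_pieces(query_ids, 3))
```

If there are more pieces than items, the trailing chunks are simply empty, so a job that receives an empty chunk can exit immediately.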
a = LOAD '/work/collections/queryLog' as (time, query, session, resultCount);
b = GROUP a BY query;
c = FOREACH b GENERATE group as query, COUNT(a);
The first line loads the queries from the file queryLog. In the second line, the queries are grouped by the query field into nested tuples (more information on the Pig website). The final line counts the number of times each query appears. For a 1GB query log (15 million queries), this takes about 2 minutes to run on sydney. If you try doing something similar with the Unix sort command, you will quickly appreciate how much faster Pig is. The Pig/Galago combination hasn't been released yet. You can talk to me if you want a copy.
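To check what the script computes on a small sample, the same group-and-count can be written in a few lines of single-machine Python. This sketch assumes the log is tab-separated with the four fields named in the LOAD statement. The function name and sample rows are hypothetical, and nothing here runs in parallel.

```python
from collections import Counter
from typing import Iterable


def count_queries(lines: Iterable[str]) -> Counter:
    """Count how many times each query string appears in the log."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        _time, query, _session, _result_count = fields
        counts[query] += 1
    return counts


# Hypothetical log rows: time, query, session, resultCount.
sample_log = [
    "1001\tindri tips\ts1\t10\n",
    "1002\ttrec eval\ts1\t25\n",
    "1003\tindri tips\ts2\t10\n",
]
for query, count in count_queries(sample_log).most_common():
    print(query, count)
```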