Project

The purpose of the project part of this course is to ground the presentations and discussions in something concrete.  The project is broken into a large number of parts, but has two major stages.  In the first stage you will be carrying out a typical experimental design and setup to evaluate IR effectiveness.  In the second stage, you will extend that evaluation in a direction that is of interest to you -- and is approved by the professor.

Depending on what you choose as a project, you may need to run your final system on a CIIR computer.  Some of the data we have available is licensed to be used only on the CIIR computers and cannot be distributed elsewhere.  If you end up in that situation, we should be able to construct a small surrogate data set for use on your personal computer, and we may be able to let you use one of our clusters for running your final experiments on the larger data.  (It will depend on whether there are competing needs.)


Stage One - Basics

This stage of the project includes eight parts (P1a through P1h).  In all honesty, the reason for having several parts is to ensure that your fellow students (not you, of course!) start working on the project well in advance of its deadline.  (It also means that some of the homeworks can tie into the project work.)  For that reason, very little is required of you at the various part deadlines.

Stage One projects must be done individually unless you can provide a compelling reason for collaborating.  (Stage Two projects may be collaborative.)

 P1a (Stage 1, part a), define task.  The intent of the Stage One project is to run a variation on a classic "document retrieval" task.  Your job for this first part is to decide on what the data will be and what sorts of queries will be asked of the data.  For example, if you were building a Web search engine, your data might be HTML Web pages and the queries would be general interest.  If, though, you were building an image retrieval system, the data would be images and the queries might be sample images that represent the class of images you want. 

It is probably easiest to limit yourself to tasks that can be run on collections of documents we have handy.  Here are some statistics about collections that should be usable:

Name                         #docs             #MB               #queries                     Other information
AP news                      242,918           728               more than 300                newswire
WSJ news                     173,252           509               more than 300
GALE news (various sources)  93,143            617               60+ semi-structured queries  Documents are more heavily marked up than AP/WSJ
GALE English broadcast       1,117             14                60+                          Output of a speech recognition system with very heavy markup.  This is probably too small, so would have to be augmented.
GALE Chinese newswire        57,211            2,000             60+                          Output of a statistical MT system with lots of markup
GALE Arabic newswire         101,511           3,000             60+                          Output of a statistical MT system with lots of markup
GALE English blogs           26,919                              60+                          blogs
Enron                        200,399 messages  400 (compressed)                               Email collection gathered from Enron investigation.  Information about the corpus is available here.
Gov2                         22,794,000        427,000           150+                         Web pages from *.gov
Corel                        5,000                                                            Images and keyword labels
SIGIR 1978-2002              1,066             953               none                         IR research papers (in PDF?)

Variations and subsets of the collections are certainly usable.  Other collections that are similar to those may also be used; these just happen to be some that are readily available.  In general, collections with 100,000 or more documents are more interesting, though not always available.  If you have access to a comparably sized collection, please check with the professor to see whether it will be acceptable.  You are by no means limited to the collections listed above or to collections that the IR Lab has on hand.

While you are thinking about this part of the project, you may find it useful to start thinking about the Information Retrieval system that you'll use.  Part P1c has some additional details on that decision.

Hand in a very brief (half page) description of the collection you want to use and the task that you envision on that collection.

 P1b, get data.  For this part, your only task is to identify precisely which data you will be using and to get it to some place where you can access it.  If you're using a large collection on the CIIR machines, this means you should have an account set up and a pointer to the source of the collection.  If you're using your own computer, you should now have a copy of the data resident on your computer.  In either case, you should have a small subset (100 or so documents) in a place where you can readily access it.  (The reason for the subset is to have something for doing trial runs of the remaining parts of the project.)
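One way to pick the subset is to draw a reproducible random sample of document identifiers.  The sketch below is only an illustration: the function name, the seed, and the identifiers are hypothetical, and it assumes you can already list the identifiers in your collection.

```python
import random

def sample_subset(doc_ids, size=100, seed=0):
    """Return a reproducible random sample of document identifiers."""
    ids = sorted(doc_ids)        # sort so the sample does not depend on input order
    if len(ids) <= size:
        return ids               # collection is already small enough
    rng = random.Random(seed)    # fixed seed so the same subset is chosen every time
    return sorted(rng.sample(ids, size))

# Hypothetical identifiers standing in for a real collection.
all_ids = [f"doc{i:05d}" for i in range(2500)]
subset = sample_subset(all_ids)
print(len(subset))
```

Fixing the seed means that trial runs in later parts always see the same documents, which makes problems easier to reproduce.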

Hand in a description of where the collection is, how big it is (number of documents and number of megabytes), and what you are using for a subset.

 P1c, clean data.  Convert the data into a format that is usable by the Information Retrieval system that you'll be using.  This may mean stripping characters that are not properly handled (is the system eight-bit clean?), ensuring that the HTML looks like compliant XML, adding XML tags to delimit fields, inserting document boundary markers, and so on.  What you need to do will depend on the system that you are using, so you'll need to read the documentation for that system to be pretty sure of getting the right format.
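As a rough illustration of this kind of cleaning, the sketch below strips control characters, escapes characters that are special in XML, and wraps each document in boundary markers.  The TREC-style DOC/DOCNO/TEXT layout is an assumption about the target format, and the function names and sample text are hypothetical; check your system's documentation for the format it actually expects.

```python
import re
from xml.sax.saxutils import escape

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(raw):
    """Drop control characters and escape &, <, and > so the text is safe inside XML."""
    return escape(_CONTROL_CHARS.sub("", raw))

def wrap_document(doc_id, raw_text):
    """Wrap one document in TREC-style boundary markers (an assumed target format)."""
    return (
        "<DOC>\n"
        f"<DOCNO>{escape(doc_id)}</DOCNO>\n"
        "<TEXT>\n"
        f"{clean_text(raw_text).strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

# Hypothetical document with a stray NUL byte and XML-special characters.
print(wrap_document("sample-0001", "Profits at AT&T rose\x00 in Q3 <unverified>."))
```

This treats the input as plain text.  If your documents already carry markup that you want to keep as fields, escaping everything would destroy it, so you would need to clean the text inside each field instead.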

Here are several plausible systems that you might use.  All are open-source.

  1. Indri, from UMass Amherst and Carnegie Mellon University.  This search engine is based on the Lemur toolkit, incorporates "best in class" capabilities, and can be used out of the box.  It provides a very powerful and flexible query language.  It was largely developed by and is heavily used at UMass, so there is a reasonable collection of knowledgeable users locally.  Indri is written in C++.
  2. Lemur, also from UMass Amherst and Carnegie Mellon University.  Lemur is a powerful and flexible toolkit for building IR systems, particularly those that are built upon the language modeling framework.  Lemur will allow you to do almost anything, but requires programming to do more than the basics.  It is written in C++.
  3. Lucene, now part of the Apache project.  It is written in Java, originally by Doug Cutting (at Yahoo and once at Excite).  Lucene is widely used as the search engine for Web applications (e.g., it is the search system for Wikipedia).
  4. Terrier, from the University of Glasgow.  Written in Java.  Terrier has been used successfully in a number of TREC tracks. 
  5. Zettair, from the Search Engine Group at RMIT.  Zettair is written in C and is designed in particular to be simple, fast, and flexible when handling large amounts of text.

Hand in a description of what you did to clean the data, which search engine you're going to use and why it required that cleaning, and a sample document before and after cleaning.  (If the document is long, please crop it so that you are showing some examples of interesting changes and not pages and pages of the same thing over and over again.)

 P1d, index data.  Use your chosen retrieval system to index the collection and then try running a query of some sort to demonstrate that it actually works.  Ensure that the top-ranked documents look reasonable based on that query.

Hand in some sort of output that shows you indexed the collection and that illustrates the size of the indexed data.  This might be the output generated by the indexing component, or it might be a log file or parameter file that describes the result of the process.  You should also hand in the result of running your toy query and the titles of the top handful of documents (or enough to show that it did something reasonable).

Note that this is really just a sanity check to ensure that things are working and that you will be able to do the subsequent parts of the project.  You might look at part P1e for some more detailed discussion of the queries, but that is not critical.

 P1e, run queries.  Create queries in line with the task you defined in P1a.  Run the queries and show the output. 

You should create a minimum of 10 queries that are in line with the task that you have defined.  What the queries actually look like will depend on your task.  They might be complex Boolean expressions of an information need, they might be lengthy full-sentence statements of what is interesting, or they might be just a few important key words.  Whatever the queries look like, your system will interpret the queries in the light of the task that you've defined.  For example, if you had built a system that retrieved only documents about "shoes" then the query "hush puppy" would retrieve documents about that brand of shoe and not a recipe for deep fried cornmeal bread.

You will ultimately want to generate output that shoes...er, shows the top 100 or more documents in ranked order.  For purposes of this part, though, show the top 5-10 retrieved documents for each query.  If your collection has titles, just list the titles; if it does not, use something that is comparable.  Please do not dump the full text of 10 documents if you can reasonably avoid it.
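If you expect to score your runs with trec_eval later, it may be convenient to write the full ranked output in the six-column run format that tool reads (query id, the literal Q0, document id, rank, score, run tag).  The sketch below assumes your system gives you (document id, score) pairs per query; the function name, run tag, and identifiers are hypothetical.

```python
def format_run(results, run_tag="my_run", depth=100):
    """Format ranked results as 'qid Q0 docid rank score tag' lines.

    results maps a query id to a list of (doc_id, score) pairs.
    """
    lines = []
    for qid in sorted(results):
        # Highest score first; keep only the top `depth` documents.
        ranked = sorted(results[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked[:depth], start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines

# Hypothetical scores for a single query.
example = {"q1": [("AP-2", 7.5), ("AP-1", 9.25)]}
for line in format_run(example):
    print(line)
```

For the hand-in itself, the top 5-10 titles per query are still what is wanted; a file like this is just a convenient form for the longer ranked lists.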


 P1f, evaluation.  Judge some documents (or whatever is needed) and evaluate the results of your runs.  Decide how well your system addresses the task at hand and justify your evaluation based on that.

For each of your 10 (or more) queries, judge some of the documents.  It may be sufficient to look at precision in the top ranked documents, in which case you could get by looking at, say, the top 10 per query.  If you want to do a deeper analysis, you may have to do some exploring to find additional relevant documents.  For example, suppose that you wanted to calculate (or estimate) MAP.  In that case you need to do a reasonable job of trying to find all relevant documents so that you can estimate recall.  (If you're feeling bold, you can try to implement the "Minimal test collection" approach of Carterette et al.)

Given ranked lists and judgments, calculate evaluation measures for each query and then average them appropriately.  Present the individual scores as well as the overall average.  You may use a freely available tool such as trec_eval if that appeals to you, or you may write your own evaluation code.
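If you write your own evaluation code, a minimal sketch of precision at k and (mean) average precision might look like the following.  The data are made up for illustration, the function names are hypothetical, and the code assumes binary relevance judgments and ranked lists with no duplicate documents.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top k ranked documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

# Hypothetical ranked lists and judgments for two queries.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7", "d8"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6", "d9"}}

per_query_ap = {q: average_precision(runs[q], qrels[q]) for q in runs}
for q, ap in sorted(per_query_ap.items()):
    print(q, round(ap, 4))                            # individual scores
print("MAP", round(mean(per_query_ap.values()), 4))   # overall average
```

Note that average precision divides by the number of relevant documents known for the query, which is why missing judgments distort it: relevant documents you never found are silently left out of the denominator.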

Also provide a justification for why the evaluation you chose is the right one for this task.  For example, suppose that your task was a "known item" retrieval task--i.e., there is exactly one correct answer to any query.  In that case, MRR is probably reasonable, but F1 is probably not.
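For the known-item case, MRR can be computed in a few lines.  The sketch below assumes exactly one correct document per query; the function names and the sample data are hypothetical.

```python
def reciprocal_rank(ranking, target):
    """Return 1/rank of the single correct answer, or 0.0 if it was not retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, answers):
    """Average the reciprocal rank over all queries that have a known answer."""
    scores = [reciprocal_rank(runs.get(q, []), answers[q]) for q in answers]
    return sum(scores) / len(scores)

# Hypothetical runs: the answer is at rank 2 for q1, rank 1 for q2, and missing for q3.
runs = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"], "q3": ["d4"]}
answers = {"q1": "d1", "q2": "d9", "q3": "d7"}
print(mean_reciprocal_rank(runs, answers))
```

The measure rewards putting the one right answer near the top and ignores everything ranked after it, which matches a task where the user stops as soon as the item is found.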

 P1g, presentation.  An in-class presentation of the results of the experiments you carried out.  You will have 7-8 minutes to present what you did, discuss how you did it, give an overview of the evaluation, and draw conclusions about what did or did not work.  More specifically, your presentation should include information about the task:

  1. What is your system designed to do?  It may be as simple as "retrieve newswire documents" or it might be more complicated.
  2. What corpus are you running queries against?
  3. What do queries look like?  Give some examples of turning a task into a query.
  4. How will you evaluate the results of running a query?
  5. What do your evaluation queries look like?  Give some examples.
  6. What are the evaluation results?  Present multiple measures if that's what is needed to convey a sense of what worked.
  7. What conclusions can you draw about this task and your system's ability to address the task?  Did you do the right evaluation?  What sort of evaluation might have been better?  What other things might you try to improve your system's ability to solve the problem?

Come to class prepared to give this quick presentation.  If you do not have a laptop, please contact the instructor to arrange to use his laptop for the presentation.


 P1h, written report.  A written report detailing the results of the project.  On the order of 10 pages.  This is just a written version of part P1g.  No, it is not acceptable to present a series of PowerPoint slides.  However, your written report is likely to include any graphs or examples from the presentation.  And a good deal of the report will be consumed by the list of evaluation queries and the top few results for all or some of the queries.

This part of the project is not about "how to write a really good project report".  Instead, it's about "putting together a decent summary of what you did". 

Your P1 grade will be created from P1h, with some input from P1g.  You must have completed all of the other parts, though, to get a passing grade on P1.


Stage Two - Advanced

The purpose of this stage of the project is to try out some more advanced Information Retrieval ideas, probably using the data and queries from the first stage.  Any project is fair game, provided it is related to the course material.

The Stage Two project may be done in collaboration with other students.  The more people who are involved in the project, the grander it should be, of course.

 P2a, proposal.  Write a 1-2 page proposal of what your Stage Two project will be.  You should indicate how it relates to class and make it clear how you can accomplish it in the time remaining in the semester.  If you are building on your Stage One project, please indicate how you will be doing that.  Examples of possible projects that might inspire ideas will appear later.

 P2b, presentation.  An in-class presentation of the results of your project on one of the last two days of class.  The length of the presentation will be dictated by the number of presentations in the class.

 P2c, report.  A written report that covers the material in the presentation in more detail.  It should include details of what you implemented, how experiments were designed and carried out, as well as evaluation results (if appropriate).  The report should run about 20 pages.