Project
The purpose of the project part of this course is to ground the presentations and discussions in something concrete. The project is broken into a large number of parts, but has two major stages. In the first stage you will be carrying out a typical experimental design and setup to evaluate IR effectiveness. In the second stage, you will extend that evaluation in a direction that is of interest to you -- and is approved by the professor.
Depending on what you choose as a project, you may need to run your final system on a CIIR computer. Some of the data we have available is licensed to be used only on the CIIR computers and cannot be distributed elsewhere. If you end up in that situation, we should be able to construct a small surrogate data set for use on your personal computer, and we may be able to let you use one of our clusters for running your final experiments on the larger data. (It will depend on whether there are competing needs.)
Stage One of the project is divided into several parts (P1a through P1h below). In all honesty, the reason for having several parts is to ensure that your fellow students (not you, of course!) start working on the project well in advance of its deadline. (It also means that some of the homeworks can tie into the project work.) For that reason, very little is required of you at the various part deadlines.
Stage One projects must be done individually unless you can provide a compelling reason for collaborating. (Stage Two projects may be collaborative.)
P1a (Stage 1, part a), define
task. The intent of the Stage One project is to run a variation
on a classic "document retrieval" task. Your job for this first part is to
decide on what the data will be and what sorts of queries will be asked of the
data. For example, if you were building a Web search engine, your data
might be HTML Web pages and the queries would be general interest. If,
though, you were building an image retrieval system, the data would be images
and the queries might be sample images that represent the class of images you
want.
It is probably easiest to limit yourself to tasks that can be run on
collections of documents we have handy. Here are some statistics about
collections that should be usable:
| Name | # docs | Size (MB) | # queries | Other information |
|---|---|---|---|---|
| AP news | 242,918 | 728 | more than 300 | newswire |
| WSJ news | 173,252 | 509 | more than 300 | |
| GALE news (various sources) | 93,143 | 617 | 60+ semi-structured queries | Documents are more heavily marked up than AP/WSJ |
| GALE English broadcast | 1,117 | 14 | 60+ | Output of a speech recognition system with very heavy markup. This is probably too small, so would have to be augmented. |
| GALE Chinese newswire | 57,211 | 2,000 | 60+ | Output of a statistical MT system with lots of markup |
| GALE Arabic newswire | 101,511 | 3,000 | 60+ | Output of a statistical MT system with lots of markup |
| GALE English blogs | 26,919 | | 60+ | blogs |
| Enron | 200,399 messages | 400 (compressed) | | Email collection gathered from Enron investigation. Information about the corpus is available here. |
| Gov2 | 22,794,000 | 427,000 | 150+ | Web pages from *.gov |
| Corel | 5,000 | | | Images and keyword labels |
| SIGIR 1978-2002 | 1,066 | 953 | none | IR research papers (in PDF?) |
Variations and subsets of the collections are certainly usable. Other
collections that are similar to those may also be used; these just happen to be
some that are readily available. In general, collections with 100,000 or
more documents are more interesting, though not always available. If you
have access to a comparably sized collection, please check with the professor to
see whether it will be acceptable. You are by no means limited to the
collections listed above or to collections that the IR Lab has on
hand.
While you are thinking about this part of the project, you may find
it useful to start thinking about the Information Retrieval system that you'll
use. Part P1c has some additional details on that decision.
Hand in a very brief (half page) description of the collection you want to use and the task that you envision on that collection.
P1b, get data.
For this part, your only task is to identify precisely which data you will be
using and to get it to a place where you can access it. If you're using
a large collection on the CIIR machines, this means you should have an account
set up and a pointer to the source of the collection. If you're using your
own computer, you should now have a copy of the data resident on your
computer. In either case, you should have a small subset (100 or so
documents) in a place where you can readily access it. (The reason for the
subset is to have something for doing trial runs of the remaining parts of the
project.)
Hand in a description of where the collection is, how big it is (number of documents and number of megabytes), and what you are using for a subset.
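One way to pick the trial subset is to sample document identifiers at random with a fixed seed, so the same 100 documents come back every time. The sketch below assumes the identifiers fit in memory; the function name `sample_subset` and the `DOC-...` identifiers are hypothetical and stand in for whatever listing your collection provides.

```python
import random

def sample_subset(doc_ids, size=100, seed=0):
    """Return a reproducible random sample of document identifiers."""
    ids = sorted(doc_ids)            # sort so the result does not depend on input order
    if len(ids) <= size:
        return ids                   # small collections are returned whole
    rng = random.Random(seed)        # fixed seed: the same subset on every run
    return sorted(rng.sample(ids, size))

# Hypothetical identifiers standing in for a real collection listing.
all_ids = [f"DOC-{i:06d}" for i in range(5000)]
subset = sample_subset(all_ids, size=100)
print(len(subset), "documents sampled, starting with", subset[:3])
```

Recording the seed alongside the subset description makes it easy to regenerate the subset later.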
P1c, clean
data. Convert the data into a format that is usable by the
Information Retrieval system that you'll be using. This may mean stripping
characters that are not properly handled (is the system eight-bit clean?),
ensuring that the HTML looks like compliant XML, adding XML tags to delimit
fields, inserting document boundary markers, and so on. What you need to
do will depend on the system that you are using, so you'll need to read the
documentation for that system to be pretty sure of getting the right format.
Here are a couple of plausible systems that you might use. All are open-source.
Hand in a description of what you did to clean the data, which search engine you're going to use and why it required that cleaning, and a sample document before and after cleaning. (If the document is long, please crop it so that you are showing some examples of interesting changes and not pages and pages of the same thing over and over again.)
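As a rough illustration of the kind of cleaning described above, the sketch below drops control characters, escapes characters that would break XML, and wraps each document in boundary markers. The `<DOC>`/`<DOCNO>`/`<TEXT>` layout is an assumption modeled on TREC-style collections; the format your chosen system actually expects may differ, so check its documentation. The function names and sample text are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Control characters other than tab, newline, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(raw):
    """Drop control characters and collapse runs of spaces and tabs."""
    text = _CONTROL.sub("", raw)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

def to_trec_doc(doc_id, raw):
    """Wrap one document in boundary markers (assumed TREC-style layout)."""
    body = escape(clean_text(raw))   # escapes &, <, and > so the text is XML-safe
    return (
        "<DOC>\n"
        f"<DOCNO>{escape(doc_id)}</DOCNO>\n"
        f"<TEXT>\n{body}\n</TEXT>\n"
        "</DOC>\n"
    )

# A made-up document shown before and after cleaning.
before = "Profits at AT&T rose\x00  5%  \t<b>today</b>"
after = to_trec_doc("SAMPLE-0001", before)
print(repr(before))
print(after)
```

Whether embedded markup should be escaped, as here, or stripped or kept as fields depends on the task and on the system, which is part of what the hand-in asks you to explain.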
P1d, index
data. Use your chosen retrieval system to index the collection
and then try running a query of some sort to demonstrate that it actually
works. Ensure that the top-ranked documents look reasonable based on that
query.
Hand in some sort of output that shows you indexed the collection and that
illustrates the size of the indexed data. This might be the output
generated by the indexing component, or it might be a log file or parameter file
that describes the result of the process. You should also hand in the
result of running your toy query and the titles of the top handful of documents
(or enough to show that it did something reasonable).
Note that this is
really just a sanity check to ensure that things are working and that you will
be able to do the subsequent parts of the project. You might look at part
P1e for some more detailed discussion of the queries, but that is not
critical.
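To make the sanity check concrete, here is a toy in-memory inverted index with a simple tf-idf ranking. It is only an illustration of what "index, report the size, run a query, look at the top documents" amounts to, not a substitute for the retrieval system you chose; the three documents, the scoring formula, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to a dictionary of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, num_docs, query, k=5):
    """Rank documents by a simple tf-idf sum over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue                 # query term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Made-up documents for the trial run.
docs = {
    "d1": "Stock markets fell sharply on Monday",
    "d2": "The senate passed the budget bill",
    "d3": "Markets rallied after the budget vote",
}
index = build_index(docs)
print("documents:", len(docs), "unique terms:", len(index))
results = search(index, len(docs), "budget markets")
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}  {docs[doc_id]}")
```

The document that matches both query terms should come out on top, which is the kind of "looks reasonable" judgment this part asks for.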
P1e, run
queries. Create queries in line with the task you defined in
P1a. Run the queries and show the output.
You should create a
minimum of 10 queries that are in line with the task that you have
defined. What the queries actually look like will depend on your
task. They might be complex Boolean expressions of an information need,
they might be lengthy full-sentence statements of what is interesting, or they
might be just a few important key words. Whatever the queries look like,
your system will interpret the queries in the light of the task that you've
defined. For example, if you had built a system that retrieved only
documents about "shoes" then the query "hush puppy" would retrieve documents
about that brand of shoe and not a recipe for deep fried cornmeal
bread.
You will ultimately want to generate output that shoes...er, shows
the top 100 or more documents in ranked order. For purposes of this part,
though, show the top 5-10 retrieved documents for each query. If your
collection has titles, just list the titles; if it does not, use something that
is comparable. Please do not dump the full text of 10 documents if you can
reasonably avoid it.
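A small sketch of both kinds of output follows: a full ranked list written in the six-column run format that trec_eval reads (query id, the literal `Q0`, document id, rank, score, run tag), and a short list of titles for the hand-in. The ranked list, the titles, the run tag `myrun`, and the function names are all hypothetical.

```python
def format_run(query_id, ranked, run_tag="myrun", depth=100):
    """Return lines of the form: query_id Q0 doc_id rank score run_tag."""
    lines = []
    for rank, (doc_id, score) in enumerate(ranked[:depth], start=1):
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines

def top_titles(ranked, titles, k=5):
    """Return titles of the top k documents, falling back to the document id."""
    return [titles.get(doc_id, doc_id) for doc_id, _ in ranked[:k]]

# Made-up ranked list of (doc_id, score) pairs and a title lookup.
ranked = [("d3", 2.5), ("d1", 1.0), ("d7", 0.25)]
titles = {"d3": "Markets rally after budget vote", "d1": "Stocks fall on Monday"}

run_lines = format_run("q1", ranked)
shown = top_titles(ranked, titles, k=2)
print("\n".join(run_lines))
print(shown)
```

Keeping the full run in a file and printing only the titles separates what you need later for evaluation from what you hand in for this part.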
P1f,
evaluation. Judge some documents (or whatever is needed) and
evaluate the results of your runs. Decide how well your system addresses
the task at hand and justify your evaluation based on that.
For each of
your 10 (or more) queries, judge some of the documents. It may be
sufficient to look at precision in the top ranked documents, in which case you
could get by looking at, say, the top 10 per query. If you want to do a
deeper analysis, you may have to do some exploring to find additional relevant
documents. For example, suppose that you wanted to calculate (or estimate)
MAP. In that case you need to do a reasonable job of trying to find all
relevant documents so that you can estimate recall. (If you're feeling
bold, you can try to implement the "Minimal test collection" approach of
Carterette et al.)
Given ranked lists and judgments, calculate evaluation
measures for each query and then average them appropriately. Present the
individual scores as well as the overall average. You may use a freely
available tool such as trec_eval if that appeals to you, or you may write your
own evaluation code.
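If you write your own evaluation code, the calculation might look something like the sketch below. It assumes each query's ranked list is a Python list of document identifiers and its judgments are a set of relevant identifiers; the names `runs` and `qrels` and the data in them are made up for illustration. Average precision here divides by the number of judged relevant documents, which is why finding as many relevant documents as possible matters.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant (divides by k)."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)     # unretrieved relevant documents count as zero

# Made-up ranked lists and relevance judgments for two queries.
runs = {"q1": ["d3", "d1", "d2", "d5"], "q2": ["d4", "d2", "d9"]}
qrels = {"q1": {"d3", "d2"}, "q2": {"d2", "d7"}}

per_query = {q: average_precision(runs[q], qrels[q]) for q in runs}
mean_ap = sum(per_query.values()) / len(per_query)

for q in sorted(per_query):
    print(q, f"AP={per_query[q]:.4f}", f"P@2={precision_at_k(runs[q], qrels[q], k=2):.2f}")
print(f"MAP={mean_ap:.4f}")
```

Printing the per-query values next to the mean matches the request to present individual scores as well as the overall average.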
Also provide a justification for why the evaluation
you chose is the right one for this task. For example, suppose that your
task was a "known item" retrieval task--i.e., there is exactly one correct
answer to any query. In that case, MRR is probably reasonable, but F1 is
probably not.
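For the known-item case, a minimal sketch of MRR is shown below. It assumes exactly one correct document per query; the ranked lists, the `answers` mapping, and the function names are hypothetical. The measure rewards placing the single correct answer near the top, which a set-based measure such as F1 does not capture.

```python
def reciprocal_rank(ranked, target):
    """Return 1/rank of the target document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id == target:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs, answers):
    """Average the reciprocal rank over every query that has a known answer."""
    total = sum(reciprocal_rank(runs.get(q, []), answers[q]) for q in answers)
    return total / len(answers)

# Made-up ranked lists and the single correct document for each query.
runs = {"q1": ["d3", "d1"], "q2": ["d7", "d2"], "q3": ["d4"]}
answers = {"q1": "d1", "q2": "d7", "q3": "d9"}

mrr = mean_reciprocal_rank(runs, answers)
print(f"MRR={mrr:.3f}")
```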
P1g, presentation. An in-class
presentation of the results of the experiments you carried out. You will
have 7-8 minutes to present what you did, discuss how you did it, give an
overview of the evaluation, and draw conclusions about what did or did not
work. More specifically, your presentation should include information
about the task:
Come to class prepared to give this quick presentation. If you do not have a laptop, please contact the instructor to arrange to use his laptop for the presentation.
P1h, written
report. A written report detailing the results of the
project. On the order of 10 pages. This is just a written version of
part P1g. No, it is not acceptable to present a series of PowerPoint
slides. However, your written report is likely to include any graphs or
examples from the presentation. And a good deal of the report will be
consumed by the list of evaluation queries and the top few results for all or
some of the queries.
This part of the project is not about "how to
write a really good project report". Instead, it's about "putting together
a decent summary of what you did".
Your P1 grade will be based
on P1h, with some input from P1g. You must have completed all of the
other parts, though, to get a passing grade on P1.
The purpose of Stage Two of the project is to try out some more advanced Information Retrieval ideas, probably using the data and queries from Stage One. Any project is fair game, provided it is related to the course material.
The Stage Two project may be done in collaboration with other students. The more people who are involved in the project, the grander it should be, of course.
P2a, proposal. Write a 1-2 page
proposal of what your Stage Two project will be. You should indicate how
it relates to class and make it clear how you can accomplish it in the time
remaining in the semester. If you are building on your Stage One project,
please indicate how you will be doing that. Examples of possible projects
that might inspire ideas will appear later.
P2b, presentation. An in-class
presentation of the results of your project on one of the last two days
of class. The length of the presentation will be dictated by the
number of presentations in the class.
P2c, report. A written report
that covers the material in the presentation in more detail. It should
include details of what you implemented, how experiments were designed and
carried out, as well as evaluation results (if appropriate). The report
should run about 20 pages.