Homework two
(HW2)
See the course schedule for
the due date.
This homework is worth 100 points, 20 points per
question.
Suppose a retrieval system ranks a set of 50 documents and the 6 known relevant documents appear at the following ranks:
1, 2, 5, 10, 22, 42
First plot an exact recall/precision graph (recall on the X axis) and
then overlay it with a graph where the precision values are interpolated to the
standard 11 points.
Then calculate the following evaluation
measures for that ranked list or indicate that there is not sufficient
information to calculate it:
Download the text of On the Origin of Species (6th Edition) from Project Guttenberg (http://www.gutenberg.org/wiki/Main_Page). Be sure you get etext number 2009 and not one of the other versions.
Write a program or script that tokenizes the file into "words." A word is any string of letters and/or numbers that is separated by sequences of white space or punctuation marks (the punctuation marks are then discarded). The exceptions are that apostrophes and hyphens should be deleted and treated as if they weren't there. For example,
I didn't say, "my mother-in-law is visiting." Moses' 10,000 relatives/friends did.
would result in the tokens:
I didnt say my motherinlaw is visiting Moses 10 000 relatives friends did
Calculate the total number of tokens in the file. Convert all tokens to lower case, count the frequency of each unique token, and print out the following information for the 50 most frequent tokens:
You may write this program or script in any language or using any tool that suits you. Do not hand in the program, just the 50 tokens' worth of information.
For extra credit, generate that information for at least the
top 1000 terms and plot a graph with the rank of the term on the X axis
and its frequency (or probability) on the Y axis. Then overlay that
with the expected Zipf distribution.
Modify your program/script from Problem C to count the frequencies of
characters in the text (don't forget spaces; treat a line break as if
it were a space). (Count every character; do not omit punctuation like you
did in Problem C.) How many unique characters did you find, and what is the
total number of character occurrences? (You ought to be able to confirm
the last with wc or some similar utility.) As a sanity check,
list how many times the vowels occurred in the text.
What is the expected number of bits needed to encode a character in that text
(i.e., the character-level entropy in the text)?
The following table lists some statistics for several words.
|
|
tf |
|D| |
cf |
df |
avg|D| | N |
|
word1 |
10 |
100 |
10000 |
500 |
300 | 90000 |
|
word2 |
1 |
100 |
10000 |
500 |
300 | 90000 |
|
word3 |
10 |
1000 |
10000 |
500 |
300 | 90000 |
|
word4 |
5 |
100 |
10000 |
500 |
300 | 90000 |
|
word5 |
5 |
100 |
10000 |
10 |
300 | 90000 |
| word6 | 5 | 100 | 10000 | 1000 | 300 | 90000 |
| word7 | 5 | 100 | 500 | 500 | 300 | 90000 |
| word8 | 5 | 100 | 2000 | 500 | 300 | 90000 |
| word9 | 5 | 100 | 10000 | 500 | 300 | 90000 |
(You'll probably find this easiest using a spreadsheet or some approximation
of one. Trying to do all of these by hand sounds like a very bad
idea.)
Now that you've got those, what do you notice about the
results? In particular, what happens to the different approaches as the
statistics shift? Feel free to try other variations just for
comparison.