Homework two (HW2) 

See the course schedule for the due date.
This homework is worth 100 points, 20 points per question.

Errata/clarifications:

  1. (Oct 1, 11:40am) Problem A has been modified to indicate that there are 6 known relevant documents rather than 5.  The list of 6 remains unchanged.  (Previously the problem said there were 5 when there were 6 listed.)
  2. (Oct 1, 1:50pm) Problem B.1 asked that you calculate average precision but there is not sufficient information.  The problem has been modified, asking that you show how you would calculate average precision.
  3. (Oct 2, 10am) For Problem D, you should include every character in your calculations (with the exeption of the newline/space issue).  In particular, you should not be stripping punctuation as you did in Problem C.
  4. (Oct 2, 10am) For Problem E, the meaning of the table columns is provided now.

Problem A.  Evaluation "busy work"

Suppose a retrieval system ranks a set of 50 documents and the 6 known relevant documents appear at the following ranks:

1, 2, 5, 10, 22, 42

First plot an exact recall/precision graph (recall on the X axis) and then overlay it with a graph where the precision values are interpolated to the standard 11 points. 

Then calculate the following evaluation measures for that ranked list or indicate that there is not sufficient information to calculate it:

  1. Precision at rank 10
  2. Precision when recall is 50%
  3. Uninterpolated average precision
  4. 11-point interpolated average precision
  5. Precision when recall is 25%
  6. Uninterpolated average F1

Problem B. Incomplete evaluation

  1. Suppose that a system outputs a ranked list of 100 documents from a collection of 1000 documents, and that you have judgments that tell you which of those are relevant.  That means that you can calculate precision at 100 and so on.  However, suppose that there are two additional relevant documents that did not appear in the top 100.  You can still calculate precision at 100, but you do not know where in a longer ranked list those additional two documents would have appeared so you cannot calculate average precision (for example).  Here are three possible ways to address that problem.  Show how you would calculate average precision using each approach and then argue which is best (in your opinion).

    1. Assume the relevant documents would have been ranked immediately after the top 100.
    2. Assume the documents would have appeared at the very end of the list. 
    3. Assume the documents appeared at rank "infinity" (i.e., nowhere in the ranked list).  Note that in this case you cannot pretend they do not exist!
  2. In the pooling approach to relevance judging, the top K documents (e.g., 100) from a variety of systems are pooled and documents in that pool are judged.  Any documents outside the pool are assumed to be non-relevant.  Obviously that assumption is invalid: there will be some documents that are relevant but that no participating system happened to find. 

    1. When we assume that unjudged documents are non-relevant, are we under-estimating or over-estimating average precision?  If it depends on the situation, what does it depend upon? Justify your answer.
    2. Suppose that after the top K documents were all judged, a single new relevant document was identified.  That document was obviously not in the top K for any participating run so will not affect "precision at J documents" unless J is greater than K.  Characterize the impact that document's judgment will have on average precision.  If it's useful, you may assume that K=100 and that the collection is very large.

Problem C. Real world Zipf

Download the text of On the Origin of Species (6th Edition) from Project Guttenberg (http://www.gutenberg.org/wiki/Main_Page). Be sure you get etext number 2009 and not one of the other versions.

Write a program or script that tokenizes the file into "words." A word is any string of letters and/or numbers that is separated by sequences of white space or punctuation marks (the punctuation marks are then discarded). The exceptions are that apostrophes and hyphens should be deleted and treated as if they weren't there. For example,

I didn't say, "my mother-in-law is visiting." Moses' 10,000 relatives/friends did.

would result in the tokens:

I didnt say my motherinlaw is visiting Moses 10 000 relatives friends did

Calculate the total number of tokens in the file. Convert all tokens to lower case, count the frequency of each unique token, and print out the following information for the 50 most frequent tokens:

  1. The token string itself
  2. The rank of this token, where the most common token is rank 1, the next is rank 2, and so on. Pretend there are no ties if there are (i.e., you'll have the numbers 1 to 50 here).
  3. The number of times the token appeared
  4. That number (#3) divided by the total number of tokens: i.e., P(token)
  5. That number (#4) times the rank (#2)

You may write this program or script in any language or using any tool that suits you. Do not hand in the program, just the 50 tokens' worth of information.

For extra credit, generate that information for at least the top 1000 terms and plot a graph with the rank of the term on the X axis and its frequency (or probability) on the Y axis.  Then overlay that with the expected Zipf distribution.

Problem D. Entropy

Modify your program/script from Problem C to count the frequencies of characters in the text (don't forget spaces; treat a line break as if it were a space).  (Count every character; do not omit punctuation like you did in Problem C.) How many unique characters did you find, and what is the total number of character occurrences?  (You ought to be able to confirm the last with wc or some similar utility.)  As a sanity check, list how many times the vowels occurred in the text.

What is the expected number of bits needed to encode a character in that text (i.e., the character-level entropy in the text)?

Problem E. Weighting "busy work"

The following table lists some statistics for several words.

 

tf

|D|

cf

df

avg|D| N

word1

 10

 100

10000 

 500

300 90000

word2

 1

 100

 10000

 500

300 90000

word3

 10

 1000

 10000

 500

300 90000

word4

 5

 100

 10000

 500

300 90000

word5

5

 100

10000

 10

300 90000
word6 5 100 10000 1000 300 90000
word7 5 100 500 500 300 90000
word8 5 100 2000 500 300 90000
word9 5 100 10000 500 300 90000

where "tf" is the number of times the word occurs in the document, "|D|" is the length of the document, "cf" is the number of times the word occurs in the collection, "avg|D|" is the average length of documents in the collection, and "N" is the total number of words in the collection.

Using those statistics, calculate the following for each word:

  1. Okapi tf
  2. IDF - use the simple log(N/n) version for this exercise 
  3. Okapi tf times IDF
  4. Maximum likelihood estimate of P(w|D)
  5. Jelinek-Mercer estimate of P(w|D) with lambda=0.7
  6. Jelinek-Mercer estimate of P(w|D) with lambda=0.3
  7. Dirichlet estimate of P(w|D) with mu=1000
  8. Dirichlet estimate of P(w|D) with mu=10

(You'll probably find this easiest using a spreadsheet or some approximation of one.  Trying to do all of these by hand sounds like a very bad idea.)

Now that you've got those, what do you notice about the results?  In particular, what happens to the different approaches as the statistics shift?  Feel free to try other variations just for comparison.