<<O>>  Difference Topic PivotedDocumentLengthNormalization (r1.2 - 27 Sep 2007 - JinyoungKim)

META TOPICPARENT Fall2007ReadingGroup
Date    Place   Author                    Keyword
1996    SIGIR   Singhal, Buckley, Mitra   document length

Summary

A collection-independent document length normalization scheme that reduces the gap between the probability of relevance and the probability of retrieval across document lengths, which improves retrieval performance.
 <<O>>  Difference Topic PivotedDocumentLengthNormalization (r1.1 - 19 Sep 2007 - JinyoungKim)

Background

Document Length Normalization

  • Document length normalization (DLN) is needed because long documents have (1) higher term frequencies and (2) more terms.

  • Common DLN methods (numbers in parentheses are the issues each one addresses)
    • Cosine Normalization (1,2)
    • Maximum tf Normalization (1)
    • Byte Length Normalization (1,2)
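
The three factors listed above can be sketched as follows. This is a minimal illustration, not code from the paper: the function names are hypothetical, raw term frequencies stand in for term weights, the maximum-tf variant uses the augmented form 0.5 + 0.5 × tf / max tf, and UTF-8 is assumed for the byte count.

```python
import math
from collections import Counter


def cosine_norm(weights):
    """Cosine factor: Euclidean length of the document's term-weight vector."""
    return math.sqrt(sum(w * w for w in weights))


def max_tf_weight(tf, max_tf):
    """Maximum-tf normalization in the augmented form 0.5 + 0.5 * tf / max_tf."""
    return 0.5 + 0.5 * tf / max_tf


def byte_length_norm(text):
    """Byte length factor: size of the document in bytes (UTF-8 assumed here)."""
    return len(text.encode("utf-8"))


# Toy document; raw term frequencies stand in for term weights (an assumption).
doc = "pivot pivot slope norm"
tf = Counter(doc.split())

cosine = cosine_norm(tf.values())  # sqrt(2^2 + 1^2 + 1^2)
augmented = {t: max_tf_weight(f, max(tf.values())) for t, f in tf.items()}
size = byte_length_norm(doc)
```

Cosine and byte length grow with both the term frequencies and the number of terms, while the maximum-tf form only rescales term frequencies within a document, which matches the (1,2) versus (1) labels in the list.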

Contribution

Pivoted Normalization Scheme

  • Problem: Cosine normalization tends to favor short documents and to penalize long documents excessively, relative to their probability of relevance. Tests on a wide variety of collections confirm this.

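  • Pivoted Normalization Scheme (+9-12% in avg. precision)
    • Correct the skew of the existing normalization function by tilting it around a pivot point: pivoted norm = (1 - slope) × pivot + slope × old norm
    • Use the average old normalization factor over the collection as the pivot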

Pivoted Unique Normalization

  • Problem: For very long documents, the retrieval probability under cosine normalization is greater than their probability of relevance. The pivoted normalization scheme makes this worse, since it reduces the penalty on long documents.
    • The probability of a match between a query and a document increases linearly with the number of different terms in the document, and normalization should compensate for this.
    • The cosine factor varies like (#unique terms)^0.6, which grows more slowly than (#unique terms) itself.

  • Pivoted Unique Normalization (+6% avg. precision over Pivoted Cosine)
    • Use (#unique terms) instead of the cosine function as the normalization factor (addresses issue 2)
    • Use average term frequency based normalization instead of maximum term frequency based normalization (addresses issue 1)
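
The two choices above can be combined into a term weight along these lines. This is a sketch under stated assumptions: the tf part is taken as (1 + log tf) / (1 + log avg tf), the denominator is the pivoted factor applied to the number of unique terms, and the function names and parameter values are hypothetical.

```python
import math
from collections import Counter


def pivoted_unique_weights(tokens, pivot, slope):
    """Term weights with average-tf scaling and pivoted unique-term normalization."""
    tf = Counter(tokens)
    if not tf:
        return {}
    n_unique = len(tf)
    avg_tf = sum(tf.values()) / n_unique
    # Pivoted factor applied to the number of unique terms (issue 2).
    norm = (1.0 - slope) * pivot + slope * n_unique
    # Average-tf based scaling of each term frequency (issue 1).
    return {
        term: (1.0 + math.log(f)) / (1.0 + math.log(avg_tf)) / norm
        for term, f in tf.items()
    }


def cosine_like_growth(n_unique):
    """Approximate growth of the cosine factor reported in the notes."""
    return n_unique ** 0.6


weights = pivoted_unique_weights(["a", "a", "b"], pivot=2.0, slope=0.5)
```

Comparing `cosine_like_growth(n)` with `n` for growing `n` shows why the cosine factor under-penalizes documents with many unique terms: it grows sublinearly, while the match probability is described as growing linearly.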

Pivoted Byte-size Normalization (+15% in avg. precision)

  • OCR-read documents can contain many noise-degraded versions of a single word, which inflates the number of unique terms. For degraded (e.g. OCR) text, use the number of bytes instead of the number of unique terms.
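
The motivation can be illustrated with a toy example. The noisy string below is invented to mimic OCR errors. Using the raw UTF-8 byte count inside the pivoted form is an assumption of this sketch, as are the function names.

```python
def unique_terms(text):
    """Number of distinct whitespace-separated tokens."""
    return len(set(text.split()))


def byte_size(text):
    """Document size in bytes (UTF-8 assumed)."""
    return len(text.encode("utf-8"))


def pivoted_byte_norm(text, pivot, slope):
    """Pivoted factor with byte size in place of the unique-term count."""
    return (1.0 - slope) * pivot + slope * byte_size(text)


# The same word four times, clean and with made-up OCR-style corruptions.
clean = "normalization normalization normalization normalization"
noisy = "normalization norma1ization normalizatlon nonnalization"

clean_stats = (unique_terms(clean), byte_size(clean))
noisy_stats = (unique_terms(noisy), byte_size(noisy))
```

In this example the corrupted text has four times as many unique terms as the clean text but the same byte size, so a byte-based factor is not inflated by the noise.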

Comment

  • The statistical distribution of the data can be a clue for judging the quality of the data we have.
  • The ability to find patterns in data is fundamental and significant.

Reference

Revision r1.1 - 19 Sep 2007 - 17:36 - JinyoungKim
Revision r1.2 - 27 Sep 2007 - 15:03 - JinyoungKim