PivotedDocumentLengthNormalization (r1.2 - 27 Sep 2007 - 15:03 - JinyoungKim)

Date: 1996
Place: SIGIR
Author: Singhal, Buckley, Mitra
Keyword: document length

Summary

A collection-independent document length normalization scheme that reduces the gap between the probability of relevance and the probability of retrieval across document lengths, which improves retrieval performance.

Background

Document Length Normalization

  • DLN is needed because long documents have (1) higher term frequencies and (2) more terms.

  • Common DLN methods (numbers in parentheses show which of the two issues each one addresses)
    • Cosine Normalization (1,2)
    • Maximum tf Normalization (1)
    • Byte Length Normalization (1,2)
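
A minimal sketch of the three normalization factors listed above, computed for one toy document. The function names and the sample text are made up for illustration, and the max-tf variant shown (0.5 + 0.5 * tf / max tf) is one common form of "augmented" tf, assumed here rather than taken from these notes.

```python
import math
from collections import Counter

def cosine_norm(weights):
    """Cosine normalization factor: Euclidean length of the term-weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))

def max_tf_weights(tf):
    """Maximum tf normalization (assumed variant): 0.5 + 0.5 * tf / max_tf."""
    max_tf = max(tf.values())
    return {term: 0.5 + 0.5 * freq / max_tf for term, freq in tf.items()}

def byte_length(text):
    """Byte length normalization factor: size of the document in bytes."""
    return len(text.encode("utf-8"))

# Hypothetical toy document; raw term frequencies are used as weights.
doc = "the cat sat on the mat the end"
tf = Counter(doc.split())

print(cosine_norm(tf))        # grows with both higher tf and more terms
print(max_tf_weights(tf))     # only compensates for higher tf
print(byte_length(doc))       # grows with both higher tf and more terms
```

Note that the max-tf weights stay within [0.5, 1.0] no matter how many distinct terms the document has, which is why that method only addresses issue (1).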

Contribution

Pivoted Normalization Scheme

  • Problem : Cosine normalization tends to favor short documents and to penalize longer documents excessively. This is confirmed by test results over a wide variety of collections.

  • Pivoted Normalization Scheme (+9-12% in avg. precision)
    • Correct the skew of the existing normalization function by tilting it around a pivot point
    • Use the average old normalization factor as the pivot
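
The pivoted factor in the paper has the form (1 - slope) * pivot + slope * (old normalization). Below is a small sketch of that formula with the pivot set to the average old factor. The old factors and the slope value are invented for illustration; in practice the slope is a tuned parameter.

```python
def pivoted_norm(old_norm, pivot, slope):
    """Tilt the old normalization factor around the pivot.

    With slope < 1, factors below the pivot are raised and factors
    above the pivot are lowered; a factor equal to the pivot is unchanged.
    """
    return (1.0 - slope) * pivot + slope * old_norm

# Hypothetical old (e.g. cosine) normalization factors for four documents.
old_norms = [2.0, 4.0, 6.0, 12.0]
pivot = sum(old_norms) / len(old_norms)   # average old factor, here 6.0
slope = 0.75                              # illustrative value only

new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]
for old, new in zip(old_norms, new_norms):
    print(f"old={old:5.2f}  pivoted={new:5.2f}")
```

Documents with an old factor below the pivot (the short ones) end up with a larger divisor, so they are retrieved less readily, while documents above the pivot are penalized less than before.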

Pivoted Unique Normalization

  • Problem : With cosine normalization, the retrieval probability of very long documents is greater than their probability of relevance. Using the pivoted normalization scheme makes this worse.
    • The probability of a match between a query and a document increases linearly with the number of different terms in the document, which normalization should compensate for.
    • The cosine factor grows like (#unique terms)^0.6, which is weaker than linear growth in (#unique terms)

  • Pivoted Unique Normalization (+6% avg. precision over Pivoted Cosine)
    • Use (#unique terms) instead of the cosine function to address issue (2)
    • Use average term frequency based normalization instead of maximum term frequency based normalization to address issue (1)
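
A rough sketch of how the two choices above combine into a term weight: the tf part is (1 + log tf) / (1 + log(average tf)), and the divisor is the pivoted number of unique terms. The helper name, the toy documents and the slope value are assumptions for illustration, and the pivot is taken as the average number of unique terms over this tiny collection.

```python
import math
from collections import Counter

def pivoted_unique_weights(tf, pivot, slope):
    """Term weights for one document under pivoted unique normalization."""
    avg_tf = sum(tf.values()) / len(tf)                 # average tf in this document
    denom = (1.0 - slope) * pivot + slope * len(tf)     # pivoted #unique terms
    return {
        term: ((1.0 + math.log(freq)) / (1.0 + math.log(avg_tf))) / denom
        for term, freq in tf.items()
    }

# Hypothetical two-document collection.
docs = ["a b a c", "a b c d e f a a"]
tfs = [Counter(d.split()) for d in docs]
pivot = sum(len(tf) for tf in tfs) / len(tfs)           # average #unique terms
weights = [pivoted_unique_weights(tf, pivot, slope=0.2) for tf in tfs]

for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 4) for t, v in w.items()})
```

In this toy case both documents have the same average tf, so the term "b" (tf = 1 in each) gets a lower weight in the second document purely because that document has more unique terms.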

Pivoted Byte-size Normalization (+15% in avg. precision)

  • Since an OCR-read document can contain many noise-degraded versions of a single word, its number of unique terms is inflated. For degraded (e.g. OCR) text, use the number of bytes instead of the number of unique terms.
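
A toy illustration of why byte size is the more stable length measure for degraded text. The pivoted byte-size factor below simply plugs the byte count into the same pivot formula; that is an assumption made for this sketch, not necessarily the exact formula of the paper. The texts, pivot and slope are invented.

```python
def pivoted_byte_norm(text, pivot_bytes, slope):
    """Pivoted normalization factor based on document size in bytes (assumed form)."""
    n_bytes = len(text.encode("utf-8"))
    return (1.0 - slope) * pivot_bytes + slope * n_bytes

# The same three-word text, clean and with simulated OCR character errors.
clean = "normalization normalization normalization"
noisy = "normalization norma1ization normalizat1on"

unique_clean = len(set(clean.split()))
unique_noisy = len(set(noisy.split()))
print(unique_clean, unique_noisy)   # OCR noise inflates the unique-term count

# The byte-based factor is identical for both versions of the text.
print(pivoted_byte_norm(clean, pivot_bytes=40.0, slope=0.3))
print(pivoted_byte_norm(noisy, pivot_bytes=40.0, slope=0.3))
```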

Comment

  • The statistical distribution of the data can be a clue for judging its quality.
  • The ability to find patterns in data is fundamental and significant.

