| Date | Place | Author | Keyword |
| 1996 | SIGIR | Singhal, Buckley, Mitra | document length |
Summary
The paper proposes a collection-independent document length normalization scheme that narrows the gap between the probability of relevance and the probability of retrieval across document lengths. Closing this gap improves retrieval performance.
Background
Document Length Normalization
- Document length normalization (DLN) is needed because longer documents have higher term frequencies (1) and more distinct terms (2).
- Common methods of DLN (numbers mark which issue each method addresses)
  - Cosine Normalization (1,2)
  - Maximum tf Normalization (1)
  - Byte Length Normalization (1,2)
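As a rough illustration of the three factors listed above, the following sketch computes each one for a toy document. The function names and sample texts are hypothetical, raw term counts stand in for real term weights, and the `0.5 + 0.5 * tf / max_tf` form is one common variant of maximum tf normalization, assumed here rather than taken from these notes.

```python
import math
from collections import Counter


def cosine_norm(weights):
    """Cosine factor: Euclidean length of the term-weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))


def max_tf_weights(tf):
    """Maximum tf normalization (assumed variant): scale each tf by the largest tf."""
    max_tf = max(tf.values())
    return {term: 0.5 + 0.5 * count / max_tf for term, count in tf.items()}


def byte_length(text):
    """Byte length factor: size of the document in bytes."""
    return len(text.encode("utf-8"))


# Toy documents (hypothetical); the long one repeats the short one four times
# and then adds three new terms.
short_doc = "pivot slope pivot"
long_doc = " ".join(["pivot slope pivot"] * 4 + ["cosine byte length"])

short_tf = Counter(short_doc.split())
long_tf = Counter(long_doc.split())

# Higher term frequencies (1) and more distinct terms (2) both increase the
# cosine factor, so the long document is divided by a larger number.
print(cosine_norm(short_tf), cosine_norm(long_tf))
print(max_tf_weights(short_tf))
print(byte_length(short_doc), byte_length(long_doc))
```

Maximum tf normalization only rescales the frequencies, which is why the list credits it with addressing issue (1) but not issue (2).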
Contribution
Pivoted Normalization Scheme
- Problem: Cosine normalization tends to favor short documents and to penalize long documents excessively, as confirmed by tests on a wide variety of collections.
- Pivoted Normalization Scheme (+9-12% in avg. precision)
  - Correct the skew of the existing normalization function by tilting it around a pivot point with a slope parameter
  - Use the average of the old normalization factor as the pivot
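A minimal sketch of the correction, assuming the usual pivoted form `(1 - slope) * pivot + slope * old_norm`. The old normalization values and the slope of 0.7 are made-up illustration values, not figures from the paper.

```python
import math


def pivoted_norm(old_norm, pivot, slope):
    """Tilt the old normalization factor around the pivot point."""
    return (1.0 - slope) * pivot + slope * old_norm


# Hypothetical old (e.g. cosine) normalization factors for four documents.
old_norms = [2.0, 4.0, 6.0, 12.0]

# The pivot is the average old normalization factor.
pivot = sum(old_norms) / len(old_norms)

slope = 0.7  # illustrative value only; a slope below 1 flattens the function

new_norms = [pivoted_norm(n, pivot, slope) for n in old_norms]

for old, new in zip(old_norms, new_norms):
    print(f"old={old:5.1f}  pivoted={new:5.1f}")
```

With a slope below 1, documents whose old factor is under the pivot receive a larger factor (they are retrieved less readily), and documents above the pivot receive a smaller one. A document exactly at the pivot is unchanged.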
Pivoted Unique Normalization
- Problem: With cosine normalization, the retrieval probability of very long documents is greater than their probability of relevance. The pivoted normalization scheme makes this worse.
  - The probability of a match between a query and a document increases linearly with the number of distinct terms in the document, and normalization should account for this.
  - The cosine factor grows like (#unique terms)^0.6, which is weaker than linear growth in (#unique terms).
- Pivoted Unique Normalization (+6% avg. precision over Pivoted Cosine)
  - Use (#unique terms) instead of the cosine function (2)
  - Use average-term-frequency-based normalization instead of maximum-term-frequency-based normalization (1)
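The two substitutions above can be combined into a single term weight. The sketch below assumes the form `(1 + log tf) / (1 + log avg_tf)` divided by the pivoted number of unique terms. The function name, the token list, and the pivot and slope values are hypothetical.

```python
import math
from collections import Counter


def pivoted_unique_weights(tokens, pivot, slope):
    """Term weights using average-tf scaling and a pivoted unique-term factor."""
    tf = Counter(tokens)
    unique_terms = len(tf)
    avg_tf = sum(tf.values()) / unique_terms

    # Issue (2): normalize by the number of unique terms, pivoted.
    norm = (1.0 - slope) * pivot + slope * unique_terms

    # Issue (1): dampen tf and scale it by the document's average tf.
    return {
        term: ((1.0 + math.log(count)) / (1.0 + math.log(avg_tf))) / norm
        for term, count in tf.items()
    }


tokens = ["a", "a", "b"]  # toy document
weights = pivoted_unique_weights(tokens, pivot=2.0, slope=0.2)
print(weights)
```

Because the factor is linear in the number of unique terms, it grows faster than the cosine factor's (#unique terms)^0.6 for documents with many distinct terms.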
Pivoted Byte-size Normalization (+15% in avg. precision)
- Since an OCR-read document can contain many noise-degraded versions of a single word, use the number of bytes instead of the number of unique terms for degraded (e.g. OCR) text.
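A small sketch of why byte size is the more stable length measure for degraded text. The two strings are invented examples of a clean document and an OCR-damaged copy, and the helper names are hypothetical. The exact functional form of the byte-size factor is not given in these notes, so only the raw counts are compared.

```python
def unique_terms(text):
    """Number of distinct whitespace-separated tokens."""
    return len(set(text.split()))


def byte_size(text):
    """Document size in bytes."""
    return len(text.encode("utf-8"))


# Invented example: the OCR copy misreads one character in two of the words.
clean = "normalization normalization normalization length"
noisy = "normalization norma1ization normalizat1on length"

# OCR noise turns one word into several spurious "unique" terms...
print(unique_terms(clean), unique_terms(noisy))

# ...while the byte size of the document stays the same.
print(byte_size(clean), byte_size(noisy))
```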
Comment
- The statistical distribution of the data can be a clue for judging the quality of the data we have.
- The ability to find patterns in data is fundamental and significant.