Center for Intelligent Information Retrieval

Stemming and Cooccurrence on a Larger Corpus
Jeremy Pickens

Introduction

This web page details an extension of the work on stemming and corpus-based cooccurrence analysis first explored by Xu and Croft in the following paper:

Xu, Jinxi and Croft, W. B., "Corpus-Based Stemming Using Co-occurrence of Word Variants," ACM Transactions on Information Systems, vol. 16, no. 1, Jan. 1998, pp. 61-81. Also available as Computer Science Technical Report TR96-67 (1996).

The basic idea behind this work is that we can use cooccurrence analysis of word variants within a particular corpus to ascertain which variants belong together and which do not. A stemmer, such as Porter or KStem, creates the initial word variant (stem) classes.

For example, the Porter stemmer makes the following seemingly good conflations:

  • bank banked banking bankings banks
  • ocean oceaneering oceanic oceanics oceanization oceans

However, the Porter stemmer also makes the following bad conflations:

  • polic polical polically police policeable policed policement policer policers polices policial policically policier policiers policies policing policization policize policly policy policying policys
  • pun punative [sic] punned punning puns

Obviously, "policy"/"police" should not belong together, and neither should "pun"/"punative". (Unless we're talking about punative damages caused by a bad pun.) Corpus-based analysis of these word variants should tell us this, since policy and police probably occur individually in the entire corpus many more times than they occur together within the same window within that corpus. Therefore, using an Expected Mutual Information Measure (EMIM), we can conclude that they probably aren't related. We use this information to tame the initial "policy"/"police" stem class, creating two new classes:

  • policies policy
  • police policed policing
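
As a toy illustration of the scoring step (not necessarily the exact formula from the paper), an EMIM-style association score can be computed from window counts; all counts below are hypothetical:

```python
from math import log

def emim(n_ab, n_a, n_b, n_windows):
    """EMIM-style association score between two word variants.
    n_ab: windows containing both words; n_a, n_b: windows containing
    each word; n_windows: total text windows in the corpus."""
    if n_ab == 0:
        return 0.0
    p_ab = n_ab / n_windows
    p_a = n_a / n_windows
    p_b = n_b / n_windows
    # Positive when the pair co-occurs more often than chance predicts,
    # negative when it co-occurs less often.
    return p_ab * log(p_ab / (p_a * p_b))

# Hypothetical counts: "policy" and "police" are each common, but they
# rarely share a 100-word window, so the score is not positive.
print(emim(n_ab=5, n_a=4000, n_b=3500, n_windows=100_000))
```

A pair whose score falls below the chosen threshold is treated as unrelated, even though a stemmer put the two variants in the same class.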

But what about "bank"/"banked"? When one talks about "bank", one usually means a financial institution, though one could also mean a river bank, or a bank shot, such as in billiards. When one speaks of "banked", it is less frequently about financial institutions and more frequently about some projectile changing direction. However, these are only my personal perceptions of these two words. It could be that, for a particular corpus, my predictions/perceptions are completely inaccurate.

Therefore, to get around this problem, we again can use corpus-based cooccurrence statistics. If "bank" and "banked" co-occur fairly frequently within our particular corpus, we can conclude that, at least in this corpus, they're probably related. In this corpus they do not, so "banked" is removed, leaving the following stem class:

  • bank banking banks
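
The taming step can be sketched as a graph problem: within an initial stem class, link two variants only when their cooccurrence score clears the threshold, then let the connected components become the new classes. Below is a minimal sketch with a hypothetical score function (the union-find structure is my implementation choice, not necessarily the paper's):

```python
def split_class(words, score, threshold):
    """Split one stem class into connected components, linking two
    variants only when their association score exceeds threshold."""
    parent = {w: w for w in words}

    def find(w):
        # Find the component root, with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if score(a, b) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical scores: policy-variants relate to each other, as do
# police-variants, but there are no cross links between the two sets.
pairs = {frozenset(p): 1.0 for p in
         [("policy", "policies"), ("police", "policed"),
          ("policed", "policing")]}
score = lambda a, b: pairs.get(frozenset((a, b)), 0.0)
print(split_class(["policy", "policies", "police", "policed", "policing"],
                  score, 0.01))
# → [['police', 'policed', 'policing'], ['policies', 'policy']]
```

With these hypothetical scores, the single bad "polic..." class falls apart into the two sensible classes shown above.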

Significance of this Work

Xu and Croft's initial experiments with corpus-based cooccurrence analysis were performed on collections up to 300 MB in size. While this produced good corpus-specific stemmers, we wanted to see if a larger corpus would tell us anything different. Perhaps with more statistics we could create a better stemmer, avoiding some of those conflations, such as "tum"/"tumor", that analysis on a smaller corpus could not handle. Or, with more statistics and a larger, heterogeneous set of documents, we could at least create a more generalized stemmer. For this most recent experiment, the entire 5.5 gigabyte TREC 1-5 collection was used to create the stem classes. The TREC 6 query set was used to test the stem classes.

Results

Three sets of experiments were done, using initial classes created by (1) the Porter stemmer, (2) KStem, and (3) the Porter stemmer classes merged, in a connected-component manner, with the KStem classes.
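
The merge in (3) can be sketched with the same connected-component idea: any word shared between a Porter class and a KStem class pulls the two classes into one merged class. The class lists below are hypothetical:

```python
def merge_classes(*class_lists):
    """Merge stemmers' stem classes as connected components: two words
    end up in the same merged class if any chain of shared class
    memberships connects them."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        # Find the component root, with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for classes in class_lists:
        for cls in classes:
            for w in cls[1:]:
                parent[find(cls[0])] = find(w)

    merged = {}
    for w in list(parent):
        merged.setdefault(find(w), set()).add(w)
    return sorted(sorted(c) for c in merged.values())

# Hypothetical Porter and KStem classes sharing the word "banking":
porter = [["bank", "banked", "banking"]]
kstem = [["banking", "bankings"], ["ocean", "oceans"]]
print(merge_classes(porter, kstem))
# → [['bank', 'banked', 'banking', 'bankings'], ['ocean', 'oceans']]
```

The merged classes are then tamed by the same threshold procedure as the single-stemmer classes.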

Furthermore, a number of different thresholds were used. After cooccurrence statistics had been gathered, a threshold served as the cutoff for a significant relationship: word pairs scoring above the threshold were conflated, and those below were separated. Previous experiments had shown that a threshold of 0.01 with a corpus cooccurrence window of 100 words worked best. This threshold, as well as other, less strict ones, was used. In the interest of space, we present only a few of the most interesting thresholds here.

Experiments were done on the TREC 6 query set.

(A quick word on the following statistics: for each stemmer [Porter, KStem, Porter+KStem], the first column is the unstemmed retrieval results on TREC 6, used as the baseline. The second column is the stricter 0.01 threshold. The third column is the looser 0.0001 threshold. Finally, the last column is the fully stemmed classes: the original classes created by the stemmer in question, untamed by corpus-based cooccurrence analysis.)

Porter Stemmer
 1. Unstemmed
 2. Threshold_0.01
 3. Threshold_0.0001
 4. Porter_Stemmed

Queryid (Num):       50       50       50       50
Total number of documents over all queries
    Retrieved:    50000    50000    50000    50000
    Relevant:      4611     4611     4611     4611
    Rel_ret:       2152     2497     2520     2518
Interpolated Recall - Precision Averages:
    at 0.00       0.6738   0.7296   0.7095   0.7097 
    at 0.10       0.4622   0.5236   0.5405   0.5388 
    at 0.20       0.3844   0.4334   0.4387   0.4382 
    at 0.30       0.2627   0.3245   0.3362   0.3351 
    at 0.40       0.1866   0.2734   0.2803   0.2792 
    at 0.50       0.1314   0.2014   0.2105   0.2100 
    at 0.60       0.0982   0.1468   0.1584   0.1572 
    at 0.70       0.0598   0.0916   0.0912   0.0909 
    at 0.80       0.0263   0.0526   0.0522   0.0522 
    at 0.90       0.0106   0.0265   0.0273   0.0273 
    at 1.00       0.0039   0.0066   0.0080   0.0080 
Average precision (non-interpolated) over all rel docs
                  0.1868   0.2332   0.2398   0.2392 
    % Change:               24.8     28.4     28.0 

KStem
 1. Unstemmed
 2. Threshold_0.01
 3. Threshold_0.0001
 4. K_Stemmed

Queryid (Num):       50       50       50       50
Total number of documents over all queries
    Retrieved:    50000    50000    50000    50000
    Relevant:      4611     4611     4611     4611
    Rel_ret:       2152     2500     2550     2549
Interpolated Recall - Precision Averages:
    at 0.00       0.6738   0.7442   0.7085   0.7085 
    at 0.10       0.4622   0.5247   0.5213   0.5211 
    at 0.20       0.3844   0.4164   0.4299   0.4299 
    at 0.30       0.2627   0.3107   0.3186   0.3175 
    at 0.40       0.1866   0.2676   0.2773   0.2763 
    at 0.50       0.1314   0.2065   0.2157   0.2131 
    at 0.60       0.0982   0.1454   0.1587   0.1585 
    at 0.70       0.0598   0.0940   0.0974   0.0972 
    at 0.80       0.0263   0.0621   0.0644   0.0644 
    at 0.90       0.0106   0.0233   0.0252   0.0252 
    at 1.00       0.0039   0.0062   0.0067   0.0067 
Average precision (non-interpolated) over all rel docs
                  0.1868   0.2297   0.2361   0.2356
    % Change:               23.0     26.4     26.1 

Porter + KStem
 1. Unstemmed
 2. Threshold_0.01
 3. Threshold_0.0001
 4. Porter+K_Stemmed

Queryid (Num):       50       50       50       50
Total number of documents over all queries
    Retrieved:    50000    50000    50000    50000
    Relevant:      4611     4611     4611     4611
    Rel_ret:       2152     2528     2544     2545
Interpolated Recall - Precision Averages:
    at 0.00       0.6738   0.7366   0.7028   0.7091 
    at 0.10       0.4622   0.5213   0.5339   0.5313 
    at 0.20       0.3844   0.4349   0.4392   0.4400 
    at 0.30       0.2627   0.3402   0.3512   0.3500 
    at 0.40       0.1866   0.2908   0.2999   0.2991 
    at 0.50       0.1314   0.2231   0.2299   0.2289 
    at 0.60       0.0982   0.1551   0.1688   0.1681 
    at 0.70       0.0598   0.0997   0.1114   0.1113 
    at 0.80       0.0263   0.0641   0.0648   0.0642 
    at 0.90       0.0106   0.0262   0.0259   0.0259 
    at 1.00       0.0039   0.0060   0.0062   0.0062 
Average precision (non-interpolated) over all rel docs
                  0.1868   0.2394   0.2472   0.2466
    % Change:               28.2     32.3     32.0 
[Graph of Porter + KStem]

Interpretations

For this collection, on this set of queries, our cooccurrence analysis does not give us much better average retrieval results than the fully stemmed, untamed stem classes. This does not look entirely promising, until you realize that we're getting these equivalent results with much less work.

For example, looking at the Porter + KStem classes (the merged version), we have a total of 59,534 stem classes (with more than one word per class), with an average class size of 3.37. Taming these with the looser 0.0001 threshold drops this to 24,432 classes, with an average class size of 3.21. For the TREC 6 query set, this meant that, using the 0.0001 threshold classes, we only had to examine the inverted lists for roughly half as many words, yet got essentially the same retrieval results as with the fully-stemmed classes.

Furthermore, when we increase the strictness and raise the threshold to 0.01, we do only slightly worse, on average, than with the looser threshold and the fully-stemmed classes. However, if you look at the highest levels of recall, precision actually improves! The difference is slight, but it is consistent across all our stemmers: Porter, KStem, and the combination of the two. Additionally, referring again to the Porter/KStem combination, the 0.01 threshold creates 19,495 classes, with an average class size of 2.39. This represents about a fourth as much work on the TREC 6 query set as with the fully stemmed representation.

Keep in mind, however, that these results are only for one collection. I haven't yet had time to test these stem classes on other collections. Hence, I present the classes themselves for your use. I would only ask that, if you decide to test these stem classes, you contact the author with your results.

Downloads

If you are interested in downloading or browsing these stem classes, please click here.