Stemming and Cooccurrence on a Larger Corpus

Jeremy Pickens




This web page details an extension of the work on stemming and corpus-based cooccurrence analysis first explored by Xu/Croft in the following paper:

(1996), Xu, Jinxi and Croft, W.B., "Corpus-Based Stemming using Co-occurrence of Word Variants" in ACM TOIS, Jan. 1998, vol. 16, no. 1, pp. 61-81, Computer Science Technical Report TR96-67.

The basic idea behind this work is that we can use cooccurrence analysis of word variants within a particular corpus to ascertain which variants belong together and which do not. A stemmer, such as Porter or KStem, creates the initial word variant (stem) classes.

For example, the Porter stemmer makes the following seemingly good conflations:

* bank banked banking bankings banks
* ocean oceaneering oceanic oceanics oceanization oceans

However, the Porter stemmer also makes the following bad conflations:

* polic polical polically police policeable policed policement policer policers polices policial policically policier policiers policies policing policization policize policly policy policying policys
* pun punative [sic] punned punning puns

Obviously, "policy"/"police" should not belong together, and neither should "pun"/"punative". (Unless we're talking about punative damages caused by a bad pun.) Corpus-based analysis of these word variants should tell us this, since policy and police probably occur individually in the entire corpus many more times than they occur together within the same window within that corpus. Therefore, using an Expected Mutual Information Measure (EMIM), we can conclude that they probably aren't related. We use this information to tame the initial "policy"/"police" stem class, creating two new classes:

* policies policy
* police policed policing

But what about "bank"/"banked"? When one talks about "bank", one usually speaks of a financial institution. However, one could also mean river bank, or a bank shot, such as in billiards. When one speaks about "banked", it's more infrequently that one uses it with financial instituations, and more frequently that one is speaking of some projectile changing directions. However, these are only my personal perceptions of these two words. It could be that, for a particular corpus, my predictions/perceptions are completely inaccurate.

Therefore, to get around this problem, we again can use corpus-based cooccurrence statistics. If bank and banked co-occur fairly frequently within our particular corpus, we can conclude that, at least in this corpus, they're probably related. They do not, so we're left with the following stem class; banked is removed:

* bank banking banks

Significance of this Work

Xu and Croft's initial experiments with corpus-based cooccurrence analysis were performed on collections up to 300 MB in size. While this produced good corpus-specific stemmers, we wanted to see if a larger corpus would tell us anything different. Perhaps with more statistics we could create a better stemmer, avoiding some of those conflations, such as "tum"/"tumor", that analysis on a smaller corpus could not handle. Or, with more statistics and a larger, heterogeneous set of documents, we could at least create a more generalized stemmer. For this most recent experiment, the entire 5.5 gigabyte TREC 1-5 collection was used to create the stem classes. The TREC 6 query set was used to test the stem classes.


Three sets of experiments were done, using initial classes created by (1) the Porter stemmer, (2) K-Stem, and (3) the Porter stemmer classes merged in a connected component manner with the K-Stem classes.

Furthermore, a number of different thresholds were used. After cooccurrence statistics had been gathered, a threshold was used as a significant relationship cutoff point. Those word pairs above the threshold were conflated, those below the threshold were separated. Previous experiments had shown that a threshold of 0.01 with a corpus cooccurrence window of 100 words worked the best. This, as well as other, less strict thresholds, were used. In the interest of space, we only present a few of these thresholds, the most interesting ones, here.

Experiments were done on the TREC 6 query set.

(A quick word on the following statistics: For each particular stemmer [Porter, Kstem, Porter+Kstem], the first column represents the unstemmed retrieval results on TREC 6, used as the baseline. The second column is the stricter 0.01 threshold. The third column is the looser 0.0001 threshold. Finally, the last column is the fully stemmed classes, the original classes created by the particular stemmer in question, untamed by corpus-based cooccurrence analysis.)


For this collection, on this set of queries, our cooccurrence analysis does not give us much better average retrieval results than the fully stemmed, untamed stem classes. This does not look entirely promising, until you realize that we're getting these equivalent results with much less work.

For example, looking at the Porter + K-Stem classes (the merged version), we have a total of 59,534 stem classes (with more than one word per class), with an average 3.37 class size. When you tame this using the looser 0.0001 threshold, this drops to 24,432 classes, with an average 3.21 class size. For the TREC 6 query set, this meant that, using the 0.0001 threshold classes, we only had to examine the inverted lists for roughly half as many words, yet got basically the exact same retrieval results as with the fully-stemmed classes.

Furthermore, when we increase the strictness, and raise the threshold to 0.01, we only do slighty worse, on average, than the looser threshold and the fully-stemmed classes. However, if you look at the highest level of recall, precision actually improves! The difference is slight, but it's consistent across all our various stemmers: Porter, K-Stem, and the combination of the two. Additionally, refering again to the Porter/K combination, the 0.01 threshold creates 19,495 classes, with an average class size of 2.39. This represents about a fourth as much work on the TREC 6 query set as with the fully stemmed representation.

Keep in mind, however, that these results are only with one collection. I haven't had time yet to test these stem classes on other collections. Hence, I present the classes themselves for your use. I would only ask that, if you decide to test more than one stem class, you contact the author with your results.


If you are interested in downloading/browsing these stem classes, they are available as stemming.tar.