Stemming and Cooccurrence on a Larger Corpus
Jeremy Pickens
Introduction
This web page details an extension of the work on stemming and corpus-based cooccurrence
analysis first explored by Xu/Croft in the following paper:
Xu, Jinxi and Croft, W.B.,
"Corpus-Based Stemming Using Co-occurrence of Word Variants,"
ACM TOIS, Jan. 1998, vol. 16, no. 1, pp. 61-81.
Also available as Computer Science Technical Report TR96-67 (1996).
The basic idea behind this work is that we can use cooccurrence analysis of word
variants within a particular corpus to ascertain which variants belong together and
which do not. A stemmer, such as Porter or KStem, creates the initial word
variant (stem) classes.
For example, the Porter stemmer makes the following seemingly good conflations:
- bank banked banking bankings banks
- ocean oceaneering oceanic oceanics oceanization oceans
However, the Porter stemmer also makes the following bad conflations:
- polic polical polically police policeable policed policement policer policers
polices policial policically policier policiers policies policing policization
policize policly policy policying policys
- pun punative [sic] punned punning puns
Obviously, "policy"/"police" should not belong together, and neither should
"pun"/"punative". (Unless we're talking about punative damages
caused by a bad pun.) Corpus-based analysis of these word variants
should tell us this, since policy and police probably occur individually in the entire
corpus many more times than they occur together within the same window
within that corpus. Therefore, using an Expected Mutual Information
Measure (EMIM),
we can conclude that they probably aren't related. We use this
information to tame the initial "policy"/"police" stem class, creating two new classes:
- policies policy
- police policed policing
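To make this concrete, here is a small sketch of window-based cooccurrence
scoring. The function name and the Dice-style score n_ab / (n_a + n_b) are my
own simplifications for illustration, not Xu and Croft's exact EMIM:

```python
from collections import Counter

def cooccurrence_scores(tokens, vocab, window=100):
    """Score each pair of words from `vocab` by how often they co-occur
    within `window` tokens of each other. The ratio n_ab / (n_a + n_b)
    is a Dice-style stand-in for EMIM, not the exact measure."""
    word_count = Counter(t for t in tokens if t in vocab)
    pair_count = Counter()
    for i, t in enumerate(tokens):
        if t not in vocab:
            continue
        # look ahead within the window; each unordered pair is counted once
        for u in tokens[i + 1 : i + window]:
            if u in vocab and u != t:
                pair_count[tuple(sorted((t, u)))] += 1
    return {
        (a, b): n_ab / (word_count[a] + word_count[b])
        for (a, b), n_ab in pair_count.items()
    }
```

Run over a full corpus, a pair like "police"/"policy" shares few windows
relative to the words' individual frequencies and scores low, while true
variants such as "policy"/"policies" score high.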
But what about "bank"/"banked"? When one talks
about "bank", one usually speaks of a financial institution. However,
one could also mean river bank, or a bank shot, such as in billiards.
When one speaks of "banked", it is less often in the
financial sense, and more often in the sense of some
projectile changing direction. However,
these are only my personal perceptions of these two words. It could be that,
for a particular corpus, my predictions/perceptions are completely
inaccurate.
Therefore, to get around this problem, we again
use corpus-based cooccurrence statistics.
If "bank" and "banked" co-occur fairly frequently within
our particular corpus, we can conclude that, at least in this corpus,
they are probably related. As it turns out, they do not, so "banked"
is removed, and we are left with the following stem class:
- bank banking bankings banks
Significance of this Work
Xu and Croft's initial experiments with
corpus-based cooccurrence analysis were performed on collections up to
300 MB in size. While this produced good corpus-specific stemmers,
we wanted to see if a larger corpus would tell us anything different.
Perhaps with more statistics we could create a better stemmer,
avoiding
some of those conflations, such as "tum"/"tumor", that analysis on
a smaller corpus could not handle. Or,
with more statistics and a larger, heterogeneous set of documents,
we could at least create a more generalized
stemmer. For this most recent experiment, the entire 5.5 gigabyte TREC 1-5
collection was used to create the stem classes. The TREC 6 query
set was used to test the stem classes.
Results
Three sets of experiments were done, using initial
classes created by (1) the Porter stemmer, (2) K-Stem, and (3) the
Porter stemmer classes merged in a connected component manner with
the K-Stem classes.
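The connected-component merge can be sketched as follows: treat each class
from either stemmer as a group of words that must stay together, and merge any
groups that share a word. This is illustrative code under that reading, not
the original implementation:

```python
def merge_class_sets(*class_sets):
    """Merge stemmer outputs (e.g. Porter classes and K-Stem classes)
    in a connected-component manner: any two words that share a class
    in ANY input stemmer end up in the same merged class."""
    owner = {}  # word -> the merged class (a set) it currently belongs to
    for classes in class_sets:
        for cls in classes:
            merged = set(cls)
            # absorb every existing class that overlaps this one
            for w in cls:
                if w in owner:
                    merged |= owner[w]
            for w in merged:
                owner[w] = merged
    # deduplicate the surviving sets by identity
    unique = {id(s): s for s in owner.values()}
    return sorted(sorted(s) for s in unique.values())
```

For example, a Porter class {policy, police} and a K-Stem class
{police, policing} merge into a single class {police, policing, policy}.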
Furthermore, a number of different thresholds were
used. After cooccurrence statistics had been gathered, a threshold
served as the cutoff for a significant relationship: word pairs
above the threshold were conflated; those below it were separated.
Previous experiments had shown that a threshold of 0.01
with a corpus cooccurrence window of 100 words worked best.
That threshold, as well as other, less strict ones, was used. In the
interest of space, we present only a few of these thresholds, the
most interesting ones, here.
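In code, taming a stem class with such a threshold amounts to a
connected-components pass over its word pairs: two variants stay conflated
only if above-threshold pairs chain them together. A minimal sketch, assuming
`scores` maps alphabetically sorted word pairs to cooccurrence scores (the
names are illustrative):

```python
def tame_class(stem_class, scores, threshold=0.01):
    """Split one stemmer class into sub-classes, keeping two variants
    together only when they are linked by a chain of word pairs whose
    cooccurrence score meets the threshold (connected components)."""
    words = sorted(stem_class)
    parent = {w: w for w in words}

    def find(w):
        # union-find lookup with path halving
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # union every pair whose score clears the threshold
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if scores.get((a, b), 0.0) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    return sorted(sorted(g) for g in groups.values())
```

Applied to the untamed "policy"/"police" class, this splits it along the
lines shown earlier: the policy variants in one sub-class, the police
variants in another.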
Experiments were done on the TREC 6 query set.
(A quick word on the following statistics: For
each particular stemmer [Porter, Kstem, Porter+Kstem], the first
column represents the unstemmed retrieval results on TREC 6, used
as the baseline. The second column is the stricter 0.01 threshold.
The third column is the looser 0.0001 threshold. Finally, the
last column is the fully stemmed classes, the original classes
created by the particular stemmer in question, untamed by corpus-based
cooccurrence analysis.)
Porter Stemmer

                          Unstemmed   Thr 0.01   Thr 0.0001   Porter Stemmed
Queryid (Num):                   50         50           50               50
Total number of documents over all queries
  Retrieved:                  50000      50000        50000            50000
  Relevant:                    4611       4611         4611             4611
  Rel_ret:                     2152       2497         2520             2518
Interpolated Recall - Precision Averages:
  at 0.00                    0.6738     0.7296       0.7095           0.7097
  at 0.10                    0.4622     0.5236       0.5405           0.5388
  at 0.20                    0.3844     0.4334       0.4387           0.4382
  at 0.30                    0.2627     0.3245       0.3362           0.3351
  at 0.40                    0.1866     0.2734       0.2803           0.2792
  at 0.50                    0.1314     0.2014       0.2105           0.2100
  at 0.60                    0.0982     0.1468       0.1584           0.1572
  at 0.70                    0.0598     0.0916       0.0912           0.0909
  at 0.80                    0.0263     0.0526       0.0522           0.0522
  at 0.90                    0.0106     0.0265       0.0273           0.0273
  at 1.00                    0.0039     0.0066       0.0080           0.0080
Average precision (non-interpolated) over all rel docs:
                             0.1868     0.2332       0.2398           0.2392
% Change over unstemmed:         --       24.8         28.4             28.0
KStem

                          Unstemmed   Thr 0.01   Thr 0.0001       K-Stemmed
Queryid (Num):                   50         50           50               50
Total number of documents over all queries
  Retrieved:                  50000      50000        50000            50000
  Relevant:                    4611       4611         4611             4611
  Rel_ret:                     2152       2500         2550             2549
Interpolated Recall - Precision Averages:
  at 0.00                    0.6738     0.7442       0.7085           0.7085
  at 0.10                    0.4622     0.5247       0.5213           0.5211
  at 0.20                    0.3844     0.4164       0.4299           0.4299
  at 0.30                    0.2627     0.3107       0.3186           0.3175
  at 0.40                    0.1866     0.2676       0.2773           0.2763
  at 0.50                    0.1314     0.2065       0.2157           0.2131
  at 0.60                    0.0982     0.1454       0.1587           0.1585
  at 0.70                    0.0598     0.0940       0.0974           0.0972
  at 0.80                    0.0263     0.0621       0.0644           0.0644
  at 0.90                    0.0106     0.0233       0.0252           0.0252
  at 1.00                    0.0039     0.0062       0.0067           0.0067
Average precision (non-interpolated) over all rel docs:
                             0.1868     0.2297       0.2361           0.2356
% Change over unstemmed:         --       23.0         26.4             26.1
Porter + KStem

                          Unstemmed   Thr 0.01   Thr 0.0001   P+K Stemmed
Queryid (Num):                   50         50           50            50
Total number of documents over all queries
  Retrieved:                  50000      50000        50000         50000
  Relevant:                    4611       4611         4611          4611
  Rel_ret:                     2152       2528         2544          2545
Interpolated Recall - Precision Averages:
  at 0.00                    0.6738     0.7366       0.7028        0.7091
  at 0.10                    0.4622     0.5213       0.5339        0.5313
  at 0.20                    0.3844     0.4349       0.4392        0.4400
  at 0.30                    0.2627     0.3402       0.3512        0.3500
  at 0.40                    0.1866     0.2908       0.2999        0.2991
  at 0.50                    0.1314     0.2231       0.2299        0.2289
  at 0.60                    0.0982     0.1551       0.1688        0.1681
  at 0.70                    0.0598     0.0997       0.1114        0.1113
  at 0.80                    0.0263     0.0641       0.0648        0.0642
  at 0.90                    0.0106     0.0262       0.0259        0.0259
  at 1.00                    0.0039     0.0060       0.0062        0.0062
Average precision (non-interpolated) over all rel docs:
                             0.1868     0.2394       0.2472        0.2466
% Change over unstemmed:         --       28.2         32.3          32.0
[Figure: Graph of Porter + KStem]
Interpretations
For this collection, on this set of
queries, our cooccurrence analysis does not give us
much better average retrieval results than the fully stemmed, untamed
stem classes. This does not look entirely promising, until you
realize
that we're getting these equivalent results with much less work.
For example, looking at the Porter + K-Stem classes
(the merged version), we have a total of 59,534 stem classes (counting
only classes with more than one word), with an average class size of 3.37.
When you tame this using the looser 0.0001 threshold, this drops to 24,432
classes, with an average class size of 3.21. For the TREC 6 query set,
this meant that, using the 0.0001 threshold classes, we only had to
examine the inverted lists for roughly half as many words, yet got
essentially the same retrieval results as with the fully-stemmed classes.
Furthermore, when we increase the strictness
and raise the threshold to 0.01, we do only slightly worse, on
average, than with the looser threshold and the fully-stemmed classes.
However, if you look at the highest levels of recall, precision
actually improves! The difference is slight, but it is consistent
across all our various stemmers: Porter, K-Stem, and the combination
of the two. Additionally, referring again to the Porter/K-Stem combination,
the 0.01 threshold creates 19,495 classes, with an average class size of 2.39.
This represents about a fourth as much work on the TREC 6 query set
as with the fully-stemmed representation.
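The "half" and "fourth" figures can be sanity-checked from the class counts
and (rounded) average class sizes quoted above. Actual query-time work also
depends on which words the queries contain, so these ratios are only
approximate:

```python
# Approximate total word counts implied by the class counts and the
# rounded average class sizes quoted in the text.
full   = 59534 * 3.37   # fully stemmed Porter + K-Stem classes
loose  = 24432 * 3.21   # 0.0001 threshold
strict = 19495 * 2.39   # 0.01 threshold

print(round(loose / full, 2))   # → 0.39, i.e. under half the words
print(round(strict / full, 2))  # → 0.23, i.e. about a fourth
```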
Keep in mind, however, that these results are from only one
collection. I have not yet had time to test these stem classes on other
collections. Hence, I present the classes themselves for your use.
I would only ask that, if you decide to test one or more of these stem
classes, you contact the author with your results.
Downloads
If you are interested in downloading or browsing these stem classes,
please click here.