Stemming and Cooccurrence on a Larger Corpus
Jeremy Pickens
Introduction
This web page details an extension
of the work on stemming and corpus-based cooccurrence analysis first
explored by Xu/Croft in the following paper:
Xu, Jinxi and Croft, W.B., "Corpus-Based
Stemming using Co-occurrence of Word Variants," ACM TOIS,
Jan. 1998, vol. 16, no. 1, pp. 61-81. Computer Science Technical
Report TR96-67 (1996).
The basic idea behind this work
is that we can use cooccurrence analysis of word variants within
a particular corpus to ascertain which variants belong together
and which do not. A stemmer, such as Porter or KStem, creates the
initial word variant (stem) classes.
For example, the Porter stemmer
makes the following seemingly good conflations:
- bank banked banking bankings banks
- ocean oceaneering oceanic oceanics oceanization oceans
However, the Porter stemmer also
makes the following bad conflations:
- polic polical polically police policeable policed policement policer
  policers polices policial policically policier policiers policies
  policing policization policize policly policy policying policys
- pun punative [sic] punned punning puns
Obviously, "policy"/"police" should
not belong together, and neither should "pun"/"punative". (Unless
we're talking about punative damages caused by a bad pun.) Corpus-based
analysis of these word variants should tell us this, since policy
and police probably occur individually in the entire corpus many
more times than they occur together within the same window within
that corpus. Therefore, using an Expected Mutual Information Measure
(EMIM), we can conclude that they probably aren't related. We use
this information to tame the initial "policy"/"police" stem class,
creating two new classes:
- policies policy
- police policed policing
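The windowed cooccurrence idea can be sketched in a few lines of Python. This is only a minimal sketch, not the code used for these experiments: the function names, the toy token list, and the exact form of the score are assumptions, since this page does not give the formula. The score here is the number of windowed co-occurrences, optionally reduced by an expected chance count (the hypothetical parameter k), divided by the total occurrences of the two words.

```python
from collections import Counter


def cooccurrence_counts(tokens, variants, window=100):
    """Count occurrences of each variant and windowed co-occurrences of pairs."""
    variants = set(variants)
    # Keep only the positions where a variant of interest occurs.
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in variants]
    n = Counter(tok for _, tok in hits)
    n_pair = Counter()
    for x, (i, a) in enumerate(hits):
        for j, b in hits[x + 1:]:
            if j - i >= window:  # b lies outside the window starting at a
                break
            if a != b:
                n_pair[frozenset((a, b))] += 1
    return n, n_pair


def association(a, b, n, n_pair, k=0.0):
    """Assumed score: co-occurrences beyond chance, normalized by total occurrences."""
    total = n[a] + n[b]
    if total == 0:
        return 0.0
    expected = k * n[a] * n[b]  # k = 0 ignores the chance correction
    return max((n_pair[frozenset((a, b))] - expected) / total, 0.0)


# Toy token stream (made up); "x" stands for any unrelated word.
tokens = ["policy", "x", "policies", "x", "x", "x", "x", "x", "police", "policed"]
n, n_pair = cooccurrence_counts(
    tokens, ["policy", "policies", "police", "policed"], window=3
)
print(association("policy", "policies", n, n_pair))
print(association("policy", "police", n, n_pair))
```

In this toy stream, "policy" and "policies" share a window and get a positive score, while "policy" and "police" never do and score zero.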
But what about "bank"/"banked"?
When one talks about "bank", one usually speaks of a financial institution.
However, one could also mean river bank, or a bank shot, such as
in billiards. When one speaks about "banked", one is less often
referring to financial institutions and more often
speaking of some projectile changing direction. However,
these are only my personal perceptions of these two words. It could
be that, for a particular corpus, my predictions/perceptions are
completely inaccurate.
Therefore, to get around this
problem, we again can use corpus-based cooccurrence statistics.
If bank and banked co-occur fairly frequently within our particular
corpus, we can conclude that, at least in this corpus, they're probably
related. They do not, so we're left with the following stem class;
banked is removed:
Significance of this Work
Xu and Croft's
initial experiments with corpus-based cooccurrence analysis were
performed on collections up to 300 MB in size. While this produced
good corpus-specific stemmers, we wanted to see if a larger corpus
would tell us anything different. Perhaps with more statistics we
could create a better stemmer, avoiding some of those conflations,
such as "tum"/"tumor", that analysis on a smaller corpus could not
handle. Or, with more statistics and a larger, heterogeneous set
of documents, we could at least create a more generalized stemmer.
For this most recent experiment, the entire 5.5 gigabyte TREC 1-5
collection was used to create the stem classes. The TREC 6 query
set was used to test the stem classes.
Results
Three sets of
experiments were done, using initial classes created by (1) the
Porter stemmer, (2) K-Stem, and (3) the Porter stemmer classes merged
in a connected component manner with the K-Stem classes.
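Merging two sets of stem classes "in a connected component manner" can be read as: any two words that share a class under either stemmer end up in the same merged class. Below is a minimal sketch of that reading using a union-find structure. The function name and the small example classes are made up for illustration and are not actual Porter or K-Stem output.

```python
def merge_classes(*class_lists):
    """Merge several lists of stem classes into connected components."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for classes in class_lists:
        for cls in classes:
            cls = list(cls)
            for word in cls:
                find(word)  # register every word, even in one-word classes
            for word in cls[1:]:
                union(cls[0], word)

    merged = {}
    for word in list(parent):
        merged.setdefault(find(word), set()).add(word)
    return sorted(sorted(c) for c in merged.values())


# Hypothetical classes from two stemmers (illustrative only).
stemmer_a = [["bank", "banked", "banking"], ["ocean", "oceanic"]]
stemmer_b = [["banking", "banks"], ["oceans", "ocean"]]
print(merge_classes(stemmer_a, stemmer_b))
```

Because "banking" appears in a class from each stemmer, the two "bank" classes are joined into one; the same happens for the "ocean" classes.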
Furthermore, a
number of different thresholds were used. After cooccurrence statistics
had been gathered, a threshold served as the cutoff for a significant
relationship: word pairs scoring above the threshold were conflated, and
those below it were separated. Previous experiments had
shown that a threshold of 0.01 with a corpus cooccurrence window
of 100 words worked best. This threshold, as well as other, less strict
ones, was used. In the interest of space, we present only
the most interesting of these thresholds here.
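The thresholding step can be sketched as follows. This is a sketch under stated assumptions: pairs scoring above the threshold are kept as links, and the new classes are taken to be the connected components of those links. This page does not say how the surviving pairs are regrouped, so that choice, the function name, and the scores in the example are all assumptions.

```python
def split_class(words, scores, threshold=0.01):
    """Split one stem class, keeping only links whose score exceeds the threshold."""
    adjacency = {w: set() for w in words}
    for (a, b), score in scores.items():
        if score > threshold and a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, classes = set(), []
    for start in words:
        if start in seen:
            continue
        # Depth-first search collects everything still linked to `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        classes.append(sorted(component))
    return classes


# Made-up scores for a shortened version of the "police"/"policy" class.
words = ["police", "policed", "policies", "policing", "policy"]
scores = {
    ("police", "policed"): 0.05,
    ("police", "policing"): 0.03,
    ("policies", "policy"): 0.08,
    ("police", "policy"): 0.0002,
}
print(split_class(words, scores, threshold=0.01))
print(split_class(words, scores, threshold=0.0001))
```

With these invented scores, the strict 0.01 threshold separates the "police" words from the "policy" words, while the looser 0.0001 threshold leaves the class intact, which illustrates why a looser threshold stays closer to the fully stemmed classes.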
Experiments
were done on the TREC 6 query set.
(A quick word
on the following statistics: For each particular stemmer [Porter,
Kstem, Porter+Kstem], the first column represents the unstemmed
retrieval results on TREC 6, used as the baseline. The second column
is the stricter 0.01 threshold. The third column is the looser 0.0001
threshold. Finally, the last column is the fully stemmed classes,
the original classes created by the particular stemmer in question,
untamed by corpus-based cooccurrence analysis.)
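The "% Change" row in each table appears to be the relative change in non-interpolated average precision against the unstemmed baseline. That reading is an assumption, but it can be checked with a short calculation; the function name below is made up.

```python
def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


baseline = 0.1868  # unstemmed average precision on TREC 6
porter_runs = [
    ("Threshold_0.01", 0.2332),
    ("Threshold_0.0001", 0.2398),
    ("Porter_Stemmed", 0.2392),
]
for name, avg_precision in porter_runs:
    print(f"{name}: {percent_change(baseline, avg_precision):.1f}")
```

Recomputing from the rounded four-digit averages can differ from the reported row in the last digit, presumably because the reported values were computed before rounding.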
Porter Stemmer

                         Unstemmed    Threshold_0.01  Threshold_0.0001    Porter_Stemmed
Queryid (Num):                  50                50                50                50
Total number of documents over all queries
  Retrieved:                 50000             50000             50000             50000
  Relevant:                   4611              4611              4611              4611
  Rel_ret:                    2152              2497              2520              2518
Interpolated Recall - Precision Averages:
  at 0.00                   0.6738            0.7296            0.7095            0.7097
  at 0.10                   0.4622            0.5236            0.5405            0.5388
  at 0.20                   0.3844            0.4334            0.4387            0.4382
  at 0.30                   0.2627            0.3245            0.3362            0.3351
  at 0.40                   0.1866            0.2734            0.2803            0.2792
  at 0.50                   0.1314            0.2014            0.2105            0.2100
  at 0.60                   0.0982            0.1468            0.1584            0.1572
  at 0.70                   0.0598            0.0916            0.0912            0.0909
  at 0.80                   0.0263            0.0526            0.0522            0.0522
  at 0.90                   0.0106            0.0265            0.0273            0.0273
  at 1.00                   0.0039            0.0066            0.0080            0.0080
Average precision (non-interpolated) over all rel docs
                            0.1868            0.2332            0.2398            0.2392
% Change:                                       24.8              28.4              28.0

KStem

                         Unstemmed    Threshold_0.01  Threshold_0.0001         K_Stemmed
Queryid (Num):                  50                50                50                50
Total number of documents over all queries
  Retrieved:                 50000             50000             50000             50000
  Relevant:                   4611              4611              4611              4611
  Rel_ret:                    2152              2500              2550              2549
Interpolated Recall - Precision Averages:
  at 0.00                   0.6738            0.7442            0.7085            0.7085
  at 0.10                   0.4622            0.5247            0.5213            0.5211
  at 0.20                   0.3844            0.4164            0.4299            0.4299
  at 0.30                   0.2627            0.3107            0.3186            0.3175
  at 0.40                   0.1866            0.2676            0.2773            0.2763
  at 0.50                   0.1314            0.2065            0.2157            0.2131
  at 0.60                   0.0982            0.1454            0.1587            0.1585
  at 0.70                   0.0598            0.0940            0.0974            0.0972
  at 0.80                   0.0263            0.0621            0.0644            0.0644
  at 0.90                   0.0106            0.0233            0.0252            0.0252
  at 1.00                   0.0039            0.0062            0.0067            0.0067
Average precision (non-interpolated) over all rel docs
                            0.1868            0.2297            0.2361            0.2356
% Change:                                       23.0              26.4              26.1

Porter + KStem

                         Unstemmed    Threshold_0.01  Threshold_0.0001  Porter+K_Stemmed
Queryid (Num):                  50                50                50                50
Total number of documents over all queries
  Retrieved:                 50000             50000             50000             50000
  Relevant:                   4611              4611              4611              4611
  Rel_ret:                    2152              2528              2544              2545
Interpolated Recall - Precision Averages:
  at 0.00                   0.6738            0.7366            0.7028            0.7091
  at 0.10                   0.4622            0.5213            0.5339            0.5313
  at 0.20                   0.3844            0.4349            0.4392            0.4400
  at 0.30                   0.2627            0.3402            0.3512            0.3500
  at 0.40                   0.1866            0.2908            0.2999            0.2991
  at 0.50                   0.1314            0.2231            0.2299            0.2289
  at 0.60                   0.0982            0.1551            0.1688            0.1681
  at 0.70                   0.0598            0.0997            0.1114            0.1113
  at 0.80                   0.0263            0.0641            0.0648            0.0642
  at 0.90                   0.0106            0.0262            0.0259            0.0259
  at 1.00                   0.0039            0.0060            0.0062            0.0062
Average precision (non-interpolated) over all rel docs
                            0.1868            0.2394            0.2472            0.2466
% Change:                                       28.2              32.3              32.0

[Figure: Graph of Porter + KStem]
Interpretations
For this collection,
on this set of queries, our cooccurrence analysis does not give
us much better average retrieval results than the fully stemmed,
untamed stem classes. This does not look entirely promising, until
you realize that we're getting these equivalent results with much
less work.
For example, looking
at the Porter + K-Stem classes (the merged version), we have a total
of 59,534 stem classes (with more than one word per class), with
an average 3.37 class size. When you tame this using the looser
0.0001 threshold, this drops to 24,432 classes, with an average
3.21 class size. For the TREC 6 query set, this meant that, using
the 0.0001 threshold classes, we only had to examine the inverted
lists for roughly half as many words, yet got basically the exact
same retrieval results as with the fully-stemmed classes.
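To see why smaller classes mean less work, consider expanding each query word to every member of its stem class and reading one inverted list per expanded word. The sketch below counts those lists for an untamed and a tamed class. It assumes this simple expansion model, which this page does not spell out; the function name and the example query are made up.

```python
def expand_query(query_terms, stem_classes):
    """Replace each query term by all members of its stem class (order preserved)."""
    class_of = {}
    for cls in stem_classes:
        for word in cls:
            class_of[word] = cls

    expanded = []
    for term in query_terms:
        # A term outside every class stands for itself.
        for word in class_of.get(term, [term]):
            if word not in expanded:
                expanded.append(word)
    return expanded


# Shortened, illustrative versions of the untamed and tamed classes.
untamed = [["police", "policed", "policing", "policies", "policy"]]
tamed = [["policies", "policy"], ["police", "policed", "policing"]]

query = ["police", "report"]
print(len(expand_query(query, untamed)))  # inverted lists read with untamed classes
print(len(expand_query(query, tamed)))    # inverted lists read with tamed classes
```

Each expanded word costs one inverted-list lookup, so splitting a large class directly reduces the number of lists a query has to touch.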
Furthermore, when
we increase the strictness and raise the threshold to 0.01, we
do only slightly worse, on average, than the looser threshold and
the fully-stemmed classes. However, if you look at the lowest level
of recall (0.00), precision actually improves! The difference is
slight, but it's consistent across all our various stemmers: Porter,
K-Stem, and the combination of the two. Additionally, referring again
to the Porter/K combination, the 0.01 threshold creates 19,495 classes,
with an average class size of 2.39. This represents about a fourth
as much work on the TREC 6 query set as with the fully stemmed representation.
Keep in mind,
however, that these results are only with one collection. I haven't
had time yet to test these stem classes on other collections. Hence,
I present the classes themselves for your use. I would only ask
that, if you decide to test more than one stem class, you contact
the author
with your results.
Downloads
If you are interested in downloading or browsing
these stem classes, they are available under our downloads section as stemming.tar.