ACE.CitationExtraction r1.5 - 02 Jun 2004 - 04:36 - AronCulotta

The goal of this project is to build a unified model that segments paper citations and performs coreference between papers, venues, and authors. We build on the work of Wellner et al. (http://www.cs.umass.edu/~wellner/....? ), UAI'04.

Javadocs

AronCulotta - 19 May 2004:

Notes:

  • CharlesSutton skillfully ported and refactored FuchunPeng's and BenWellner's segmentation and coreference code from their user directories to projects/seg_plus_coref

  • Rewrote SGMLStringOperation to handle tag attributes

  • Performed running room tests to see how useful venue clusters are. I added a feature that is true if two papers are in the same venue cluster, and another that is true if two papers have the same venue_vol. Experiments assume perfect segmentation. Results are in /usr/col/tmp1/culotta/mallet/exp/cite/5.19/{venue.log, novenue.log}

Summary: Venue coref increases cluster precision dramatically; recall, not so much.

           Pr    Re    F1
no venue   91.9  98.5  95.1
venue      97.8  98.6  98.2
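
The two venue features described in the notes above could look roughly like the sketch below. It is illustrative only: the class `VenueFeatureSketch`, the `Paper` holder, and the feature names `sameVenueCluster` and `sameVenueVol` are assumptions and are not taken from projects/seg_plus_coref. Only the venue_vol field name comes from the notes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the actual code base. */
final class VenueFeatureSketch {

    /** Minimal stand-in for a citation whose fields are already segmented. */
    static final class Paper {
        final String title;
        final int venueClusterId;  // id of the venue cluster, or -1 if unknown
        final String venueVol;     // contents of the venue_vol field, or null if absent

        Paper(String title, int venueClusterId, String venueVol) {
            this.title = title;
            this.venueClusterId = venueClusterId;
            this.venueVol = venueVol;
        }
    }

    /** Returns the two binary venue features for a pair of papers. */
    static Map<String, Boolean> venueFeatures(Paper a, Paper b) {
        Map<String, Boolean> features = new LinkedHashMap<>();
        // True only when both papers have a known venue cluster and it is the same one.
        features.put("sameVenueCluster",
                a.venueClusterId >= 0 && a.venueClusterId == b.venueClusterId);
        // True only when both papers carry a venue_vol value and the values are equal.
        features.put("sameVenueVol",
                a.venueVol != null && a.venueVol.equals(b.venueVol));
        return features;
    }
}
```

Treating a missing cluster id or a missing venue_vol as "feature is false" is a design choice of this sketch; the real code may handle missing fields differently.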

Next steps:

  • run trial using CRF output
  • train classifier for venue coref
  • co-cluster venues and papers

AronCulotta - 21 May 2004:

Obtained baselines for venue coreference (coreference/VenueCoreference.java). Adapted Ben's code to do venue coreference (independent of paper coreference). See coreference/CitationUtils.java. Again, we assume perfect segmentation.

  • base: same features as paper coref
  • jn-bt-acr: base + one feature for approx journal match and one for booktitle match + acronym feature
  • venue-acr: base + one feature for "venue" approx match + acronym
  • all: base + jn-bt-acr + venue

            Pr    Re    F1
base        98.7  52.7  68.7
jn-bt-acr   97.8  75.8  85.4
venue-acr   97.5  78.4  86.9
all         97.7  74.3  84.4
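
For illustration, an acronym feature of the kind listed above might be sketched as follows. The notes do not say how the feature is computed in coreference/CitationUtils.java, so everything here is an assumption: the class name `AcronymFeatureSketch`, the stop-word list, and the rule "one string equals the initials of the other".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of an acronym-match feature between two venue strings. */
final class AcronymFeatureSketch {

    // Assumed stop words that are usually skipped when forming a venue acronym.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("of", "on", "in", "the", "and", "for", "a", "an"));

    /** Concatenates the first character of every non-stop-word token. */
    static String initials(String venue) {
        StringBuilder sb = new StringBuilder();
        for (String token : venue.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            sb.append(token.charAt(0));
        }
        return sb.toString();
    }

    /** True if either string, stripped of punctuation, equals the initials of the other. */
    static boolean acronymMatch(String a, String b) {
        return isAcronymOf(a, b) || isAcronymOf(b, a);
    }

    private static boolean isAcronymOf(String shortForm, String longForm) {
        String s = shortForm.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        // Require at least two characters so single letters do not match by accident.
        return s.length() >= 2 && s.equals(initials(longForm));
    }
}
```

Under these assumptions, `acronymMatch("UAI", "Uncertainty in Artificial Intelligence")` is true, while a string such as "Proc. UAI" would not match, so a real feature would likely need extra normalization.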

Still need to run with noisy CRF output. Then implement joint clustering of papers and venues. How to pass clustering info to segmentation??

CharlesSutton - 25 May 2004:

I have gotten annoyed enough to start a ListOfBogusCiteseerLabelings.

AronCulotta - 2 June 2004:

Some results with the "Conditional Link" clustering method:

Training data is created by sampling from clusters ("sampled") or by generating one way to cluster the data ("generated").

Testing is done either by randomly choosing nodes to classify with the existing clusters ("random"), or by choosing the node whose decision we're most confident about first ("greedy").
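
The difference between the "random" and "greedy" test-time orderings can be sketched as below. This is not the project's implementation: the `LinkScorer` interface, the 0.5 decision threshold, and the use of distance from 0.5 as "confidence" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of node-at-a-time clustering with two node orderings. */
final class ConditionalLinkSketch {

    /** Assumed classifier: probability that a node belongs to an existing cluster. */
    interface LinkScorer {
        double score(int node, List<Integer> cluster);
    }

    static List<List<Integer>> cluster(int n, LinkScorer scorer, boolean greedy, Random rng) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(i);
        }
        if (!greedy) {
            Collections.shuffle(pending, rng);  // "random": visit nodes in random order
        }
        List<List<Integer>> clusters = new ArrayList<>();
        while (!pending.isEmpty()) {
            int pick = 0;
            if (greedy) {
                // "greedy": pick the node whose best decision is farthest from 0.5.
                double bestConfidence = -1.0;
                for (int i = 0; i < pending.size(); i++) {
                    double conf = Math.abs(bestScore(pending.get(i), clusters, scorer) - 0.5);
                    if (conf > bestConfidence) {
                        bestConfidence = conf;
                        pick = i;
                    }
                }
            }
            int node = pending.remove(pick);
            int target = bestCluster(node, clusters, scorer);
            if (target >= 0 && scorer.score(node, clusters.get(target)) > 0.5) {
                clusters.get(target).add(node);       // join the best existing cluster
            } else {
                List<Integer> fresh = new ArrayList<>();
                fresh.add(node);
                clusters.add(fresh);                  // otherwise start a new cluster
            }
        }
        return clusters;
    }

    /** Index of the highest-scoring cluster, or -1 if there are no clusters yet. */
    private static int bestCluster(int node, List<List<Integer>> clusters, LinkScorer scorer) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < clusters.size(); c++) {
            double s = scorer.score(node, clusters.get(c));
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    /** Highest score over existing clusters; 0.0 when no cluster exists. */
    private static double bestScore(int node, List<List<Integer>> clusters, LinkScorer scorer) {
        int best = bestCluster(node, clusters, scorer);
        return best < 0 ? 0.0 : scorer.score(node, clusters.get(best));
    }
}
```

In this sketch the "random" variant can be run several times with different seeds and the evaluation scores averaged, whereas the "greedy" variant is deterministic for a fixed scorer.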

So far, it seems that the average of 5 "random" trials trained on "generated" data performs better than a "greedy" method on "generated" data. This isn't entirely unexpected, as it's hard to get stuck in a local minimum if you're not being greedy.

Preliminary experiments with "sampled" data haven't improved things for either "random" or "greedy." This seems strange, since we have such a large amount of training data. Idea: an active-learning-style approach in which we over-sample the decisions we're uncertain about.

Also, the sampling performance seems very sensitive to the proportion of positive and negative samples chosen. Currently, a pos/neg ratio between .01 and .1 does best. Is there a better way to set this? Shouldn't it reflect the ratio we'll see at test time?
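
One way to control the pos/neg ratio when drawing training pairs is rejection sampling against fixed quotas, sketched below. The class `PairSamplerSketch`, its `Pair` type, and the with-replacement sampling scheme are assumptions for illustration; the notes do not describe how the "sampled" data is actually drawn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch: sample labeled mention pairs at a target pos/neg ratio. */
final class PairSamplerSketch {

    /** A training pair: two mention indices plus the gold coreference label. */
    static final class Pair {
        final int a;
        final int b;
        final boolean coreferent;

        Pair(int a, int b, boolean coreferent) {
            this.a = a;
            this.b = b;
            this.coreferent = coreferent;
        }
    }

    /**
     * goldCluster[i] is the gold cluster id of mention i. Draws up to `total` pairs
     * (with replacement) so that positives / negatives is about posNegRatio.
     */
    static List<Pair> sample(int[] goldCluster, int total, double posNegRatio, Random rng) {
        int wantPos = (int) Math.round(total * posNegRatio / (1.0 + posNegRatio));
        int wantNeg = total - wantPos;
        List<Pair> out = new ArrayList<>();
        int pos = 0;
        int neg = 0;
        // Cap the number of draws so the loop ends even if a quota cannot be filled.
        long maxDraws = 1000L * total;
        for (long draws = 0; draws < maxDraws && out.size() < total; draws++) {
            int a = rng.nextInt(goldCluster.length);
            int b = rng.nextInt(goldCluster.length);
            if (a == b) {
                continue;
            }
            boolean same = goldCluster[a] == goldCluster[b];
            if (same && pos < wantPos) {
                out.add(new Pair(a, b, true));
                pos++;
            } else if (!same && neg < wantNeg) {
                out.add(new Pair(a, b, false));
                neg++;
            }
        }
        return out;
    }
}
```

For example, `sample(gold, 10000, 0.1, rng)` would aim for roughly 909 positive and 9091 negative pairs. Passing the empirical pos/neg ratio of the gold clustering instead would be one way to mirror the ratio seen at test time.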

Trials:

CLRG - Conditional Link - "random", "generated"

CLGG - Conditional Link - "greedy", "generated"

CLRS10 - Conditional Link - "random", "sampled" 10K instances

CLGS10 - Conditional Link - "greedy", "sampled" 10K instances

BEN - Ben's clustering results, reported in UAI

BER - Berkeley's clustering results, also reported in the UAI paper

The goal is to beat Ben's UAI and Berkeley results on clustering Citeseer data. Here's where we are:

Pairwise Eval

         F1    Prec  Rec
CLRG     .964  .960  .968
CLGG     -     -     -
CLRS10   -     -     -
CLGS10   -     -     -
BEN      -     -     -

Cluster Eval

         F1    Prec  Rec
CLRG     .921  .912  .932
CLGG     -     -     -
CLRS10   -     -     -
CLGS10   -     -     -
BEN      -     -     .947
BER      -     -     .94
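
As a reference for how the two evaluations differ, here is a minimal sketch. It assumes the common definitions: pairwise scores count pairs of citations placed in the same cluster, and cluster scores count predicted clusters that exactly match a gold cluster. The notes do not spell out the definitions used in the experiments, and the class name `CorefEvalSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pairwise and cluster-level coreference evaluation. */
final class CorefEvalSketch {

    /** F1 is the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return (precision + recall == 0.0) ? 0.0 : 2 * precision * recall / (precision + recall);
    }

    /** Returns {precision, recall, f1} from raw counts. */
    private static double[] prf(int correct, int predicted, int gold) {
        double p = predicted == 0 ? 0.0 : (double) correct / predicted;
        double r = gold == 0 ? 0.0 : (double) correct / gold;
        return new double[] {p, r, f1(p, r)};
    }

    /** Pairwise eval: pred[i] and gold[i] are the cluster ids of citation i. */
    static double[] pairwise(int[] pred, int[] gold) {
        int correct = 0;
        int predicted = 0;
        int truth = 0;
        for (int i = 0; i < pred.length; i++) {
            for (int j = i + 1; j < pred.length; j++) {
                boolean predSame = pred[i] == pred[j];
                boolean goldSame = gold[i] == gold[j];
                if (predSame) predicted++;
                if (goldSame) truth++;
                if (predSame && goldSame) correct++;
            }
        }
        return prf(correct, predicted, truth);
    }

    /** Cluster eval: a predicted cluster counts only if it equals a gold cluster exactly. */
    static double[] cluster(int[] pred, int[] gold) {
        Set<Set<Integer>> predGroups = groups(pred);
        Set<Set<Integer>> goldGroups = groups(gold);
        int exact = 0;
        for (Set<Integer> c : predGroups) {
            if (goldGroups.contains(c)) exact++;
        }
        return prf(exact, predGroups.size(), goldGroups.size());
    }

    /** Groups citation indices by cluster id. */
    private static Set<Set<Integer>> groups(int[] labels) {
        Map<Integer, Set<Integer>> byLabel = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byLabel.computeIfAbsent(labels[i], k -> new HashSet<>()).add(i);
        }
        return new HashSet<>(byLabel.values());
    }
}
```

Under these definitions a single misplaced citation costs only a few pairs in the pairwise scores but invalidates a whole cluster in the cluster scores, which is consistent with the cluster numbers being lower than the pairwise ones for CLRG.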

Next: Try clustering venues and papers simultaneously in this framework.