Research Faculty Candidate Talk
Johns Hopkins University
The Center for Language and Speech Processing
Friday, March 7th, 2008
Computer Science Building, Room 151
Time: 11:00 a.m.
Faculty Host: Bruce Croft
"Bootstrapping Monolingual Parsers from Multilingual Data"
The creation of the Penn Treebank and similar datasets ca. 1990
produced a flowering of research on empirically trained syntactic
parsers, which is now bearing fruit in information extraction and
machine translation. This revolution has bypassed most languages and
domains, however, due to the expense of creating treebanks.
Semi-supervised learning methods such as bootstrapping and co-training
have the potential to leverage diverse sources of knowledge for robust
statistical parsing in these new settings.
Drawing on Abney's (2004) analysis of the Yarowsky algorithm, I
present a view of bootstrapping as optimization. This optimization is
performed with standard dynamic programming for projective syntax or
with a new model of graph spanning trees for non-projective syntax,
which allows trees with crossing dependency links in languages such as
Czech, Danish, and Dutch. Finally, I show how to draw features for a
parser in one language from parse trees in another language. These
quasi-synchronous grammars extend prior bootstrapping work with
synchronous grammars and also have applications in translation
modeling.
Bio:
David Smith is currently a Ph.D. student in Johns Hopkins University's
Computer Science Department and Center for Language and Speech
Processing, and an NSF graduate fellow. He received his A.B. in
classics from Harvard University. His interests are in machine
translation, natural language parsing, and semi-supervised machine
learning methods. David was formerly head programmer for the Perseus
Digital Library Project at Tufts University, where he strayed from the
path of classical philology toward text mining, geocoding, and
information extraction.