Research Statement
Tom Kalt


In recent years, statistical natural language processing has relied heavily on tools from machine learning and information theory. My thesis work is an example of this trend. However, the time is ripe to look to linguistics, psycholinguistics, and cognitive science for ways to improve NLP systems. My research interests center on two questions:
  • How can we design computer programs that process human language in useful ways?
  • How do humans process language, and how do they learn to do it?

    In my Ph.D. thesis, I applied the framework of learning control policies for sequential decision problems to Penn Treebank parsing. The theory of Markov decision processes and reinforcement learning provides a rich set of tools that can be applied fruitfully to syntactic analysis, both for learning and for processing. This control-oriented view leads naturally to probabilistic models that are conditional rather than joint. NLP components designed this way can be very fast.

    My thesis work raised numerous questions that deserve further investigation, including:

  • How can lexical syntax and lexical semantics be best exploited in making parsing decisions?
  • How do the modules of an NLP system interact (e.g., tagging, chunking, and parsing, or parsing and semantic role labeling)?
  • How can insights from psycholinguistics and cognitive science improve the design of natural language parsing programs?

    At present, most statistical NLP systems rely primarily on supervised learning. However, other learning architectures have been studied, such as semi-supervised, unsupervised, and active learning, and many open questions remain about how to apply them. The use of reinforcement learning in NLP is still uncharted territory. These alternative architectures have great potential practical importance, because they can reduce the cost of developing or retargeting a system. The design of learning architectures is an area where cognitive science may provide important ideas.

    I am interested in finding connections between "academic" research problems and commercially important applications, such as information retrieval, text classification, machine translation, and question answering. Identifying and exploiting these connections will benefit the field of NLP in many ways. Operational systems can be fruitful sources of both data and research problems. Commercial applications can also provide benefits in the area of evaluation. One of the unsatisfying aspects of Penn Treebank-style parsing research is the artificiality of the evaluation measures, which essentially treat all errors as equally important. Studying NLP systems in the context of real-world applications may provide a much better basis for evaluating the performance of components such as parsers.