NSF ITR: Machine Learning for Sequences and Structured Data: Tools for Non-Experts

Principal Investigators:

Andrew McCallum
University of Massachusetts Amherst

Fernando Pereira
University of Pennsylvania

John Lafferty
Carnegie Mellon University

In this collaborative research project between UMass Amherst, UPenn, and CMU, the team is researching ways to dramatically improve the ability of people who are not experts in machine learning to design and automatically train models for analyzing and transforming sequences and other structured data such as text, signals, handwriting, and biological sequences. Working in the context of recent successes with conditional random fields (CRFs) and other conditional models of structured data, this work will achieve its goals through scientific advances in model definition and combination, robust parameter estimation, and data-efficient training procedures, supported by an innovative compositional software architecture.

Automatic classification and function fitting (regression) are mature techniques used in many scientific, engineering, and national security applications. Free and commercial software packages allow novice users to train models with reasonable confidence of attaining useful results, and to use these results in sophisticated predictive and decision-making systems. For sequence and structured data problems, however, the situation is different. Task-specific machine-learning software has become increasingly available (e.g., speech recognition and genomics software tools), but there are no modular, easy-to-use packages for domain experts. Such experts work on science and engineering problems such as predictive modeling of spatio-temporal patterns of brain activity or extraction of conceptual and citation networks from the scientific literature, and on national and homeland security applications such as extracting social networks from the open Web or dynamically recognizing suspicious temporal patterns in network logs.

Researchers in this project will make available for the first time a user-oriented software toolkit for integrated analysis and transformation of sequential and graph-structured data, which will be a major innovation in the data, models, and communications technical focus area. What makes this possible is the convergence of three scientific innovations in learning from structured data. First, powerful, trainable analyzers and transformers for sequences and other structured data can be built by combining simpler conditional models with general composition and product operations based on the theory of weighted automata. Second, a range of capacity-control techniques (feature induction, margin maximization, Bayesian automatic relevance determination) can be generalized to these complex models to control overfitting without the need for extensive hyperparameter adjustments. Third, the need for fully annotated training data can be reduced by combining partial evidence for multiple sequences into a single graph labeling problem. Together, these ideas provide a framework for flexibly specifying and learning parameters for multistep probabilistic transformations from complex data to their structured representations, and for designing and documenting efficient and usable software for model specification, training, combination, and application to real-world data.
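As an illustration of the first idea only, the following minimal Python sketch combines two simple conditional sequence models with a product operation: their local log-scores are added, and the combined model is renormalized over all label sequences with the forward algorithm. It is not the project's toolkit and does not implement weighted-automata composition. The label set, the two toy scoring functions, and all function names are hypothetical and chosen purely for illustration.

```python
import math
from itertools import product as cartesian

LABELS = ["O", "NAME"]  # hypothetical label set for a toy name tagger


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(score, tokens):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    alpha = {y: score(None, y, tokens, 0) for y in LABELS}
    for i in range(1, len(tokens)):
        alpha = {
            y: log_sum_exp([alpha[p] + score(p, y, tokens, i) for p in LABELS])
            for y in LABELS
        }
    return log_sum_exp(list(alpha.values()))


def sequence_score(score, tokens, labels):
    """Unnormalized log-score of one complete label sequence."""
    total, prev = 0.0, None
    for i, y in enumerate(labels):
        total += score(prev, y, tokens, i)
        prev = y
    return total


def conditional_prob(score, tokens, labels):
    """p(labels | tokens) under a globally normalized linear-chain model."""
    return math.exp(sequence_score(score, tokens, labels) - log_partition(score, tokens))


def product_model(score_a, score_b):
    """Product of two models: local log-scores add; normalization is recomputed."""
    return lambda prev, y, tokens, i: (
        score_a(prev, y, tokens, i) + score_b(prev, y, tokens, i)
    )


# Two toy component models (assumed weights, not learned from data).
def capitalization_model(prev, y, tokens, i):
    # Rewards tagging capitalized tokens as NAME and other tokens as O.
    return 1.5 if (tokens[i][0].isupper() == (y == "NAME")) else 0.0


def continuity_model(prev, y, tokens, i):
    # Rewards keeping the same label as the previous position.
    return 0.8 if prev == y else 0.0


if __name__ == "__main__":
    tokens = ["met", "Ada", "Lovelace", "today"]
    combined = product_model(capitalization_model, continuity_model)
    best = max(
        cartesian(LABELS, repeat=len(tokens)),
        key=lambda ys: conditional_prob(combined, tokens, list(ys)),
    )
    print(best, conditional_prob(combined, tokens, list(best)))
```

Because the product is taken on unnormalized scores and normalized once per input sequence, the combined model remains a valid conditional distribution over label sequences. The exhaustive search for the best sequence is used here only to keep the sketch short; a practical system would use Viterbi decoding.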

The successful realization of this research will shorten by five to ten years the usual time frame for such complex technology to have real impact in science, engineering, and national security. Progress in science and engineering increasingly demands more effective processing and combination of multiple data sources, ranging from the stored results of high-throughput analyzers to the rapidly growing mass of electronically available literature. The tools developed under this project will enable researchers to create software components for recognizing, extracting, and cross-linking patterns in sequential and graph-structured data. Security analysts will also benefit from the availability of the proposed tools for extracting relevant information from a variety of data streams, ranging from textual messages to network event logs.

This project is supported in part by the National Science Foundation under NSF grant #IIS-0427594.