Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR

A Collaborative Project with Tufts University and the Internet Archive

James Allan, PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

R. Manmatha, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
manmatha@cs.umass.edu
http://www.cs.umass.edu/~manmatha

David Smith, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
dasmith@cs.umass.edu
http://www.cs.umass.edu/~dasmith


Project Award Information

National Science Foundation Award Number: IIS-0910884
Data-intensive Computing

Duration: 09/01/09 - 08/31/14


Project Summary


The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.

To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on seven interlocking technologies: improved OCR accuracy, through word spotting, probabilistic models built on joint distributions of features, and topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.
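As a rough illustration of the query expansion item above, the sketch below estimates a textbook-style relevance model over a toy in-memory collection: each candidate term is weighted by the sum over documents of P(w|D) * P(Q|D), using smoothed unigram document models. It is not the project's implementation and does not include the offline pre-processing of document comparisons mentioned above. The function name relevance_model, the smoothing weight lam, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def relevance_model(query, docs, lam=0.5, top_terms=3):
    """Rank expansion terms by P(w|R), taken proportional to sum_D P(w|D) * P(Q|D).

    query: list of query tokens; docs: list of non-empty token lists.
    lam is a hypothetical Jelinek-Mercer smoothing weight, chosen arbitrarily.
    """
    doc_counts = [Counter(d) for d in docs]
    collection = Counter()
    for counts in doc_counts:
        collection.update(counts)
    total = sum(collection.values())

    def p_w_d(word, counts):
        # Mix the document estimate with the collection estimate.
        doc_len = sum(counts.values())
        return (1 - lam) * counts[word] / doc_len + lam * collection[word] / total

    scores = Counter()
    for counts in doc_counts:
        # Query likelihood P(Q|D) under the smoothed document model.
        q_lik = 1.0
        for q in query:
            q_lik *= p_w_d(q, counts)
        for word in collection:
            scores[word] += p_w_d(word, counts) * q_lik

    norm = sum(scores.values())
    if norm == 0:
        # No query term occurs anywhere in the collection.
        return []
    ranked = sorted(
        ((w, s / norm) for w, s in scores.items() if w not in query),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top_terms]


docs = [
    ["whale", "ship", "sea", "captain"],
    ["whale", "sea", "harpoon"],
    ["garden", "rose", "tea"],
]
expansion = relevance_model(["whale"], docs, top_terms=2)
print(expansion)
```

In this toy collection, terms that co-occur with the query word in the documents most likely to have generated the query rank highest, so they are the natural candidates to append to the original query. A system at the scale of a million books would restrict the sum to top-ranked documents or passages rather than scanning the whole collection.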

When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.

The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Graduate Students Involved in this Project:

Publications:



This work is supported by the National Science Foundation (Award Number 0910884). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.