Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR

A Collaborative Project with Tufts University and the Internet Archive

James Allan, PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

R. Manmatha, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
manmatha@cs.umass.edu
http://www.cs.umass.edu/~manmatha

David Smith, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
dasmith@cs.umass.edu
http://www.cs.umass.edu/~dasmith


Project Award Information

National Science Foundation Award Number: IIS - 0910884
Data-intensive Computing

Duration: 09/01/09 - 08/31/13


Project Summary


The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.

To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies:

  • improved OCR accuracy, through word spotting, probabilistic models using joint distributions of features, and topic-specific language models built across documents;
  • structural metadata extraction, to mine headers, chapters, tables of contents, and indices;
  • linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output;
  • inferred document relational structure, to mine citations, quotations, translations, and paraphrases;
  • latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres;
  • query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and
  • interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.

When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.

The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Graduate Students Involved in this Project:

Publications:

  • IR-743: (2009) Balasubramanian, N., Kumaran, G. and Carvalho, V., "Predicting Query Performance on the Web," in the Proceedings of the 33rd Annual International ACM SIGIR Conference (SIGIR-2010), pp. 785-786.
  • IR-745: (2009) Yi, X. and Allan, J., "A Content based Approach for Discovering Missing Anchor Text for Web Search," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 427-434.
  • IR-747: (2009) Cartright, M., Seo, J. and Lease, M., "UMass Amherst and UT Austin @ The TREC 2009 Relevance Feedback Track," in the Notebook Proceedings of the Text Retrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, Nov 17-20, 2009.
  • IR-756: (2010) Feild, H., Allan, J. and Jones, R., "Predicting Searcher Frustration," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 34-41, 2010.
  • IR-775: (2010) Balasubramanian, N. and Allan, J., "Learning to Select Rankers," in the Proceedings of the 33rd Annual ACM SIGIR Conference, pp. 855-856, 2010.
  • IR-776: (2010) Wu, X. and Smith, D., "Right-branching tree transformation for eager dependency parsing," CIIR Technical Report.
  • MM-791: (2010) Sankar, P., Jawahar, C. and Manmatha, R., "Nearest Neighbor based Collection OCR," in the Proceedings of the International Workshop on Document Analysis Systems (DAS 2010), pp. 207-214.
  • IR-802: (2011) Yi, X. and Allan, J., "Discovering Missing Click-through Query Language Information for Web Search," to appear in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, October 24-28, 2011.
  • IR-803: (2011) Aktolga, E. and Allan, J., "Reranking Search Results for Sparse Queries," to appear in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, October 24-28, 2011.
  • IR-808: (2011) Balasubramanian, N. and Allan, J., "Modeling Relative Effectiveness to Leverage Multiple Ranking Algorithms," CIIR Technical Report.
  • IR-809: (2011) Feild, H., Allan, J. and Glatt, J., "CrowdLogging: Distributed, private, and anonymous search logging," Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR'11), pp. 375-384.
  • IR-812: (2011) Cartright, M. and Allan, J., "Efficiency Optimizations for Interpolating Subqueries," to appear in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, October 24-28, 2011.
  • IR-823: (2011) Dalton, J., Allan, J. and Smith, D., "Passage Retrieval for Incorporating Global Evidence in Sequence Labeling," to appear in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), Glasgow, UK, October 24-28, 2011.
  • IR-832: (2011) Krstovski, K. and Smith, D., "A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs," in the Proceedings of WMT11 — Sixth Workshop on Statistical Machine Translation, pp. 207-216.
  • IR-834: (2011) Yi, X., "Discovering and Using Implicit Data for Information Retrieval," Ph.D. thesis, May, 2011.
  • IR-843: (2011) Cartright, M., Feild, H. and Allan, J., "Evidence Finding using a Collection of Books," to appear in the Proceedings of The International Conference on Information and Knowledge Management BooksOnline workshop (BooksOnline'11), Glasgow, Scotland, UK, October 28, 2011.
  • IR-853: (2011) Smith, D., Manmatha, R. and Allan, J., "Mining Relational Structure from Millions of Books: Position Paper," to appear in the CIKM Books Online Workshop, Glasgow, Scotland, U.K., October 24, 2011.
  • MM-807: (2011) Yalniz, I., Can, E. and Manmatha, R., "Partial Duplicate Detection for Large Book Collections," to appear in the Proceedings of CIKM 2011, Glasgow, UK, October 24-28, 2011.
  • MM-818: (2011) Yalniz, I. and Manmatha, R., "A Fast Alignment Scheme for Automatic OCR Evaluation of Books," to appear in the Proceedings of The International Conference on Document Analysis and Recognition (ICDAR), Beijing, China, September 18-21, 2011.
  • MM-835: (2011) Jain, R., Frinken, V., Jawahar, C. and Manmatha, R., "BLSTM Neural Network based Word Retrieval for Hindi Documents," to appear in the Proceedings of the International Conference on Document Analysis and Recognition (ICDAR'11), Beijing, China, September 18-21, 2011.



This work is supported by the National Science Foundation (Award Number 0910884).