Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR

A Collaborative Project with Tufts University and the Internet Archive

James Allan, PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

R. Manmatha, co-PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
manmatha@cs.umass.edu
http://www.cs.umass.edu/~manmatha

David Smith, co-PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
dasmith@cs.umass.edu
http://www.cs.umass.edu/~dasmith

Project Award Information

National Science Foundation Award Number: IIS - 0910884
Duration: 09/01/09 - 08/31/16

Project Summary

The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.

To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on seven interlocking technologies:

  • improved OCR accuracy, through word spotting, probabilistic models built on joint distributions of features, and topic-specific language models built across documents;
  • structural metadata extraction, to mine headers, chapters, tables of contents, and indices;
  • linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output;
  • inferred document relational structure, to mine citations, quotations, translations, and paraphrases;
  • latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres;
  • query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and
  • interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.

When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.

The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Graduate Students Involved in this Project:

Elif Aktolga
Niranjan Balasubramanian
Ethem Can
Marc Cartright
William Dabney
Jeff Dalton
Shiri Dori-Hacohen
Henry Feild
John Foley
Myung-ha Jang
Jiepu Jiang
Weize Kong
Kriste Krstovski
Pranav Mirajkar
Venkatesh Murthy
Patrick Verga
David Wemhoener
Xiaoye Wu
Ismet Zeki Yalniz
Xing Yi
Mao Zhao

Publications:

  • IR-743: (2009) Balasubramanian, N., Kumaran, G. and Carvalho, V., "Predicting Query Performance on the Web," in the Proceedings of the 33rd Annual International ACM SIGIR Conference (SIGIR-2010), pp. 785-786.
  • IR-745: (2009) Yi, X. and Allan, J., "A Content based Approach for Discovering Missing Anchor Text for Web Search," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 427-434.
  • IR-747: (2009) Cartright, M., Seo, J. and Lease, M., "UMass Amherst and UT Austin @ The TREC 2009 Relevance Feedback Track," in the Notebook Proceedings of the Text Retrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, Nov 17-20, 2009.
  • IR-756: (2010) Feild, H., Allan, J. and Jones, R., "Predicting Searcher Frustration," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 34-41, 2010.
  • IR-775: (2010) Balasubramanian, N. and Allan, J., "Learning to Select Rankers," in the Proceedings of the 33rd Annual ACM SIGIR Conference, pp. 855-856, 2010.
  • IR-776: (2010) Wu, X. and Smith, D., "Right-branching tree transformation for eager dependency parsing," CIIR Technical Report.
  • IR-781: (2010) Cartright, M., Allan, J., Lavrenko, V. and McGregor, A., "Fast Query Expansion Using Approximations of Relevance Models," Proceedings of the Conference on Information and Knowledge Management (CIKM 2010), pp. 1573-1576.
  • MM-791: (2010) Sankar, P., Jawahar, C. and Manmatha, R., "Nearest Neighbor based Collection OCR," in the Proceedings of the International Workshop on Document Analysis Systems (DAS 2010), pp. 207-214.
  • IR-802: (2011) Yi, X. and Allan, J., "Discovering Missing Click-through Query Language Information for Web Search," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 153-162.
  • IR-803: (2011) Aktolga, E. and Allan, J., "Reranking Search Results for Sparse Queries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 173-183.
  • IR-809: (2011) Feild, H., Allan, J. and Glatt, J., "CrowdLogging: Distributed, private, and anonymous search logging," Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR'11), pp. 375-384.
  • IR-812: (2011) Cartright, M. and Allan, J., "Efficiency Optimizations for Interpolating Subqueries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 297-306.
  • IR-823: (2011) Dalton, J., Allan, J. and Smith, D., "Passage Retrieval for Incorporating Global Evidence in Sequence Labeling," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 355-364.
  • IR-832: (2011) Krstovski, K. and Smith, D., "A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs," in the Proceedings of WMT11 — Sixth Workshop on Statistical Machine Translation, pp. 207-216.
  • IR-834: (2011) Yi, X., "Discovering and Using Implicit Data for Information Retrieval," Ph.D. thesis, May, 2011.
  • IR-843: (2011) Cartright, M., Feild, H. and Allan, J., "Evidence Finding using a Collection of Books," in the Proceedings of The International Conference on Information and Knowledge Management BooksOnline workshop (BooksOnline'11), pp. 11-18.
  • IR-853: (2011) Smith, D., Manmatha, R. and Allan, J., "Mining Relational Structure from Millions of Books: Position Paper," CIKM Books Online Workshop, Glasgow, Scotland, U.K., 2011.
  • IR-866: (2011) Balasubramanian, N., "Query-Dependent Selection of Retrieval Alternatives," Ph.D. Dissertation, University of Massachusetts Amherst.
  • IR-872: (2012) Kim, J., Feild, H. and Cartright, M., "Understanding Book Search Behavior on the Web," Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012), pp. 744-753.
  • IR-882: (2012) Feild, H., Cartright, M. and Allan, J., "The University of Massachusetts Amherst’s participation in the INEX 2011 Prove It Track," in the Proceedings of the Initiative for the Evaluation of XML Retrieval workshop (INEX'2011), Saarland, Germany, December 12-14, 2011.
  • IR-895: (2012) Bendersky, M. and Smith, D., "A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases," Workshop on Computational Linguistics for Literature, co-located with the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, Montréal, Québec, Canada, June 8, 2012.
  • IR-904: (2012) Feild, H. and Allan, J., "Task Aware Search Assistant," Proceedings of the 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), p. 1015.
  • IR-912: (2012) Cartright, M., Can, E., Dabney, W., Dalton, J., Krstovski, K., Wu, X., Yalniz, I., Allan, J., Manmatha, R. and Smith, D., "A Framework for Manipulating and Searching Multiple Retrieval Types," Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), p. 1001.
  • IR-914: (2012) Cartright, M., Dalton, J. and Allan, J., "Search and Exploration of Scanned Books," BooksOnline Workshop 2012 (co-located with CIKM 2012, Maui, Hawaii), pp. 9-10.
  • IR-916: (2012) Dalton, J. and Dietz, L., "Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track," in the Notebook Proceedings of The Twenty-First Text REtrieval Conference (TREC 2012), Gaithersburg, MD, USA, November 7-9, 2012.
  • IR-917: (2012) Dietz, L. and Dalton, J., "Across-Document Neighborhood Expansion: UMass at TAC KBP 2012 Entity Linking," in the Proceedings of the Text Analysis Conference, Gaithersburg, MD, USA, November 5-6, 2012.
  • IR-926: (2013) Feild, H. and Allan, J., "Task-aware query recommendation," Proceedings of the 36th Annual ACM SIGIR Conference (SIGIR 2013), Dublin, Ireland, July 28-August 1, 2013, pp. 83-92.
  • IR-936: (2013) Krstovski, K., Smith, D., Wallach, H. and McGregor, A., "Efficient Nearest-Neighbor Search in the Probability Simplex," in the Proceedings of the 4th International Conference on the Theory of Information Retrieval, 29 September - 2 October 2013, Copenhagen, Denmark.
  • IR-945: (2013) Krstovski, K. and Smith, D., "Online Polylingual Topic Models for Fast Document Translation Detection," in the Proceedings of the 8th Workshop on Statistical Machine Translation, ACL 2013, Sofia, Bulgaria, August 4-9, 2013.
  • IR-952: (2013) Kong, W., Aktolga, E. and Allan, J., "Improving Passage Ranking with User Behavior Information," in the Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, Oct. 27-Nov. 1, 2013.
  • IR-959: (2013) Feild, H., "Exploring Privacy and Personalization in Information Retrieval Applications," Ph.D. Thesis, University of Massachusetts Amherst, 2013.
  • MM-807: (2011) Yalniz, I., Can, E. and Manmatha, R., "Partial Duplicate Detection for Large Book Collections," in the Proceedings of CIKM 2011, pp. 469-474.
  • MM-818: (2011) Yalniz, I. and Manmatha, R., "A Fast Alignment Scheme for Automatic OCR Evaluation of Books," in the Proceedings of The International Conference on Document Analysis and Recognition (ICDAR '11), pp. 754-758.
  • MM-835: (2011) Jain, R., Frinken, V., Jawahar, C. and Manmatha, R., "BLSTM Neural Network based Word Retrieval for Hindi Documents," in the Proceedings of the International Conference on Document Analysis and Recognition (ICDAR '11), pp. 83-87.
  • MM-857: (2011) Yalniz, I. and Manmatha, R., "An efficient framework for searching text in noisy document images," Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS'12), pp. 48-52.
  • MM-871: (2012) Yalniz, I. and Manmatha, R., "Finding Translations in Scanned Book Collections," in the Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), Portland, OR, August 12-16, 2012, pp. 465-474.
  • MM-905: (2012) Fernandez, D., Lladós, J., Fornés, A. and Manmatha, R., "On Influence of Line Segmentation in Efficient Word Segmentation in Old Manuscripts," Proceedings of the International Conference on Frontiers of Handwriting Recognition (ICFHR 2012), pp. 759-764.
  • IR-969: (2013) Dietz, L. and Dalton, J., "UMass at TREC 2013 Knowledge Base Acceleration Track: Bi-directional Entity Linking and Time-aware Evaluation," Notebook Proceedings of the Text Retrieval Conference, Gaithersburg, MD, USA, November 20-22, 2013.
  • IR-970: (2013) Dalton, J. and Dietz, L., "UMass CIIR at TAC KBP 2013 Entity Linking: Query Expansion using Urban Dictionary," Notebook Proceedings of the Text Analysis Conference, Gaithersburg, MD, USA, November 19-20, 2013.
  • IR-977: (2014) Dalton, J., Dietz, L. and Allan, J., "Entity Query Feature Expansion using Knowledge Base Links," Proceedings of the 37th Annual International ACM SIGIR conference, Gold Coast, Queensland, Australia, July 6-11, 2014, pp. 365-374.
  • IR-984: (2014) Cartright, M., "Query-Time Optimization Techniques for Structured Queries in Information Retrieval," Ph.D. Thesis, University of Massachusetts Amherst, August 2013.
  • IR-985: (2014) Jiang, J. and Allan, J., "Necessary and Frequent Terms in Queries," Proceedings of the 37th Annual International ACM SIGIR conference (SIGIR 2014), Gold Coast, Queensland, Australia, July 6-11, 2014, pp. 1167-1170.
  • IR-992: (2014) Aktolga, E., "Integrating Non-Topical Aspects into Information Retrieval," Ph.D. Thesis, University of Massachusetts Amherst.
  • IR-999: (2014) Foley, J. and Allan, J., "Retrieving Time from Scanned Books," in Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), Vienna, Austria, March 29 - April 2, 2015, pp. 221-232.
  • IR-1014: (2014) Jang, M., Allan, J. and Choi, J., "Identification of Non-textual Components in Technical Documents," submitted to the Proceedings of the Association for the Advancement of Artificial Intelligence conference (AAAI 16), Phoenix, Arizona USA, February 12–17, 2016.
  • MM-974: (2014) Murthy, V., Can, E. and Manmatha, R., "A Hybrid Model for Automatic Image Annotation," Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), Glasgow University, Scotland, April 1-4, 2014, p. 369.
  • MM-982: (2014) Yalniz, I., "Efficient representation and matching of texts and images in scanned book collections," Ph.D. Thesis, University of Massachusetts Amherst, February 2014.
  • IR-1026: (2015) Murthy, V., Maji, S. and Manmatha, R., "Automatic Image Annotation using Deep Learning Representations," in the Proceedings of ICMR 2015, Shanghai, China, June 23-26, 2015.
  • IR-1039: (2015) Allan, J. and Wemhoener, D., "Balancing Aspects in Retrieved Search Results," in the Proceedings of the International Conference on Theoretical Information Retrieval, Northampton, Massachusetts, September 27 - October 1, 2015.
  • IR-1053: (2016) Jiang, J. and Allan, J., "Correlation Between System and User Metrics in a Session," to appear in the Proceedings of the first ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2016) Chapel Hill, North Carolina, USA, March 13-17, 2016.

This work is supported by the National Science Foundation (Award Number 0910884). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.