Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR

A Collaborative Project with Tufts University and the Internet Archive

James Allan, PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
allan@cs.umass.edu
http://www.cs.umass.edu/~allan

R. Manmatha, co-PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
manmatha@cs.umass.edu
http://www.cs.umass.edu/~manmatha

David Smith, co-PI
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-9264
dasmith@cs.umass.edu
http://www.cs.umass.edu/~dasmith

Project Award Information

National Science Foundation Award Number: IIS - 0910884
Duration: 09/01/09 - 08/31/16

Project Summary

The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.

To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.

When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.

The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Outcomes
This project was a collaborative effort between the University of Massachusetts Amherst, Tufts University, and the Internet Archive. It developed and evaluated techniques for processing the large amounts of data contained in massive collections of scanned books. One of the challenges with collections of this type is cleaning up the data so that it can be used by people and computer systems. To address that challenge, we developed an efficient approach for recognizing the language of a book in the presence of substantial text recognition (OCR) errors. When we applied the technique to 3.6 million scanned books from the Internet Archive, we found that more than 30,000 were assigned the wrong language, often resulting in lower-quality text recognition. We also developed a number of approaches for finding duplicated material in the collection: not just complete copies of books, but smaller works that are reproduced in other collections. These techniques were largely based on using the sequence of words that occur only once in a given book. Books that overlap heavily in their sequence of unique words are (partial) copies of each other. Using unique words also makes for efficient processing. For example, we analyzed 2.2 million books to find 230 copies of Hamlet included in collected works of Shakespeare, in collections of Elizabethan drama, in anthologies of British literature, and so on. An extended version of this approach allowed us to recognize translations with reasonable accuracy. A different version of this text-reuse analysis allowed us to identify common quotations from canonical works prepared by project partners at Tufts’ Perseus Digital Library.
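The unique-word idea described above can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names, the tokenizer, and the scoring rule (matched unique words divided by the shorter sequence's length) are assumptions made for illustration, and Python's `difflib.SequenceMatcher` stands in for whatever sequence alignment the project actually used.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def unique_word_sequence(text):
    """Return, in reading order, the words that occur exactly once in text."""
    # Hypothetical tokenizer: lowercase runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


def overlap_score(text_a, text_b):
    """Score in [0, 1]: share of the shorter unique-word sequence that aligns.

    Normalizing by the shorter sequence is an assumption chosen so that a
    short work contained in a longer one can still score highly.
    """
    seq_a = unique_word_sequence(text_a)
    seq_b = unique_word_sequence(text_b)
    if not seq_a or not seq_b:
        return 0.0
    # autojunk=False disables difflib's popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(seq_a), len(seq_b))
```

Because each book is reduced to a comparatively short sequence of once-occurring words, a comparison touches far fewer tokens than a full-text alignment would, which is consistent with the efficiency remark above. A real system would also need a threshold on the score and a way to avoid comparing every pair of books; neither is shown here.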

Another line of work supported by this project investigated approaches to understand and improve state-of-the-art retrieval capabilities on large-scale collections. We showed that a widely used query expansion technique (adding synonyms and related words to a query) can be recast so that it runs efficiently with pre-processing of the data. We developed methods to extend facet-based query refinement (found in, e.g., shopping sites to restrict results by manufacturer, price range, or color) to the unstructured web setting. We demonstrated that sequences of long queries, often created as part of automated search processes, can be handled efficiently by remembering and incorporating components from past queries, assembling the already-run pieces carefully to handle new queries. We developed new methods for rapidly searching large collections for highly similar sets of documents represented as points in continuous space. We showed that interaction features such as clicks or anchor text used to rank documents (e.g., web pages, medical informatics material) can be transferred from those similar documents with success, allowing less popular material (fewer clicks and fewer links) to rise in the ranking rather than being buried by more frequently accessed but sometimes less relevant items.

A final area of work that this project enabled addressed the kinds of data pre-processing that are necessary to enable efficient search or rich presentation of documents. We developed a number of approaches for recognizing mentions of people, places, and things in text and linking them to an external explanatory database such as Wikipedia – for example, a news story mentioning “President Bush” might be linked to the Wikipedia article for the correct Bush. We designed approaches that use expensive linguistic processing and machine learning as well as approaches that are almost as accurate but that use substantially faster processing. We also explored new techniques for learning to automatically annotate images with keywords and phrases that describe what is in them.

All of the research carried out within this project investigated ways to process massive collections of text, often in the presence of scanning and text recognition errors, in order to make the collection available for searching, browsing, and data mining. When possible, evaluation data has been made freely available on the web for use by other researchers.

This project resulted in 51 refereed conference publications, supported the training of 21 graduate students, resulted in six PhD dissertations, and laid the foundation for technology transition work supported by non-government funding.

Graduate Students Involved in this Project:

Elif Aktolga
Niranjan Balasubramanian
Ethem Can
Marc Cartright
William Dabney
Jeff Dalton
Shiri Dori-Hacohen
Henry Feild
John Foley
Myung-ha Jang
Jiepu Jiang
Weize Kong
Kriste Krstovski
Pranav Mirajkar
Venkatesh Murthy
Patrick Verga
David Wemhoener
Xiaoye Wu
Ismet Zeki Yalniz
Xing Yi
Mao Zhao

Publications:

  • IR-743: (2009) Balasubramanian, N., Kumaran, G. and Carvalho, V., "Predicting Query Performance on the Web," in the Proceedings of the 33rd Annual International ACM SIGIR Conference (SIGIR-2010), pp. 785-786.
  • IR-745: (2009) Yi, X. and Allan, J., "A Content based Approach for Discovering Missing Anchor Text for Web Search," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 427-434.
  • IR-747: (2009) Cartright, M., Seo, J. and Lease, M., "UMass Amherst and UT Austin @ The TREC 2009 Relevance Feedback Track," in the Notebook Proceedings of the Text Retrieval Conference (TREC 2009) Gaithersburg, Maryland, USA, Nov 17-20, 2009.
  • IR-756: (2010) Feild, H., Allan, J. and Jones, R., "Predicting Searcher Frustration," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 34-41, 2010.
  • IR-775: (2010) Balasubramanian, N. and Allan, J., "Learning to Select Rankers," in the Proceedings of the 33rd Annual ACM SIGIR Conference, pp. 855-856, 2010.
  • IR-776: (2010) Wu, X. and Smith, D., "Right-branching tree transformation for eager dependency parsing," CIIR Technical Report.
  • IR-781: (2010) Cartright, M., Allan, J., Lavrenko, V. and McGregor, A., "Fast Query Expansion Using Approximations of Relevance Models," Proceedings of the Conference on Information and Knowledge Management (CIKM 2010), pp. 1573-1576.
  • MM-791: (2010) Sankar, P., Jawahar, C. and Manmatha, R., "Nearest Neighbor based Collection OCR," in the Proceedings of the International Workshop on Document Analysis Systems (DAS 2010), pp. 207-214.
  • IR-802: (2011) Yi, X. and Allan, J., "Discovering Missing Click-through Query Language Information for Web Search," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 153-162.
  • IR-803: (2011) Aktolga, E. and Allan, J., "Reranking Search Results for Sparse Queries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 173-183.
  • IR-809: (2011) Feild, H., Allan, J. and Glatt, J., "CrowdLogging: Distributed, private, and anonymous search logging," Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR'11), pp. 375-384.
  • IR-812: (2011) Cartright, M. and Allan, J., "Efficiency Optimizations for Interpolating Subqueries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 297-306.
  • IR-823: (2011) Dalton, J., Allan, J. and Smith, D., "Passage Retrieval for Incorporating Global Evidence in Sequence Labeling," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 355-364.
  • IR-832: (2011) Krstovski, K. and Smith, D., "A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs," in the Proceedings of WMT11 — Sixth Workshop on Statistical Machine Translation, pp. 207-216.
  • IR-834: (2011) Yi, X., "Discovering and Using Implicit Data for Information Retrieval," Ph.D. thesis, May, 2011.
  • IR-843: (2011) Cartright, M., Feild, H. and Allan, J., "Evidence Finding using a Collection of Books," in the Proceedings of The International Conference on Information and Knowledge Management BooksOnline workshop (BooksOnline'11), pp. 11-18.
  • IR-853: (2011) Smith, D., Manmatha, R. and Allan, J., "Mining Relational Structure from Millions of Books: Position Paper," CIKM Books Online Workshop, Glasgow, Scotland, U.K., 2011.
  • IR-866: (2011) Balasubramanian, N., "Query-Dependent Selection of Retrieval Alternatives," Ph.D. Dissertation, University of Massachusetts Amherst.
  • IR-872: (2012) Kim, J., Feild, H. and Cartright, M., "Understanding Book Search Behavior on the Web," Proceedings of the 21st ACM international conference on Information and knowledge management, (CIKM 2012), pp. 744-753.
  • IR-882: (2012) Feild, H., Cartright, M. and Allan, J., "The University of Massachusetts Amherst’s participation in the INEX 2011 Prove It Track," In the Proceedings of the Initiative for the Evaluation of XML Retrieval workshop (INEX'2011) Saarland, Germany, December 12-14, 2011.
  • IR-895: (2012) Bendersky, M. and Smith, D., "A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases," Workshop on Computational Linguistics for Literature, co-located with the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, Montréal, Québec, Canada, June 8, 2012.
  • IR-904: (2012) Feild, H. and Allan, J., "Task Aware Search Assistant," Proceedings of the 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 1015.
  • IR-912: (2012) Cartright, M., Can, E., Dabney, W., Dalton, J., Krstovski, K., Wu, X., Yalniz, I., Allan, J., Manmatha, R. and Smith, D., "A Framework for Manipulating and Searching Multiple Retrieval Types," Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), pp. 1001.
  • IR-914: (2012) Cartright, M., Dalton, J. and Allan, J., "Search and Exploration of Scanned Books," BooksOnline Workshop 2012 (co-located with CIKM 2012, Maui, Hawaii) pp. 9-10.
  • IR-916: (2012) Dalton, J. and Dietz, L., "Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track," in the Notebook Proceedings of The Twenty first Text REtrieval Conference (TREC 2012), Gaithersburg, MD, USA, November 7-9, 2012.
  • IR-917: (2012) Dietz, L. and Dalton, J., "Across-Document Neighborhood Expansion: UMass at TAC KBP 2012 Entity Linking," in the Proceedings of the Text Analysis Conference, Gaithersburg, MD, USA, November 5-6, 2012.
  • IR-926: (2013) Feild, H. and Allan, J., "Task-aware query recommendation," Proceedings of the 36th Annual ACM SIGIR Conference (SIGIR 2013), Dublin, Ireland, July 28-August 1, 2013, pp. 83-92.
  • IR-936: (2013) Krstovski, K., Smith, D., Wallach, H. and McGregor, A., "Efficient Nearest-Neighbor Search in the Probability Simplex," in the Proceedings of the 4th International Conference on the Theory of Information Retrieval, 29 September - 2 October 2013, Copenhagen, Denmark.
  • IR-945: (2013) Krstovski, K. and Smith, D., "Online Polylingual Topic Models for Fast Document Translation Detection," in the Proceedings of the 8th Workshop on Statistical Machine Translation, ACL 2013, Sofia, Bulgaria, August 4-9 2013.
  • IR-952: (2013) Kong, W., Aktolga, E. and Allan, J., "Improving Passage Ranking with User Behavior Information," in the Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, Oct. 27-Nov. 1, 2013.
  • IR-1051: (2016) Kong, W. and Allan, J., "Precision-Oriented Query Facet Extraction," in the Proceedings of the 25th ACM International Conference on Information and Knowledge Management, October 24-28, 2016, Indianapolis, IN, pp. 1433-1442.
  • IR-1074: (2014) Dalton, J., "Entity-Based Enrichment For Information Extraction And Retrieval," Ph.D. Thesis, University of Massachusetts Amherst.
  • IR-959: (2013) Feild, H., "Exploring Privacy and Personalization in Information Retrieval Applications," Ph.D. Thesis, University of Massachusetts Amherst, 2013.
  • MM-807: (2011) Yalniz, I., Can, E. and Manmatha, R., "Partial Duplicate Detection for Large Book Collections," in the Proceedings of CIKM 2011, pp. 469-474.
  • MM-818: (2011) Yalniz, I. and Manmatha, R., "A Fast Alignment Scheme for Automatic OCR Evaluation of Books," in the Proceedings of The International Conference on Document Analysis and Recognition (ICDAR '11), pp. 754-758.
  • MM-835: (2011) Jain, R., Frinken, V., Jawahar, C. and Manmatha, R., "BLSTM Neural Network based Word Retrieval for Hindi Documents," in the Proceedings of the International Conference on Document Analysis and Recognition, (ICDAR'11) pp. 83-87.
  • MM-857: (2011) Yalniz, I. and Manmatha, R., "An efficient framework for searching text in noisy document images," Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS'12), pp. 48-52.
  • MM-871: (2012) Yalniz, I. and Manmatha, R., "Finding Translations in Scanned Book Collections," in the Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), Portland, OR, August 12-16, 2012, pp. 465-474.
  • MM-905: (2012) Fernandez, D., Lladós, J., Fornés, A. and Manmatha, R., "On Influence of Line Segmentation in Efficient Word Segmentation in Old Manuscripts," Proceedings of the International Conference on Frontiers of Handwriting Recognition (ICFHR 2012), pp. 759-764.
  • IR-969: (2013) Dietz, L. and Dalton, J., "UMass at TREC 2013 Knowledge Base Acceleration Track: Bi-directional Entity Linking and Time-aware Evaluation," Notebook Proceedings of the Text Retrieval Conference, Gaithersburg, MD, USA, November 20-22, 2013.
  • IR-970: (2013) Dalton, J. and Dietz, L., "UMass CIIR at TAC KBP 2013 Entity Linking: Query Expansion using Urban Dictionary," Notebook Proceedings of the Text Analysis Conference, Gaithersburg, MD, USA, November 19-20, 2013.
  • MM-974: (2014) Murthy, V., Can, E. and Manmatha, R., "A Hybrid Model for Automatic Image Annotation," Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), Glasgow University, Scotland, April 1-4, 2014, pp. 369.
  • MM-982: (2014) Yalniz, I., "Efficient representation and matching of texts and images in scanned book collections," Ph.D. Thesis, University of Massachusetts Amherst, February 2014.
  • IR-977: (2014) Dalton, J., Dietz, L. and Allan, J., "Entity Query Feature Expansion using Knowledge Base Links," Proceedings of the 37th Annual International ACM SIGIR conference, Gold Coast, Queensland, Australia, July 6-11, 2014, pp. 365-374.
  • IR-984: (2014) Cartright, M., "Query-Time Optimization Techniques for Structured Queries in Information Retrieval," Ph.D. Thesis, University of Massachusetts Amherst, August 2013.
  • IR-985: (2014) Jiang, J. and Allan, J., "Necessary and Frequent Terms in Queries," Proceedings of the 37th Annual International ACM SIGIR conference (SIGIR 2014), Gold Coast, Queensland, Australia, July 6-11, 2014, pp. 1167-1170.
  • IR-992: (2014) Aktolga, E., "Integrating Non-Topical Aspects into Information Retrieval," Ph.D. Thesis, University of Massachusetts Amherst.
  • IR-999: (2014) Foley, J. and Allan, J., "Retrieving Time from Scanned Books," in Proceedings of the 37th European Conference on Information Retrieval (ECIR 2015), Vienna, Austria, March 29 - April 2, 2015, pp. 221-232.
  • IR-1000: (2016) Krstovski, K. and Smith, D., "Bootstrapping Translation Detection and Sentence Extraction from Comparable Corpora," Proceedings of NAACL 2016, San Diego, California, June 12 to June 17, 2016, pp. 1127–1132.
  • IR-1023: (2015) Foley, J. and Allan, J., "Classifying exam questions into a subject-specific concept hierarchy," Proceedings of 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, March 20-23, 2016, pp. 575-586.
  • IR-1024: (2015) Jiang, J. and Allan, J., "Reducing click and skip errors in search result ranking," in Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM '16), San Francisco, California, USA. February 22-25, 2016, pp. 183-192.
  • IR-1026: (2015) Murthy, V., Maji, S. and Manmatha, R., "Automatic Image Annotation using Deep Learning Representations," in the Proceedings of ICMR 2015, Shanghai, China, June 23-26, 2015, pp. 603-606.
  • IR-1033: (2016) Jiang, J. and Allan, J., "Adaptive Effort for Search Evaluation Metrics," in Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, March 20-23, 2016, pp. 187-199.
  • IR-1039: (2015) Allan, J. and Wemhoener, D., "Balancing Aspects in Retrieved Search Results," in the Proceedings of the International Conference on Theoretical Information Retrieval, Northampton, Massachusetts, September 27 - October 1, 2015, pp. 305-308.
  • IR-1053: (2016) Jiang, J. and Allan, J., "Correlation Between System and User Metrics in a Session," in the Proceedings of the first ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2016) Chapel Hill, North Carolina, USA, March 13-17, 2016, pp. 285-288.
  • IR-1054: (2016) Foley, J., O'Connor, B. and Allan, J., "Improving Entity Ranking for Keyword Queries," in the Proceedings of The 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, IN, Oct. 24-28, 2016, pp. 2061-2064.
  • IR-1061: (2016) Jang, M. and Allan, J., "Improving Automated Controversy Detection on the Web," Proceedings of The International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’16), Pisa, Italy, July 17-21, 2016, pp. 865-868.
  • IR-1068: (2016) Jang, M., Foley, J., Dori-Hacohen, S. and Allan, J., "Probabilistic Approaches to Controversy Detection," in the Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, United States, October 24-28, 2016, pp. 2069-2072.
  • IR-1077: (2016) Kong, W., "Extending Faceted Search to the Open-Domain Web," Ph.D. Thesis, University of Massachusetts Amherst, May 2016.

This work is supported by the National Science Foundation (Award Number 0910884). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.