Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR
A Collaborative Project with Tufts University and the Internet Archive
James Allan, PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
allan@cs.umass.edu
http://www.cs.umass.edu/~allan
R. Manmatha, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
manmatha@cs.umass.edu
http://www.cs.umass.edu/~manmatha
David Smith, co-PI
Department of Computer Science
140 Governors Drive
University of Massachusetts
Amherst, Massachusetts 01003-4610
dasmith@cs.umass.edu
http://www.cs.umass.edu/~dasmith
Project Award Information
National Science Foundation Award Number: IIS - 0910884
Data-intensive Computing
Duration: 09/01/09 - 08/31/13
Project Summary
The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.
To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.
When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.
The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.
Graduate Students Involved in this Project:
Publications:
- IR-743: (2009) Balasubramanian, N., Kumaran, G. and Carvalho, V., "Predicting Query Performance on the Web," in the Proceedings of the 33rd Annual International ACM SIGIR Conference (SIGIR-2010), pp. 785-786.
- IR-745: (2009) Yi, X. and Allan, J., "A Content based Approach for Discovering Missing Anchor Text for Web Search," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 427-434.
- IR-747: (2009) Cartright, M., Seo, J. and Lease, M., "UMass Amherst and UT Austin @ The TREC 2009 Relevance Feedback Track," in the Notebook Proceedings of the Text Retrieval Conference (TREC 2009), Gaithersburg, Maryland, USA, Nov 17-20, 2009.
- IR-756: (2010) Feild, H., Allan, J. and Jones, R., "Predicting Searcher Frustration," in the Proceedings of the 33rd Annual ACM SIGIR Conference (SIGIR 2010), pp. 34-41, 2010.
- IR-775: (2010) Balasubramanian, N. and Allan, J., "Learning to Select Rankers," in the Proceedings of the 33rd Annual ACM SIGIR Conference, pp. 855-856, 2010.
- IR-776: (2010) Wu, X. and Smith, D., "Right-branching tree transformation for eager dependency parsing," CIIR Technical Report.
- IR-781: (2010) Cartright, M., Allan, J., Lavrenko, V. and McGregor, A., "Fast Query Expansion Using Approximations of Relevance Models," Proceedings of the Conference on Information and Knowledge Management (CIKM 2010), pp. 1573-1576.
- MM-791: (2010) Sankar, P., Jawahar, C. and Manmatha, R., "Nearest Neighbor based Collection OCR," in the Proceedings of the International Workshop on Document Analysis Systems (DAS 2010), pp. 207-214.
- IR-802: (2011) Yi, X. and Allan, J., "Discovering Missing Click-through Query Language Information for Web Search," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 153-162.
- IR-803: (2011) Aktolga, E. and Allan, J., "Reranking Search Results for Sparse Queries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 173-183.
- IR-809: (2011) Feild, H., Allan, J. and Glatt, J., "CrowdLogging: Distributed, private, and anonymous search logging," Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR'11), pp. 375-384.
- IR-812: (2011) Cartright, M. and Allan, J., "Efficiency Optimizations for Interpolating Subqueries," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 297-306.
- IR-823: (2011) Dalton, J., Allan, J. and Smith, D., "Passage Retrieval for Incorporating Global Evidence in Sequence Labeling," in the Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM '11), pp. 355-364.
- IR-832: (2011) Krstovski, K. and Smith, D., "A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs," in the Proceedings of WMT11 — Sixth Workshop on Statistical Machine Translation, pp. 207-216.
- IR-834: (2011) Yi, X., "Discovering and Using Implicit Data for Information Retrieval," Ph.D. thesis, May, 2011.
- IR-843: (2011) Cartright, M., Feild, H. and Allan, J., "Evidence Finding using a Collection of Books," in the Proceedings of The International Conference on Information and Knowledge Management BooksOnline workshop (BooksOnline'11), pp. 11-18.
- IR-853: (2011) Smith, D., Manmatha, R. and Allan, J., "Mining Relational Structure from Millions of Books: Position Paper," CIKM Books Online Workshop, Glasgow, Scotland, U.K., 2011.
- IR-866: (2011) Balasubramanian, N., "Query-Dependent Selection of Retrieval Alternatives," Ph.D. Dissertation, University of Massachusetts Amherst.
- IR-882: (2012) Feild, H., Cartright, M. and Allan, J., "The University of Massachusetts Amherst’s participation in the INEX 2011 Prove It Track," in the Proceedings of the Initiative for the Evaluation of XML Retrieval workshop (INEX'2011), Saarland, Germany, December 12-14, 2011.
- IR-895: (2012) Bendersky, M. and Smith, D., "A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases," Workshop on Computational Linguistics for Literature, co-located with the 2012 Conference of the North American Chapter of the Association for Computational Linguistics, Montréal, Québec, Canada, June 8, 2012.
- MM-807: (2011) Yalniz, I., Can, E. and Manmatha, R., "Partial Duplicate Detection for Large Book Collections," in the Proceedings of CIKM 2011, pp. 469-474.
- MM-818: (2011) Yalniz, I. and Manmatha, R., "A Fast Alignment Scheme for Automatic OCR Evaluation of Books," in the Proceedings of The International Conference on Document Analysis and Recognition (ICDAR '11), pp. 754-758.
- MM-835: (2011) Jain, R., Frinken, V., Jawahar, C. and Manmatha, R., "BLSTM Neural Network based Word Retrieval for Hindi Documents," in the Proceedings of the International Conference on Document Analysis and Recognition, (ICDAR'11) pp. 83-87.
- MM-857: (2011) Yalniz, I. and Manmatha, R., "An efficient framework for searching text in noisy document images," Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS'12), pp. 48-52.
- MM-871: (2012) Yalniz, I. and Manmatha, R., "Finding Translations in Scanned Book Collections," to appear in the Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), Portland, OR, August 12-16, 2012.
This work is supported by the National Science Foundation (Award Number 0910884).