Handwriting Retrieval Demonstrations

Introduction and Instructions

Center for Intelligent Information Retrieval
University of Massachusetts Amherst

Introduction

This page contains a brief introduction to the handwritten historical document retrieval systems that have been developed at the Center for Intelligent Information Retrieval. These systems retrieve actual handwritten pages (not transcriptions) given text queries. To the best of our knowledge, these are the first off-line handwriting retrieval systems. All of the demonstration systems described here have been built on a subset of the George Washington collection at the Library of Congress. We are providing a small portion of this collection with ground truth on the CIIR download page (look for Word Image Data Sets).


Example page from the George Washington collection.
Screenshot of the line retrieval demo system.

Instructions

Querying

All of the demonstration systems below allow you to enter one or more English query terms as plain ASCII text (most image-based retrieval systems require the query itself to be an image). The results are returned as a ranked list of retrieval units, which are either lines or pages.

Good Query Terms

When querying, keep in mind that the document subset we used for our demonstration systems consists mostly of orders that Washington sent to various places. In fact, most of these letters contain the header Letters, Orders and Instructions, as in the page above. Query terms that are likely to yield results include military terms (e.g. regiment), place names (e.g. Alexandria), organizational terms (e.g. provision) and similar ones. A good place to pick query terms is the documents themselves: try retrieving a page with one of the above terms and pick other query terms from there.

Query Confidence Scores

When you query, you may see a query confidence indicator bar. It is meant to give you an idea of how well the system expects to do for the given query term. If the bar is less than half full, the system has very little training data for that word. The more training data the system has for the given word, the better the retrieval performance will be when using that query.

Stemming

When you enter a query term, it is stemmed. That is, whether you search for recruit, recruits, recruiting, etc., you will get exactly the same results. This is useful for finding words that are morphologically similar to the query.
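To make the effect of stemming concrete, here is a minimal sketch in Python. It is not the stemmer used by the demonstration systems, which is not specified on this page; the suffix list and the names toy_stem and normalize_query are assumptions made up for this illustration.

```python
# Illustrative only: a crude suffix stripper, not the demo systems' stemmer.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed suffix list for this sketch


def toy_stem(word: str) -> str:
    """Lowercase a word and strip at most one common English suffix."""
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def normalize_query(query: str) -> list:
    """Stem every whitespace-separated term of a query."""
    return [toy_stem(term) for term in query.split()]


if __name__ == "__main__":
    # All three variants reduce to the same stem, so they would retrieve
    # the same results in a system that matches on stems.
    print([toy_stem(w) for w in ("recruit", "recruits", "recruiting")])
```

Because all variants map to one stem before matching, the ranked list does not depend on which variant was typed.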

Retrieval Results

In the case of line retrieval, at each rank an entire line is returned. Clicking on any of the word images in a line will take you to the original page image.

When doing page retrieval, you will see a page thumbnail on the left-hand side, and snippets on the right-hand side. For each query term that you entered, there will be one snippet. A snippet consists of the word image on the returned page that has the highest probability of matching the corresponding query term. To the left and right of the matching term, several words are returned to provide some context. The snippets are intended to help the user decide whether an entire page is relevant to the query.
Clicking on the page thumbnail will show an image of the entire page; clicking on a snippet word will show the page image with a box marking the word that was selected.
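The snippet selection described above can be sketched as follows. This is a simplified illustration, not the demo's code: it assumes that each page is available as a list of per-word match probabilities in reading order, and the function names and the context width of two words are hypothetical.

```python
def best_snippet(word_probs, context=2):
    """Pick the word image most likely to match one query term.

    word_probs: match probabilities, one per word image, in reading order
                (an assumed representation for this sketch).
    Returns (start, best, end); word images start..end-1 form the snippet
    and `best` is the index of the matching word.
    """
    best = max(range(len(word_probs)), key=word_probs.__getitem__)
    start = max(0, best - context)                   # clip at start of page
    end = min(len(word_probs), best + context + 1)   # clip at end of page
    return start, best, end


def snippets_for_query(probs_by_term, context=2):
    """Build one snippet per query term, as in the page retrieval display."""
    return {term: best_snippet(probs, context)
            for term, probs in probs_by_term.items()}


if __name__ == "__main__":
    page = {"regiment": [0.1, 0.05, 0.7, 0.2, 0.3, 0.01]}
    print(snippets_for_query(page))
```

Clipping the window at the page boundaries keeps the snippet valid when the best match is the first or last word on the page.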

Result Browsing

The ranked list is displayed several ranks at a time on one page. At the bottom of each page, you can navigate the list by going forward and backward one page.

Demonstration Systems

We have built the following three prototype retrieval systems:

Line retrieval: small collection of 20 pages total. This collection was automatically segmented into words. A manually corrected version of this dataset with high-quality segmentation information can be found on the CIIR download page (click on Word Image Data Sets).

Try me first! Page retrieval using probabilistic annotation: large collection of 1000 pages, fast query response.

Page retrieval using Kullback-Leibler scoring (will be available later): 1000 pages; the collection is searched in real time, resulting in a query response time of about 50 seconds.
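For readers unfamiliar with Kullback-Leibler scoring, the following is a generic sketch of ranking pages by the divergence between a query distribution and each page's distribution over a vocabulary. It is not the implementation behind the demonstration system, whose models are not described on this page; the dictionary representation, the function names, and the small floor value eps are assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum over terms of p(t) * log(p(t) / q(t)).

    p and q map vocabulary terms to probabilities. Terms missing from q
    are floored at eps (an assumed smoothing choice) to avoid log(0).
    """
    total = 0.0
    for term, p_t in p.items():
        if p_t > 0.0:
            total += p_t * math.log(p_t / max(q.get(term, 0.0), eps))
    return total


def rank_pages(query_model, page_models):
    """Return page ids ordered from smallest to largest divergence."""
    return sorted(page_models,
                  key=lambda pid: kl_divergence(query_model, page_models[pid]))


if __name__ == "__main__":
    query = {"regiment": 0.8, "orders": 0.2}
    pages = {
        "page_a": {"regiment": 0.7, "orders": 0.3},
        "page_b": {"regiment": 0.1, "orders": 0.9},
    }
    print(rank_pages(query, pages))
```

A smaller divergence means the page's distribution is closer to the query's, so such pages appear earlier in the ranked list. Scoring every page at query time in this way is consistent with a slower response than a precomputed index.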

All of the above systems use the Cross-Media Relevance Modeling approach. Below is a list of all the retrieval techniques that we are currently investigating.

Retrieval Approaches

We are currently investigating three different approaches to handwritten historical document retrieval:

Word Spotting: An approach based on word image matching.
Cross-Media Relevance Modeling: The approach that the demonstration systems are based on.
Recognition and Retrieval: Documents are recognized so that standard text retrieval techniques can be applied to the recognition result.

Acknowledgments

Funding

This work was supported in part by the Center for Intelligent Information Retrieval at the University of Massachusetts Amherst and in part by the National Science Foundation under grant number IIS-9909073.

Data

We would like to thank the Library of Congress for providing digitized images of George Washington's original manuscripts.

People

Main contributions: Toni M. Rath, R. Manmatha and Victor Lavrenko.
N. Srimal and Jamie Rothfeder worked on the automatic segmentation code that was used in our experiments.