This page contains a brief introduction to the handwritten historical
document retrieval systems that have been developed at the Center for
Intelligent Information Retrieval. These systems retrieve actual
handwritten pages (not transcriptions) given text queries. To the best
of our knowledge, these are the first off-line handwriting retrieval
systems. All of the demonstration systems described here have been
built on a subset of the George Washington collection at the Library
of Congress. We are providing a small portion of this collection with
ground truth on the CIIR
download page (look for Word Image Data Sets).
Example page from the George Washington collection.
Screenshot of the line retrieval demo system.
Instructions
Querying
All of the demonstration systems below let you
enter one or more English query terms as ASCII text (most
image-based retrieval systems instead require the query itself to be an
image). The results are returned as a ranked list of retrieval units,
which are either lines or pages.
Good Query Terms
When querying you should keep in mind that the particular document
subset we used for our demonstration systems is mostly about orders
that Washington sent to various places. In fact, most of these letters
contain the header Letters, Orders and Instructions as in the
page above. Query terms that are likely to yield results include
military terms (e.g. regiment), place names
(e.g. Alexandria), organizational terms
(e.g. provision) and the like. A good source of query
terms is the documents themselves: try retrieving a page with one of
the above terms and pick further query terms from it.
Query Confidence Scores
When you query, you may see a query confidence indicator
bar. It is meant to give you an idea of how well the system expects to
do for the given query term. If the bar is less than half full, the
amount of training data for that term is very small. The more training
data the system has for the given word, the better the retrieval
performance will be when using that query.
Stemming
When you enter a query term, it is stemmed. That is, searching for recruit, recruits, recruiting, etc. returns exactly the same results. This is useful for finding words that are morphologically related to the query.
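As an illustration of stemming, the minimal sketch below conflates a few inflected forms to a common stem. The suffix list and the function names simple_stem and same_query are assumptions made for this example; the stemmer actually used by the demonstration systems is not specified on this page.

```python
# Hypothetical suffix-stripping stemmer, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")  # assumed suffix list, longest first


def simple_stem(word: str) -> str:
    """Lowercase the word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_query(a: str, b: str) -> bool:
    """Two query terms are treated as identical if their stems match."""
    return simple_stem(a) == simple_stem(b)
```

With this sketch, same_query("recruits", "recruiting") is true because both terms reduce to the stem recruit. A real stemmer handles many more cases than this four-suffix list.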
Retrieval Results
In the case of line retrieval, at each rank an entire line is
returned. Clicking on any of the word images in a line will take you
to the original page image.
When doing page retrieval, you will see a page thumbnail on the
left-hand side, and snippets on the right-hand side. For each
query term that you entered, there will be one snippet. A snippet
consists of the word image on the returned page that has the highest
probability of matching the corresponding query term. To the left and
right of the matching term, several words are returned to provide some
context. The snippets are intended to help the user decide whether an
entire page is relevant to the query. Clicking on the page
thumbnail will show an image of the entire page; clicking on a snippet
word will show the page image with a box marking the word that was
selected.
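The snippet construction described above can be sketched as follows. This is a minimal sketch, not the systems' actual code: it assumes that, for each query term, a page is represented as a list of match probabilities (one per word image, in reading order). The names build_snippet and page_snippets and the default context width of 3 words are hypothetical.

```python
from typing import Dict, List, Tuple


def build_snippet(match_probs: List[float], context: int = 3) -> Tuple[int, range]:
    """Return the index of the best-matching word image on a page and
    the range of word indices to display around it.

    match_probs[i] is the (assumed) probability that word image i on the
    page matches the query term.
    """
    if not match_probs:
        raise ValueError("page has no word images")
    # Word image with the highest match probability for this term.
    best = max(range(len(match_probs)), key=match_probs.__getitem__)
    # A few neighbouring words on each side, clipped to the page bounds.
    start = max(0, best - context)
    stop = min(len(match_probs), best + context + 1)
    return best, range(start, stop)


def page_snippets(per_term_probs: Dict[str, List[float]],
                  context: int = 3) -> Dict[str, Tuple[int, range]]:
    """Build one snippet per query term for a single returned page."""
    return {term: build_snippet(probs, context)
            for term, probs in per_term_probs.items()}
```

Each query term gets its own snippet, so a two-term query yields two snippets per returned page, each centred on the word image that best matches its term.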
Result Browsing
The ranked list is organized by displaying several ranks at
a time on one page. At the bottom of each page, you can navigate the
list by going forward and backward one page.
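The paging of the ranked list can be sketched as below. The function names and the page size of 10 are assumptions for illustration; the number of ranks the demonstration systems show per page is not stated here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def get_results_page(ranked: Sequence[T], page: int, per_page: int = 10) -> List[T]:
    """Return the part of the ranked list shown on the given 0-based page."""
    if page < 0 or per_page <= 0:
        raise ValueError("page must be >= 0 and per_page must be > 0")
    start = page * per_page
    return list(ranked[start:start + per_page])


def page_count(total: int, per_page: int = 10) -> int:
    """Number of result pages needed to show `total` ranked items."""
    return (total + per_page - 1) // per_page
```

Going forward or backward one page then amounts to calling get_results_page with page + 1 or page - 1, staying within 0 and page_count(len(ranked)) - 1.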
Demonstration Systems
We have built the following three prototype retrieval systems:
Line retrieval: small collection of 20
pages total. This collection was automatically segmented into
words. A manually corrected version of this dataset with
high-quality segmentation information can be found on the CIIR
download page
(click on Word Image Data Sets).
Page retrieval using Kullback-Leibler
scoring (will be available later): 1000 pages; the collection is
searched in real time, resulting in a query response time of about 50
seconds.
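For readers unfamiliar with Kullback-Leibler scoring, the sketch below ranks pages by the negative KL divergence between a query term distribution and each page's term distribution. It is a generic illustration under stated assumptions, not the systems' implementation: the distributions are plain dictionaries over a shared vocabulary, the floor value eps is hypothetical, and the way the systems estimate these distributions from word images is not shown.

```python
import math
from typing import Dict, List, Tuple


def kl_divergence(p: Dict[str, float], q: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) = sum over words w of p(w) * log(p(w) / q(w)).

    Words with p(w) == 0 contribute nothing; q(w) is floored at eps
    (an assumed constant) to avoid division by zero.
    """
    total = 0.0
    for word, p_w in p.items():
        if p_w > 0.0:
            q_w = max(q.get(word, 0.0), eps)
            total += p_w * math.log(p_w / q_w)
    return total


def rank_pages(query_model: Dict[str, float],
               page_models: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank page ids by negative KL divergence from the query model,
    so that pages whose distribution is closest to the query come first."""
    scored = [(page_id, -kl_divergence(query_model, model))
              for page_id, model in page_models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A page whose distribution places more probability on the query terms has a smaller divergence and therefore a higher rank.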
All of the above systems use the Cross-Media Relevance Modeling
approach. Below is a list of all the retrieval techniques that we are
currently investigating.
Retrieval Approaches
We are currently investigating three different approaches to handwritten historical document retrieval:
Word Spotting: An approach
based on word image matching.
We would like to thank the Library of Congress for providing digitized images of George Washington's original manuscripts.
People
Main contributions: Toni M. Rath, R. Manmatha and Victor Lavrenko.
N. Srimal and Jamie Rothfeder worked on the automatic segmentation code
that was used in our experiments.