The following are some pointers on how to use the handwritten manuscript images
and related files in this archive:

The legal stuff first:
- this data is for NON COMMERCIAL USE ONLY! It is solely intended for research
  purposes.
- you may not share this data set without our permission. Instead, please point
  people to the CIIR's download site http://ciir.cs.umass.edu/downloads/ (then
  click on the button next to "Word Image Data Sets").
- by using this data set you agree to the copyright notice contained in the
  file copyrightnotice.txt in this archive.
("Our"/"we" in this document refers to the Center for Intelligent Information
Retrieval at the University of Massachusetts Amherst.)

This archive contains a set of 20 page images from the George Washington
collection at the Library of Congress. All images are in TIFF format at 300 dpi
and were scanned from microfilm. We have created manual segmentation
coordinates that allow you to extract images of all words on these pages. There
is also ground-truth data, which provides annotations/labels for the word
images.

If you are using the data set in a publication, please cite the following
article:
V. Lavrenko, T. M. Rath and R. Manmatha: Holistic Word Recognition for Handwritten Historical Documents. In: Proc. of the Int'l Workshop on Document Image Analysis for Libraries (DIAL), Palo Alto, CA, January 23-24, 2004, pp. 278-287.

The archive contains the following files:
- *.tif: images of handwritten pages, provided to us by the Library of Congress
- *_boxes.txt: segmentation coordinates for extracting images of words.
- annotations.txt: contains manually assigned annotations for each word image.
- file_order.txt: order in which you should process the files when using
  annotations.txt and the segmentation files *_boxes.txt.
- copyrightnotice.txt

Here is some information about file formats:
- *_boxes.txt: each line of this file contains coordinates of a single word
  image:
  x1 x2 y1 y2 liney1 liney2
  x1 x2 y1 y2 liney1 liney2
  ...

  x1/x2/y1/y2 are the coordinates of the word image (note the order: both x
  values come before both y values), and liney1/liney2 are the coordinates of
  the line the word occurs in. All coordinates are stored in relative notation,
  that is, in the range [0..1]: 0 corresponds to the top or left, and 1
  corresponds to the bottom or right. Only the word image coordinates were
  manually corrected (on downscaled versions of the images in this archive);
  the line coordinates were generated automatically.

  If you are using Matlab, you can use the 'seg_boxes' function contained in
  this archive. It was tested on UNIX; compatibility with other operating
  systems is unknown.
- annotations.txt: each line in the file contains an annotation for a single
  word image, provided by a human annotator. This file can be mapped to the
  *_boxes.txt files line by line (ignoring the first header line), provided
  the *_boxes.txt files are processed in the order given in file_order.txt.
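
For readers not using Matlab, the following Python sketch illustrates one way
to read the six-field lines described above and turn the relative word
coordinates into pixel coordinates. It is not a port of 'seg_boxes'. The names
WordBox, parse_boxes and to_pixels are hypothetical, and two details are
assumptions of this sketch rather than facts about the archive: lines that do
not hold exactly six numbers are skipped, and pixel values are obtained by
rounding to the nearest integer.

```python
from typing import List, NamedTuple, Tuple


class WordBox(NamedTuple):
    # Field order follows the file format: x1 x2 y1 y2 liney1 liney2
    x1: float
    x2: float
    y1: float
    y2: float
    liney1: float
    liney2: float


def parse_boxes(lines: List[str]) -> List[WordBox]:
    """Parse the lines of one *_boxes.txt file into WordBox records.

    Assumption: any line without exactly six numeric fields (for example a
    header line) is skipped.
    """
    boxes = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue
        try:
            boxes.append(WordBox(*(float(f) for f in fields)))
        except ValueError:
            # Six fields, but not all of them numeric: skip this line too.
            continue
    return boxes


def to_pixels(box: WordBox, width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the word image in pixels.

    width and height are the pixel dimensions of the page image. Rounding to
    the nearest integer is an assumption of this sketch.
    """
    left = int(round(box.x1 * width))
    right = int(round(box.x2 * width))
    top = int(round(box.y1 * height))
    bottom = int(round(box.y2 * height))
    return left, top, right, bottom
```

The returned tuple is ordered (left, top, right, bottom), so it can be passed
to an image library's crop routine that expects that order, with width and
height taken from the corresponding *.tif page image.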

The annotations and segmentation files have not been double-checked. It is
quite likely that you will find segmentation or annotation mistakes. In that
case, please send email to trath@cs.umass.edu, so we can update the data set.

Scanned images of the original documents were provided to us by the Library of
Congress, and this project was funded in part by the Center for Intelligent
Information Retrieval and in part by the National Science Foundation under
grant number IIS-9909073.

----------
Toni M. Rath (trath@cs.umass.edu) 08/2004
