OCTO OCR ERROR CORRECTION TOOL - Version 1.0

Purpose:

The OCTO OCR Error Correction Tool efficiently aligns multiple OCR outputs in order to correct character recognition errors. The implementation can handle very long texts containing millions of words or characters. It does so by aligning each pair of input texts using the Recursive Text Alignment Scheme (RETAS), introduced by Yalniz and Manmatha in 2011, and then combining the pairwise alignments into a single multiple alignment. In this version, the number of input texts is restricted to three. The output is the error-corrected OCR text. An additional parameter specifies a set of characters to ignore in the input texts, such as annotation marks and/or punctuation. Note that this is only a prototype: the code can be sped up or improved further in several ways.
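
As a rough illustration of how a multiple alignment of three texts can yield a corrected output, the sketch below applies per-column majority voting to three strings that are assumed to be already aligned to equal length. The voting rule, the class name MajorityVoteSketch and the gap symbol are assumptions made for this example; they are not taken from the OCTO source code, which may combine the alignments differently.

```java
// Illustrative sketch only: per-column majority voting over three
// already-aligned strings. NOT taken from the OCTO source code.
class MajorityVoteSketch {
    // Hypothetical gap symbol marking a column where a text has no character.
    static final char GAP = '\u0000';

    static String vote(String a, String b, String c) {
        if (a.length() != b.length() || b.length() != c.length()) {
            throw new IllegalArgumentException("aligned inputs must have equal length");
        }
        StringBuilder out = new StringBuilder(a.length());
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            char z = c.charAt(i);
            // If texts 2 and 3 agree, they form a majority. Otherwise text 1
            // either matches one of them (majority) or all three differ, in
            // which case text 1 is kept as an arbitrary fallback.
            char winner = (y == z) ? y : x;
            if (winner != GAP) {
                out.append(winner); // gaps are dropped from the output
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each input contains at most one error per column.
        System.out.println(vote("tbe cat", "the cat", "the cot"));
    }
}
```

With three inputs, any error that occurs in only one text at a given column is outvoted by the other two, which is one reason an odd number of inputs is convenient for this kind of scheme.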

IMPORTANT NOTICE:

This software was developed at the Center for Intelligent Information Retrieval (CIIR), University of Massachusetts Amherst. Basic research to develop the software was funded by the CIIR and the National Science Foundation while its application was supported by a grant from the Mellon Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

This program is free software: you can redistribute it and/or modify it under the terms of the BSD 3-Clause License. You should have received a copy of the license along with this program. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Citation Information:

If you make use of this text alignment tool, please cite the following papers:

[1] I. Zeki Yalniz, R. Manmatha: A Fast Alignment Scheme for Automatic OCR Evaluation of Books. ICDAR 2011: 754-758

[2] D. Wemhoener, I. Zeki Yalniz, R. Manmatha: Creating an Improved Version Using Noisy OCR from Multiple Editions. ICDAR 2013

How to obtain the source code?

Please contact

(I) R. Manmatha: manmatha[at]cs[dot]umass[dot]edu, or

(II) downloads[at]ciir[dot]cs[dot]umass[dot]edu

A download link will be provided.

How to compile:

Inside the source folder, run the following command to compile the code (tested with Java version 1.6):

"javac *.java"

How to use the tool?

A - COMMAND LINE INTERFACE:

USAGE: java OCRerrorCorrector <ocrOutputFilename1> <ocrOutputFilename2> <ocrOutputFilename3> <correctedOCRoutput> [<ignoredChars>]

<ocrOutputFilename1> is the input filename for the OCR text 1

<ocrOutputFilename2> is the input filename for the OCR text 2

<ocrOutputFilename3> is the input filename for the OCR text 3

<correctedOCRoutput> is the output filename for the error corrected text

<ignoredChars> (optional) is a Java string of characters to be ignored

Example command:

java OCRerrorCorrector ../texts/1.txt ../texts/2.txt ../texts/3.txt ../texts/correctedText.txt ".,:1234567890"
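
The example command above passes ".,:1234567890" as the ignored character set. A minimal sketch of what ignoring such a set could mean for a single string is given below. The class IgnoredCharsSketch and its strip method are hypothetical names used only for illustration; how OCTO applies the set internally is not documented here.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: removes every character that appears in an
// "ignored" set from a string. NOT taken from the OCTO source code.
class IgnoredCharsSketch {
    static String strip(String text, String ignoredChars) {
        // Build a lookup set from the characters of the ignoredChars string.
        Set<Character> ignored = new HashSet<Character>();
        for (char ch : ignoredChars.toCharArray()) {
            ignored.add(ch);
        }
        // Copy over only the characters that are not in the set.
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (!ignored.contains(ch)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("Chapter 12: Hello, world.", ".,:1234567890"));
    }
}
```

Note that whitespace is not part of the set in the example command, so spaces around removed characters remain in the result.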

B - JAVA API

Initialization:

OCRerrorCorrector corrector = new OCRerrorCorrector(
    String ocr1,            // OCR text output file path 1
    String ocr2,            // OCR text output file path 2
    String ocr3,            // OCR text output file path 3
    String ignoredChars );  // The set of characters to be ignored in the input and output files

USE CASE 1 - how to generate the multiple alignment itself

ArrayList<MultipleAlignedSequence> alignment = corrector.align();

USE CASE 2 - how to print the multiple alignment result into a text file

corrector.printAlignment(
    String outputFilenameFullPath,
    boolean colFormat );  // 'colFormat' indicates the format of the output file. False is the default value.

USE CASE 3 - how to generate error corrected OCR output

corrector.printCorrectedOCRoutput( String outputFilenameFullPath );

NOTE: The alignment is case sensitive. One can obtain a case-insensitive alignment by preprocessing the input documents accordingly.
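
One possible way to do this preprocessing is to write a lower-cased copy of each input file and pass the copies to the tool. The sketch below assumes UTF-8 encoded input and Java 7 or later (for java.nio.file); the class LowercasePreprocessSketch and its method names are hypothetical and not part of OCTO.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Illustrative sketch only: writes a lower-cased copy of a text file so that
// a case-sensitive aligner effectively compares the texts case-insensitively.
class LowercasePreprocessSketch {
    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    static String normalizeCase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Assumption: the files are UTF-8 encoded; adjust the charset if they are not.
    static void lowercaseCopy(Path in, Path out) throws IOException {
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, normalizeCase(text).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java LowercasePreprocessSketch <input> <output>");
            return;
        }
        lowercaseCopy(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because the inputs are lower-cased before alignment, the corrected output will also be lower-case; the original capitalization would have to be restored separately if it is needed.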

Last updated: October 8, 2013