This web page contains reference material for:

1 - THE RECURSIVE TEXT ALIGNMENT TOOL Version 1.1 (new)

2 - RETAS OCR EVALUATION DATASET

1 - THE RECURSIVE TEXT ALIGNMENT TOOL Version 1.1 (new)

Purpose:

The Recursive Text Alignment Scheme (proposed by Yalniz and Manmatha, ICDAR'11) is designed to efficiently align long noisy texts despite additional and/or missing text. It has been used for estimating the optical character recognition (OCR) accuracy of scanned books. It is provided here for research purposes under the GNU General Public License v3.0. Please read the rest of this page before proceeding.

IMPORTANT NOTICE:

This software was developed at the Center for Intelligent Information Retrieval (CIIR), University of Massachusetts Amherst. Basic research to develop the software was funded by the CIIR and the National Science Foundation while its application was supported by a grant from the Mellon Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Citation Information:

If you make use of the text alignment tool, please cite the following paper:

Ismet Zeki Yalniz, R. Manmatha: A Fast Alignment Scheme for Automatic OCR Evaluation of Books. ICDAR 2011: 754-758

How to obtain the source code?

Please contact downloads[at]ciir[dot]cs[dot]umass[dot]edu

A download link will be provided.

How to compile:

Inside the source folder, type the following command to compile the code (tested for Java version 1.6):

"javac *.java"

How to use the tool?

A - COMMAND LINE INTERFACE:

USAGE: java RecursiveAlignmentTool <refFilename> <candFilename> <outputFilename> -opt <configFile>

<refFilename> is the reference (ground truth) text filename

<candFilename> is the candidate (OCR output) text filename

<outputFilename> is the filename for the alignment output (optional)

<configFile> must contain the following arguments, one per line:

ignoredChars=<listOfChars>

alignmentFormat=<COLUMN|LINES> (default is LINES)

level=<W|C> (alignment level: W for word level, C for character level. Default is W.)

The screen output format is:

Example command: java RecursiveAlignmentTool texts/adventuresofhuck_ground_truth.txt texts/adventuresofhuck00clemrich_OCR_output.txt texts/alignmentOutput.txt -opt config.txt

An example configuration file includes the three lines below:

------------------------------

level=CHAR

alignmentFormat=LINES

ignoredChars=,.'";:!?()[]{}<>`-+=/\$@%#|&^*_~

------------------------------
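The sketch below shows one way to read such a configuration file from your own Java code. It is an illustration only, not the parser used inside the tool: the class name RetasConfigSketch and the method parse are hypothetical. The one assumption it makes is that only the first '=' on a line separates the key from the value, which matters because the ignoredChars value may itself contain '=' and '\'.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the RETAS distribution.
public class RetasConfigSketch {

    // Parses "key=value" lines into an ordered map.
    // ASSUMPTION: only the FIRST '=' separates key from value, so a value
    // such as the ignoredChars list is kept verbatim, including '=' and '\'.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> settings = new LinkedHashMap<String, String>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // skip blank lines, dashed separators, and lines with no key
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1); // not trimmed: characters are significant
            settings.put(key, value);
        }
        return settings;
    }

    public static void main(String[] args) {
        // The three lines of the example configuration file above.
        List<String> lines = Arrays.asList(
            "level=CHAR",
            "alignmentFormat=LINES",
            "ignoredChars=,.'\";:!?()[]{}<>`-+=/\\$@%#|&^*_~");
        for (Map.Entry<String, String> e : parse(lines).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

java.util.Properties is deliberately not used here, because its load method treats the backslash as an escape character and would alter the ignoredChars list.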

B - RETAS JAVA API

B-1) This method returns the alignment output in an ArrayList. It does not produce any text output.

ArrayList<AlignedSequence> seq = RecursiveAlignmentTool.processSingleJob_getAlignedSequence(

String gtFile, // input text 1: ground truth text

String candFile, // input text 2: OCR output text (or the candidate text)

String ignoredChars, // The list of characters to be ignored

String level ); // alignment level: "c" or "w" (for character and word level alignment respectively)

B-2) This function produces the alignment at the word or character level and writes it to a text output file. The output file can be written in either of two formats. One can also choose the characters to be ignored during alignment.

Stats st = RecursiveAlignmentTool.processSingleJob(

gtFile, // (String) input text 1: ground truth text

candFile, // (String) input text 2: OCR output text

alignmentLevel, // (String) The level of alignment: 'c' for character level and 'w' for word level alignment.

outputFormat, // (String) The format of the alignment output: 'column' or 'line'

ignoredChars, // (String) The list of characters to be ignored

alignOutputFile // (String) The filename for the alignment output

);

"Stats" object contains the total number of matching characters/words and the total number of chars/words in the input texts. OCR accuracy is defined to be the total number of matching chars/words divided by the total number of chars/words in the ground truth file. One can calculate OCR accuracy by calling the getOCRaccuracy() method as: double ocrAccuracy = st.getOCRaccuracy();

B-3) If the number of matching chars/words is the only concern, then this method is faster.

Stats sts[] = RecursiveAlignmentTool.processSingleJob_getAlignmentStatsOnly(

gtFile, // (String) input text 1: ground truth text

candFile, // (String) input text 2: OCR output text

ignoredChars); // (String) The list of characters to be ignored

sts[0] contains the word level alignment statistics

sts[1] contains the character level alignment statistics
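Both the word level and the character level statistics lead to an accuracy figure through the definition given in B-2: the number of matching units divided by the number of units in the ground truth. The stand-alone sketch below only restates that arithmetic. The class OcrAccuracySketch and its method are hypothetical and are not part of the RETAS API, and the counts in main are made-up numbers for illustration.

```java
// Hypothetical helper, not part of the RETAS distribution.
public class OcrAccuracySketch {

    // accuracy = matching units / units in the ground truth text,
    // where a unit is a word or a character depending on the alignment level.
    public static double accuracy(long matchingUnits, long groundTruthUnits) {
        if (groundTruthUnits <= 0) {
            throw new IllegalArgumentException("ground truth must contain at least one unit");
        }
        if (matchingUnits < 0 || matchingUnits > groundTruthUnits) {
            throw new IllegalArgumentException("matches must lie between 0 and the ground truth size");
        }
        return (double) matchingUnits / groundTruthUnits;
    }

    public static void main(String[] args) {
        // Made-up counts: 950 matching words out of 1000 ground truth words.
        System.out.println("word accuracy = " + accuracy(950, 1000));
    }
}
```

Note that the denominator is the ground truth size, not the OCR output size, so extra text in the OCR output does not raise the score.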

2 - RETAS OCR EVALUATION DATASET

Purpose:

The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) was created to evaluate the optical character recognition (OCR) accuracy of scanned books. It is provided here for research purposes. The dataset is extracted from books in Project Gutenberg and the Internet Archive. Please read the rest of this page before proceeding.

IMPORTANT NOTICE:

According to the Project Gutenberg and Internet Archive websites, the books are out of copyright in the United States. This may not be the situation in a particular country so you are advised to check this and follow the law of your country. If you just want to read the book you are better off looking at their websites where they have much nicer interfaces for doing this. We do not know the specifics of the OCR and preprocessing they use.

THIS DATA IS PROVIDED BY THE UNIVERSITY OF MASSACHUSETTS AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DATA, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Characteristics:

The dataset consists of:

- The OCR output from 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website (www.archive.org)

- The corresponding ground truth text for each book. They are obtained from the Project Gutenberg website (www.gutenberg.org)

- Word and char level alignment outputs. Each OCR output is aligned with the ground truth text using the REcursive Text Alignment Scheme (RETAS).

- Estimated OCR accuracies.

For the purposes of OCR evaluation, any extra material (e.g. Project Gutenberg and Internet Archive disclaimers) at the front or back is removed so that the books are as close to the ground truth as possible.

Citation Information:

If you make use of the RETAS dataset and/or the text alignment tool, please cite the following paper:

Ismet Zeki Yalniz, R. Manmatha: A Fast Alignment Scheme for Automatic OCR Evaluation of Books. ICDAR 2011: 754-758

How to obtain the dataset?

Please contact downloads[at]ciir[dot]cs[dot]umass[dot]edu

A download link will be provided.

Dataset Information:

The data is contained in three folders:

· “\IA_texts” includes the OCR outputs of all scanned books, categorized by language (UTF-8).

· “\GUT_texts” includes the ground truth texts of scanned books, categorized by language (UTF-8).

· “\alignment_outputs” includes the alignment outputs, categorized by language (UTF-8).

Lists of filename pairs <OCR OUTPUT, GROUND TRUTH> are also included for each scanned book in the dataset. They are categorized by language.

· eng_pairs.txt

· fre_pairs.txt

· ger_pairs.txt

· spa_pairs.txt

Estimated OCR accuracies are also included:

· OCR_accuracy_list.xls

There are four alignment outputs for each book pair:

· <OCRfilename>-<GTfilename>.wcol : word level alignment output in the two-column format.

· <OCRfilename>-<GTfilename>.wline : word level alignment output in the line format.

· <OCRfilename>-<GTfilename>.ccol : char level alignment output in the two-column format.

· <OCRfilename>-<GTfilename>.cline : char level alignment output in the line format.

In the line format, each line contains 20 words or 100 characters. The OCR output and ground truth lines are indicated by "OCR:" and "GT :" tags respectively. The line format is more convenient for manual inspection of the alignment output.

In the two-column format, the OCR output is the first column. Words/characters which do not align with any word/character in the other sequence are aligned with an empty string. In the alignment output, empty strings and empty characters are indicated by "null" and "@" respectively.
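As an illustration of how the two-column format can be consumed, the sketch below counts matches in word level (.wcol) rows. It rests on assumptions that this page does not confirm: that the two columns of a row are separated by whitespace, and that a match means the two words are exactly equal. The class TwoColumnSketch is hypothetical and the sample rows are invented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the RETAS distribution.
public class TwoColumnSketch {

    // Marker for "no aligned word" in the word level output, as described above.
    static final String GAP = "null";

    // Returns {matches, groundTruthWords, ocrWords} for word level rows.
    // ASSUMPTION: each row is "<ocrWord><whitespace><gtWord>".
    // ASSUMPTION: a match is exact string equality of the two words.
    public static int[] count(List<String> rows) {
        int matches = 0;
        int gtWords = 0;
        int ocrWords = 0;
        for (String row : rows) {
            String[] cols = row.trim().split("\\s+");
            if (cols.length != 2) {
                continue; // skip blank or unexpected rows
            }
            boolean hasOcr = !GAP.equals(cols[0]); // first column is the OCR output
            boolean hasGt = !GAP.equals(cols[1]);  // second column is the ground truth
            if (hasOcr) ocrWords++;
            if (hasGt) gtWords++;
            if (hasOcr && hasGt && cols[0].equals(cols[1])) matches++;
        }
        return new int[] { matches, gtWords, ocrWords };
    }

    public static void main(String[] args) {
        // Invented rows: one OCR error, one missing word, one extra word.
        List<String> rows = Arrays.asList(
            "the\tthe", "qnick\tquick", "null\tbrown", "fox\tfox", "jumps\tnull");
        int[] c = count(rows);
        // prints: matches=2 gtWords=4 ocrWords=4
        System.out.println("matches=" + c[0] + " gtWords=" + c[1] + " ocrWords=" + c[2]);
    }
}
```

Character level (.ccol) files would need different handling, since a whitespace character may itself be an aligned unit and the gap marker there is "@".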

Last updated: May 22, 2017