UMASS Language Identification Tool for Long Noisy Texts - Version 1.0

Copyright (C) 2013 by the University of Massachusetts at Amherst

Written by I. Zeki Yalniz

Purpose:

The UMASS language identifier tool estimates the language distribution of long noisy texts, such as OCR output from scanned book collections. The current version supports "english", "french", "german", "spanish", "italian", "latin", "portuguese", "dutch", "danish" and "swedish". An additional "unknown" field accounts for text that is either written in a language not listed above or corrupted by OCR errors. In a nutshell, the tool counts how often the top 5 stopwords of each language occur in the text. These frequencies are then used to estimate, for each language, how much text would be needed to generate the observed stopwords, and each language is scored accordingly. Any remaining portion of the text whose source language cannot be determined is labeled as unknown. Note that this approach differs from letter n-gram based approaches. Please read the rest of this page before proceeding.
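
The idea described above can be illustrated with a minimal sketch. This is not the tool's actual implementation: the class name StopwordLanguageEstimator, the way the stopword counts are pooled (total observed count divided by total stopword probability), and the rescaling step applied when the estimates exceed the text length are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of the UMASS tool.
class StopwordLanguageEstimator {

    /**
     * tokens:        the preprocessed words of the input text
     * stopwordProbs: language -> (stopword -> assumed probability of that
     *                word in clean text of the language)
     * returns:       language -> estimated share of the text, plus "unknown"
     */
    static Map<String, Double> estimate(String[] tokens,
                                        Map<String, Map<String, Double>> stopwordProbs) {
        Map<String, Double> shares = new LinkedHashMap<String, Double>();
        double explained = 0.0;
        for (Map.Entry<String, Map<String, Double>> lang : stopwordProbs.entrySet()) {
            int observed = 0;      // stopword occurrences found in the text
            double probMass = 0.0; // expected stopword occurrences per token
            for (Map.Entry<String, Double> sw : lang.getValue().entrySet()) {
                probMass += sw.getValue();
                for (String t : tokens) {
                    if (t.equals(sw.getKey())) observed++;
                }
            }
            // Assumption: text in this language of size N yields about
            // N * probMass stopwords, so N is estimated as observed / probMass.
            double estimatedTokens = probMass > 0 ? observed / probMass : 0.0;
            double share = tokens.length > 0 ? estimatedTokens / tokens.length : 0.0;
            shares.put(lang.getKey(), share);
            explained += share;
        }
        // Assumption: stopwords shared between languages can push the total
        // above 1, in which case the shares are rescaled to sum to 1.
        if (explained > 1.0) {
            for (Map.Entry<String, Double> e : shares.entrySet()) {
                e.setValue(e.getValue() / explained);
            }
            explained = 1.0;
        }
        shares.put("unknown", 1.0 - explained);
        return shares;
    }
}
```

For example, if "the" is assumed to make up 10% of English text and occurs 5 times in a 100-word input, the sketch attributes about 50 words to English and labels the remaining half as unknown.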

This code is a prototype and still under development. It can be easily extended for other languages by

1 - writing/using a proper TextPreprocessor object and

2 - learning a stopword list along with term probabilities for the intended languages.
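
Step 2 above could be approached as follows. This is only a sketch under stated assumptions: the class StopwordLearner is hypothetical, the input is assumed to be an already preprocessed token list from clean text in the target language, and a term's probability is taken to be its relative frequency in that sample. How the tool itself learned its lists is not described here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the UMASS tool.
class StopwordLearner {

    /** Returns the k most frequent terms with their relative frequencies. */
    static Map<String, Double> learn(List<String> tokens, int k) {
        // Count how often each term occurs in the sample.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String t : tokens) {
            Integer c = counts.get(t);
            counts.put(t, c == null ? 1 : c + 1);
        }
        // Sort by descending count; ties are broken alphabetically.
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        // Keep the top k terms; probability = count / total number of tokens.
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            Map.Entry<String, Integer> e = entries.get(i);
            result.put(e.getKey(), e.getValue() / (double) tokens.size());
        }
        return result;
    }
}
```

Calling learn(tokens, 5) on a large clean sample would give a top-5 list comparable in shape to the one the tool is described as using.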

IMPORTANT NOTICE:

This software was developed at the Center for Intelligent Information Retrieval (CIIR), University of Massachusetts Amherst.  Basic research to develop the software was funded by the CIIR and the National Science Foundation while its application was supported by a grant from the Mellon Foundation. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by  the Free Software Foundation, either version 3 of the License, or  (at your option) any later version.  This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.  You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>.

Citation Information:

If you make use of this tool, please cite this web page as given below:

@misc{yalniz13langid,
 Author = {I. Z. Yalniz and R. Manmatha},
 Title = {{UMASS} Language Identification Tool for Long Noisy Texts},
 Year  = {2013},
 Howpublished = {\url{http://ciir.cs.umass.edu/downloads/language-identification/}}
}

How to obtain the source code?

Please contact

I.   I. Zeki Yalniz: zeki[at]cs[dot]umass[dot]edu    OR

II.  R. Manmatha: manmatha[at]cs[dot]umass[dot]edu    OR

III. downloads[at]ciir[dot]cs[dot]umass[dot]edu

A download link will be provided. 

How to compile:

Inside the source folder, type the following command to compile the code (tested for Java version 1.6):

"javac *.java"

How to use the tool?

A - COMMAND LINE INTERFACE:

USAGE: java LanguageIdentifierTool <inputFileORfolderName>

PARAMETERS:

<inputFileORfolderName>

                Full path of the input text file or folder. If the input is a folder, all the files in the folder are processed RECURSIVELY.

SAMPLE USAGE(s):

1 - "java LanguageIdentifierTool /desktop/myfolder/"

    runs the language identifier on all the books residing in the designated folder RECURSIVELY (screen output).

2 - "java LanguageIdentifierTool /desktop/myfolder/myfile.txt"

    runs the language identifier on a single file (screen output).

SAMPLE SCREEN OUTPUT:

/desktop/myfolder/myfile.txt   eng 0.7%              ger 97.4%            fre 0.1% ...
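
If the screen output needs to be consumed by another program, a line like the one above could be parsed roughly as follows. This is a sketch only: the class LanguageOutputParser is hypothetical, and it assumes that fields are separated by whitespace, that the file path contains no spaces, and that every language code is followed by a value ending in "%". The exact output format should be checked against the tool itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of the UMASS tool.
class LanguageOutputParser {

    /** Maps each language code on the line to its percentage. */
    static Map<String, Double> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        // parts[0] is the file path; the rest are (code, "NN.N%") pairs.
        for (int i = 1; i + 1 < parts.length; i += 2) {
            String pct = parts[i + 1];
            if (!pct.endsWith("%")) break; // stop at anything unexpected
            result.put(parts[i], Double.parseDouble(pct.substring(0, pct.length() - 1)));
        }
        return result;
    }
}
```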

 

B - JAVA API

 First, initialize the language identifier tool as follows:

 String langs[] = {"english", "french", "german", "spanish", "italian", "latin", "portuguese", "dutch", "danish", "swedish"};

 LanguageIdentifierTool lid = new LanguageIdentifierTool(langs);

 1 - To run the language identifier on a single text file

 lid.processFiles(fullPathOfTheTextFile);

 2 - To run the language identifier for all the files in a given folder

 lid.processFiles(fullPathOfTheFolder);

 3 - To run the language identifier on a list of text files residing in the "workingFolderFullPath" folder. The "fileList" is a text file containing one file path per line. File paths are relative to the "workingFolderFullPath".

 lid.processFilesInTheList(workingFolderFullPath, fileList);
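
 The "fileList" file used in case 3 can be written by hand or generated. The helper below is a hypothetical sketch (the class FileListBuilder is not part of the tool) that collects all ".txt" files under a working folder and writes their paths, relative to that folder, one per line. The ".txt" filter and the UTF-8 encoding are assumptions, and the argument types expected by processFilesInTheList should be checked in the source code.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of the UMASS tool.
class FileListBuilder {

    /** Returns the sorted paths of all .txt files, relative to workingFolder. */
    static List<String> relativeTextFiles(File workingFolder) {
        List<String> out = new ArrayList<String>();
        collect(workingFolder, workingFolder, out);
        Collections.sort(out);
        return out;
    }

    private static void collect(File root, File dir, List<String> out) {
        File[] children = dir.listFiles();
        if (children == null) return; // not a directory or not readable
        for (File c : children) {
            if (c.isDirectory()) {
                collect(root, c, out);
            } else if (c.getName().endsWith(".txt")) {
                // Relativize against the working folder, e.g. "sub/b.txt".
                out.add(root.toURI().relativize(c.toURI()).getPath());
            }
        }
    }

    /** Writes one relative path per line to the given list file. */
    static void write(List<String> paths, File fileList) throws IOException {
        PrintWriter w = new PrintWriter(fileList, "UTF-8");
        try {
            for (String p : paths) w.println(p);
        } finally {
            w.close();
        }
    }
}
```

 Storing the generated list outside the working folder keeps it from being picked up as an input file on a later run.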