QALP::IdentifierSplitter version 0.01
=====================================

IdentifierSplitter contains several functions which allow you to split
compound source code identifiers (or any sequence of concatenated words)
into their constituent parts. For example, `spongebobsquarepants' will be
split into `sponge_bob_square_pants'.

A greedy algorithm is used which attempts to find the longest prefixes and
suffixes of an identifier which are found in a dictionary (this is supplied
by the user, see DEPENDENCIES below). There is no probabilistic element in
this algorithm, so an identifier such as `thenewestone' will be split as
`then_ewe_stone' instead of `the_newest_one'. The first one has a 3-, 4-,
and 5-letter word whereas the second one has two 3-letter words and one
6-letter word; since the first has more `longer' words, it is the preferred
split. The second split makes more sense and is, in fact, the correct
split, but it has only one `long' word, so it loses.

In general, the heart of IdentifierSplitter is free of language
restrictions (it should process French the same as English). However, there
is one function that applies only to languages that use 's' for the plural.
This function is called "filter_plurals" and is itself a quick and dirty
hack to find instances when an identifier is improperly split due to a
term's plural not occurring in the dictionary. It only works in certain
cases. If you would like to not use the 'filter_plurals' function, then
pass "-noFilter" to 'isplit_basic.pl' or 'isplit_full.pl' as the FIRST
argument.

SCRIPT CONTENTS

There are three scripts/modules:

QALP::IdentifierSplitter.pm

   This is the actual Perl module which contains all of the functions for
   the splitting algorithm and file processing. To use it, include the
   line:

      use QALP::IdentifierSplitter ':all';

   at the top of your Perl script. This file is very messy; however, it was
   1) a prototype and 2) my (Henry's) very first Perl program.
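
As an illustration of the greedy idea described at the top of this README,
here is a minimal sketch in Python. It is not the module's Perl code, and it
simplifies the algorithm: it only peels off the longest dictionary prefix at
each step, whereas the module considers both prefixes and suffixes. The
function name greedy_prefix_split, the toy WORDS set, and the fallback of
emitting a single character when no dictionary word matches are all
assumptions made for this example.

```python
# Toy dictionary; a real run would use a much larger, user-supplied word list.
WORDS = {
    "the", "then", "new", "newest", "ewe", "one", "tone", "stone",
    "sponge", "bob", "square", "pants",
}


def greedy_prefix_split(identifier, words):
    """Split `identifier` by repeatedly taking the longest dictionary prefix.

    Simplified illustration: prefix-only, no scoring, no plural handling.
    """
    parts = []
    rest = identifier
    while rest:
        # Try the longest candidate prefix first, then shorter ones.
        for end in range(len(rest), 0, -1):
            if rest[:end] in words:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No dictionary word starts here: emit one character and move on
            # (an assumption of this sketch, so the loop always terminates).
            parts.append(rest[0])
            rest = rest[1:]
    return parts


if __name__ == "__main__":
    print("_".join(greedy_prefix_split("spongebobsquarepants", WORDS)))
    # sponge_bob_square_pants
    print("_".join(greedy_prefix_split("thenewestone", WORDS)))
    # then_ewe_stone
```

With this toy dictionary, `thenewestone' comes out as `then_ewe_stone'
because "then" is a longer prefix than "the", even though `the_newest_one'
is the split a human would choose. That is the trade-off of a purely greedy,
non-probabilistic approach.
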

   While I have packaged it into a Perl module, it was originally one
   stand-alone script. Therefore, no object-oriented programming practices
   were followed :(. I suggest using the two accompanying scripts until a
   later release makes it easier to interact with the main module.

   To view the documentation for the module, at a command prompt type:

      perldoc QALP::IdentifierSplitter

isplit_basic.pl

   This script is the simpler of the two and will take an input file of
   identifiers (separated by whitespace -- it doesn't matter what kind or
   how much!) and split them using the dictionar[y/ies] and stoplist[s]
   given at runtime. See "DEPENDENCIES" below or the script's perldoc page
   for more on the dictionaries and stoplists.

   Its output is:

   To view the documentation for this script, at a command prompt, type:

      perldoc isplit_basic.pl

   The usage message can also be viewed by typing "isplit_basic.pl" with no
   arguments.

isplit_full.pl

   This script requires several extra arguments, such as the name of the
   program from which the identifiers were extracted and the start and
   release date of the project. These demographics are strictly for the
   output and are not used during the actual identifier splitting.

   The input file to isplit_full.pl is slightly different from its 'basic'
   counterpart: the line number on which the identifier was located in the
   source program is expected at the beginning of each line in the input.
   Every white-space delimited column past that is assumed to be an
   identifier which was found on that line. Multiple lines in input can
   have the same source line number without causing problems. All source
   lines on which an identifier appeared are printed with that identifier's
   split in the output.

   Its output is (in order of their appearance on each line):

      (guessed using the extension of 'file'.)
      (not the input file!)
      (...never used...)
      (referring to the identifier on the line)
      (i.e., hard parts by camelCasing or _)
      (with splits represented by '_'s)
      (separated by commas)

   Whew...how about that output?? Well, this is a legacy..."feature"...
   This is information we used for the first research project for which
   this splitter was required. It involved a longitudinal study of
   identifiers in programs and their various versions. In fact, much of the
   module code reflects this and that is one reason it is such a mess...

   To view the documentation for this script, at a command prompt, type:

      perldoc isplit_full.pl

INSTALLATION

This gives a quick overview of the installation process. See INSTALL for
more information, including what environment variables need to be updated.

To install this module in the default location, type the following:

   perl Makefile.PL
   make
   sudo make install

To install in a local directory, type:

   perl Makefile.PL PREFIX=path/to/install/dir
   make
   make install

DEPENDENCIES

While there are no actual dependencies to use the IdentifierSplitter module
or the scripts that come with it, it is helpful to have some sort of
dictionary of words (i.e. a list of valid words, such as the
'english.words' file that comes bundled with most Linux distributions) as
well as programming language stoplists (i.e. terms that are a part of a
programming language, such as "double" or "int"). Both types of lists
should be in the following format:

   word1 ; description
   word2
   word3
   word4 ; description
   ...

such that the "; description" portions are optional.

The easiest way to use these is to create two directories:

   dict/
   stp/

In the dict directory, include the following files (even if they are
empty):

   words      <-- this can be linked to the system dict
                  (e.g. /usr/share/dict/words)
   prg_terms  <-- programming terms/acronyms/etc.
   abbrevs    <-- a list of common abbreviations/acronyms

In stp, you may want to include a list of C, C++, and/or Java reserved
terms. These can be listed any way you would like.
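
To make the word-list format above concrete, here is a minimal sketch in
Python of how such files could be read. It is only an illustration and not
how the Perl module does it. It assumes one entry per line, treats
everything after the first ';' as the optional description, and skips blank
lines. The names parse_word_list and load_word_lists, and the choice of
UTF-8 for reading files, are assumptions made for this example.

```python
from pathlib import Path


def parse_word_list(text):
    """Return the set of words from text made of 'word ; description' lines."""
    words = set()
    for line in text.splitlines():
        # Everything after the first ';' is the optional description.
        word = line.split(";", 1)[0].strip()
        if word:  # skip blank lines and lines holding only a description
            words.add(word)
    return words


def load_word_lists(directory):
    """Read every regular file in `directory` (e.g. dict/ or stp/) into one set."""
    words = set()
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            words |= parse_word_list(text)
    return words
```

For example, parse_word_list("int ; C integer type\ndouble\n") yields the
set containing "int" and "double", and load_word_lists("stp") would merge
every file found in a stop-list directory of that name.
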
All files in the stop-list directory will be read.

HISTORY

0.01  isplit_full.pl and isplit_basic.pl were created on 28 Dec 2007. The
      functions in QALP::IdentifierSplitter were created in August 2005,
      and packaged in a Perl module on 28 Dec 2007. Released 14 Feb 2008.

CONTACT

If you have questions or comments, please contact Henry Feild at .

SEE ALSO

perldoc isplit_full.pl, perldoc isplit_basic.pl,
perldoc QALP::IdentifierSplitter

http://www.cs.loyola.edu/~hfeild/downloads.html#qalp

COPYRIGHT AND LICENCE

Copyright (C) 2007 by David Binkley, Henry Feild, and Dawn Lawrie

This program is part of Identifier Splitter.

Identifier Splitter is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along
with this program. If not, see .