README for Array::Suffix
=========================

Array::Suffix is a Perl module that finds variable length ngrams in
large corpora using suffix arrays.

The module provides an easy to use interface for extracting ngrams from
a corpus. Its basic functionality includes:

    * returning variable length ngrams
    * applying a stop list
    * applying a frequency cutoff
    * applying a remove cutoff

INSTALLATION

There are multiple ways to install this package.

1. You can use CPAN.pm to install Array::Suffix. To install, type the
   following:

    perl -MCPAN -e 'install Array::Suffix'

2. Or you can install it yourself. To install this module, type the
   following:

    perl Makefile.PL
    make
    make test
    make install

DEPENDENCIES

This module does not require any other modules or libraries.

PROGRAM : array-suffix-driver.pl

This program takes as input a flat ASCII text file and outputs all
Ngrams, or token sequences of length 'n', where the value of 'n' can be
decided by the user, together with the frequency of each Ngram.

Using array-suffix-driver.pl

The most basic way of running this program is the following:

    % array-suffix-driver.pl output.txt input.txt

where input.txt is the input text file in which to find the Ngrams and
output.txt is the output file into which array-suffix-driver.pl will put
all the Ngrams with their frequencies.

Changing the Length of Ngrams

The default ngram size is 2. This can be changed by using the option
--ngram N, where N is the number of tokens in each ngram. For example,
to find all the trigrams in the file input.txt, you would run the
program as follows:

    % array-suffix-driver.pl --ngram 3 output.txt input.txt

Using User-Provided Token Definitions:

The default token definitions are:

    \w+
        -> this matches a contiguous sequence of alpha-numeric characters
    [\.,;:\?!]
        -> this matches a single punctuation mark

The default token definitions can be overridden by using the option
--token FILE, where FILE is the name of the file containing the regular
expressions on which the token definitions will be based. Each regular
expression in this FILE should be:

    1. on a line of its own
    2. delimited by forward slashes '/'
    3. a valid Perl regular expression

Removing character strings

The option --nontoken FILE allows a user to define regular expressions
that match strings that should not be considered as tokens. These
strings will be removed from the data and not counted or included in
Ngrams.

The --nontoken option is recommended when there are predictable
sequences of characters that you know should not be included as tokens
for purposes of counting Ngrams, finding collocations, etc.

For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in
the text being processed, you may want to include those in your list of
nontoken items so they are discarded. If not, a simple regex such as
/\w+/ will match 's', 'p', 'item', 'ptr' from these tags, leading to
confusing results.

The FILE following the --nontoken option should contain Perl regular
expressions delimited by forward slashes '/' that define non-tokens.
Multiple expressions may be placed on separate lines or be separated
via the '|' (Perl 'or') as in

    /regex1|regex2|../

The following are some examples of valid non-token definitions:

    /<\/?[sp]>/ : will remove xml tags like <s>, <p>, </s>, </p>.

    /\[\w+\]/   : will remove all words which appear in square brackets
                  like [p], [item], [123] and so on.

The program will first remove any string from the input data that
matches the non-token regular expressions, and only then will match the
remaining data against the token definitions.

The Output Format

Assume that the following are the contents of the input text file to
array-suffix-driver.pl; let us call the file test.txt:

    first line of text
    second line
    and a third line of text

Assume that array-suffix-driver.pl is run in its most basic mode:

    % array-suffix-driver.pl test.out test.txt

The output will contain all the bigrams found in the file test.txt using
the default tokens as specified above. The contents of the output file
test.out would be:

    11
    line<>of<>2
    of<>text<>2
    second<>line<>1
    line<>and<>1
    and<>a<>1
    a<>third<>1
    first<>line<>1
    third<>line<>1
    text<>second<>1

The number on the first line, 11, indicates that there were 11 bigrams
in test.txt. Following it are the bigrams that were found in test.txt,
with their tokens delimited by the diamond sign, "<>". The first bigram
is therefore line<>of<>, made up of the tokens "line" and "of" in that
order. The number after the diamond following the last token denotes
how many times this bigram occurred in the text.

The Marginals Option

To obtain a partial set of marginal counts for the bigrams, the option
--marginals must be set. This option outputs the individual frequency
counts of each token in the ngram.

Let us use our example from above but run the array-suffix-driver.pl
program as follows:

    % array-suffix-driver.pl --marginals test.out test.txt

The output will contain all the bigrams found in the file test.txt using
the default tokens as specified above, their frequency counts, and the
number of times each of the tokens in the bigram occurred in its
respective position.
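
The counting described here can be sketched in a few lines of Perl. This
is an illustration only: it uses plain hashes rather than the suffix
arrays the module is built on, and every variable name in it is made up
for this example. It tokenizes the test.txt text with the default token
definitions, then counts each bigram and how often each token appears in
the first and in the second position.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustration only: plain hashes, not the suffix arrays used by
# Array::Suffix. None of these names come from the module.
my $text = "first line of text\nsecond line\nand a third line of text\n";

# Default token definitions from this README: \w+ or one punctuation mark.
my @tokens = $text =~ /\w+|[\.,;:\?!]/g;

my (%bigram, %first, %second);
my $total = 0;
for my $i (0 .. $#tokens - 1) {
    my ($w1, $w2) = @tokens[$i, $i + 1];
    $bigram{"$w1<>$w2<>"}++;    # frequency of the bigram
    $first{$w1}++;              # times $w1 occupies the first position
    $second{$w2}++;             # times $w2 occupies the second position
    $total++;
}

# Report lines: highest frequency first, ties in alphabetical order.
# The tie order is a choice made for this sketch.
my @report = ($total);
for my $key (sort { $bigram{$b} <=> $bigram{$a} || $a cmp $b } keys %bigram) {
    my ($w1, $w2) = split /<>/, $key;
    push @report, "$key$bigram{$key} $first{$w1} $second{$w2}";
}
print "$_\n" for @report;
```

The order of bigrams with equal frequency in this sketch is alphabetical
and need not match the order used by the program.
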
The contents of the output file test.out would be:

    11
    line<>of<>2 3 2
    of<>text<>2 2 2
    second<>line<>1 1 3
    line<>and<>1 3 1
    and<>a<>1 1 1
    a<>third<>1 1 1
    first<>line<>1 1 3
    third<>line<>1 1 3
    text<>second<>1 1 1

The first number after the bigram is the frequency of the bigram in
test.txt. The second number is the number of times the first token was
seen in the first position of all the bigrams, and the third number is
the number of times the second token was seen in the second position of
all the bigrams.

Stoplists

The user may "stop" the Ngrams formed by array-suffix-driver.pl by
providing a list of stop-tokens through the option --stop FILE.

Each stop token in FILE should be a Perl regular expression that occurs
on a line by itself. This expression should be delimited by forward
slashes, as in /REGEX/. All regular expression capabilities in Perl are
supported except for regular expression modifiers (such as the "i" in
/REGEX/i).

The following are a few examples of valid entries in the stop list.

    /^\d+$/
    /\bthe\b/
    /\b[Tt][Hh][Ee]\b/
    /^and$/
    /\bor\b/
    /^be(ing)?$/

There are two modes in which a stop list can be used, AND and OR. The
default mode is AND, which means that an Ngram must be made up entirely
of words from the stoplist before it is eliminated. The OR mode
eliminates an Ngram if any of the words that make up the Ngram are
found in the stoplist.

Removing Low Frequency Ngrams:

We allow the user to either remove or to not display low frequency
Ngrams.

The user can remove low frequency Ngrams by using the option --remove N,
by which all Ngrams that occur fewer than N times are removed. The Ngram
and the individual frequency counts are adjusted accordingly upon the
removal of these Ngrams.

The user can choose not to display low frequency Ngrams by using the
option --frequency N, by which Ngrams that occur fewer than N times are
not displayed in the output.
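
One way to picture the two cutoffs is sketched below in Perl. It rests
on an assumption: that "adjusted accordingly" means the total and the
per-token counts are recomputed from the Ngrams that survive removal.
The helper tally() and all variable names are hypothetical and are not
part of Array::Suffix; the bigram frequencies are those of the test.txt
example, and only the first-position counts are shown to keep the sketch
short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bigram frequencies taken from the test.txt example in this README.
my %bigram = (
    'line<>of<>'     => 2,
    'of<>text<>'     => 2,
    'second<>line<>' => 1,
    'line<>and<>'    => 1,
    'and<>a<>'       => 1,
    'a<>third<>'     => 1,
    'first<>line<>'  => 1,
    'third<>line<>'  => 1,
    'text<>second<>' => 1,
);

# Hypothetical helper (not part of Array::Suffix): returns the total number
# of bigrams and the count of each token in the first position.
sub tally {
    my ($counts) = @_;
    my $total = 0;
    my %first;
    for my $key (keys %$counts) {
        my ($w1) = split /<>/, $key;
        $total      += $counts->{$key};
        $first{$w1} += $counts->{$key};
    }
    return ($total, \%first);
}

my $n = 2;    # the cutoff N

# Bigrams that occur at least N times.
my %kept;
for my $key (keys %bigram) {
    $kept{$key} = $bigram{$key} if $bigram{$key} >= $n;
}

# --frequency style: rare bigrams are hidden, counts come from the full set.
my ($freq_total, $freq_first) = tally(\%bigram);

# --remove style, as assumed here: counts are recomputed from what is kept.
my ($rem_total, $rem_first) = tally(\%kept);

for my $key (sort keys %kept) {
    my ($w1) = split /<>/, $key;
    printf "%s%d  '%s' in first position: %d (display cutoff) vs %d (removal)\n",
        $key, $kept{$key}, $w1, $freq_first->{$w1}, $rem_first->{$w1};
}
print "total: $freq_total (display cutoff) vs $rem_total (removal)\n";
```

Under this reading, both cutoffs show the same two bigrams for N = 2, but
the total drops from 11 to 4 and the first-position count of "line" drops
from 3 to 2 once the low frequency bigrams are removed.
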
Note that --frequency differs from the --remove option above in that the
frequency counts are not changed.

COPYRIGHT AND LICENCE

Copyright (C) 2004, Bridget Thomson McInnes

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: a copy of the GNU Free Documentation License is available on the
web and is included in this distribution as FDL.txt.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.