Skip to content

jimregan/ngramtool

Repository files navigation

NGramTool (http://www.nlplab.cn/zhangle/ngram.html)
===================================================

This package contains a set of tools for manipulating N-gram statistics from
large raw corpus.  The tools provided here distinguish themselves from other
N-Gram extraction programs in that:
    1. Unicode code base.
        All text are represented as Unicode (UCS16) internally for easily
        handling of oriental languages such as Chinese, Korean or Japanese.  

    2. Both Word N-gram and Character N-gram can be extracted.
        This feature is especially useful for Oriental language processing.
        Since in those languages (like Chinese) there is no explicit word
        boundary between words, and the basic processing units are usually
        single characters.

    3. Statistical Substring Reduction (SSR) algorithms (see below).
        Statistical Substring Reduction is a powerful procedure to remove
        redundant N-grams from an N-gram set using N-gram statistics
        information. For example, if both "People's republic of China" and
        "People's republic" occur 10 times in the corpus, the latter will be
        removed, for it is the substring of the former. The detail can be
        found in the following paper:

           "Statistical Substring Reduction in Linear Time"
           Lv Xue-qiang, Zhang Le and Hu Junfeng
           IJCNLP-04, Hai Nan island, P.R.China. 2004

Support Platforms
=================
Tested plasforms include: GNU/Linux, FreeBSD, NetBSD, MingW32, Cygwin.

NOTICE:
    If you use these programs on win32 platforms please specify the file
    name as the last argument of the command line option. All options given
    after the file name will be ignored! Therefore use `foo -a2 file' instead
    of `foo file -a2'.

Currently the following programs are provided:

text2ngram
=========================================================================
Extract N-Gram from raw corpus using an improved version of Nagao 1994
algorithm. Optionally the corpus can be parsed and saved to disk (in
Unicode) and extract N-grams later. Type text2ngram -h for a short help.
Here are some examples:
1. text2ngram -n3 file
Extract all word trigrams from a file (words are separated by blank and tab).
The output is sent to stdout.

2. text2ngram -n3 -m10 -f5 file
Extract word 3 to 10-gram whose frequency >= 5 from a text file
and output word N-grams to stdout.

3. text2ngram -c -n3 -m10 -f5 -F gbk -T gbk file
The same as above except that this time we extract *character* N-grams
(-c option) from a Chinese file encoded in GBK. The output is also GBK. If you
do not specify -F gbk or -T gbk option, the input/output encodings are assumed
to be UTF-8.

4. text2ngram -o corpus file
Parse text file and generate a parsed corpus (corpus.ptable and corpus.ltable) for later
word N-gram extraction (using `extractngram' utility). For more on what happens please
refer to (Nagao 1994) 

5. text2ngram -F gbk -c -o corpus file
The same as 4 but prepare for *character* N-gram extraction from a file encoded in GBK
(Chinese). 

extractngram
=========================================================================
Extract N-gram from parsed table file generated by `text2ngarm' program.
Example:
1. extractngram -n3 -m5 -i corpus
will extract word 3-gram, 4-gram and 5-gram from corpus.ngram corpus.ptable
and corput.ltable previously saved with `text2ngarm -o corpus' and write
results to stdout.  if -m5 is omitted, only 3-gram is extracted from ngram
table.

2. extractngram -c -T gbk -n3 -m5 -i corpus
The same as above example except that this time we extract *character* N-grams
(-c option) and output N-grams in GBK encoding. The corpus.* files must be
saved with `text2ngarm -c -o corpus' before.

strreduction
=========================================================================
Implement four Statistical Substring Reduction (SSR) algorithms.
The first algorithm has a O(n^2) complexity and is useful-less in real world
applications. It is included for completeness. The latter three SSR algorithms
all have an ideal time complexity of O(n), so can be used on very large N-gram
set.
example:
1. strreduction -a2  < input > output
   Perform word level SSR on the input stream using algorithm 2 (can be 1-4),
   and output the result to output. The input format is one "word  frequency"
   pair per line.

2. strreduction -a2 -c -F gbk -T gbk < input > output
   The same as the above example but we use character level SSR this time on
   Chinese input stream encoded in GBK (-F gbk). Output is GBK too (-T gbk).  The input format
   is one "word  frequency" pair per line.

3. strreduction -a2 -c -F gbk -T gbk -s -t -f 3 < input > output
   The same as the above example but we use a reduction threshold of 3 instead
   of the default value of 1 (-f 3). The output is also sorted (-s). Finally the SSR
   processing time is printed on stderr (-t).

For a detail introduction on SSR operation please refer to:

   "Statistical Substring Reduction in Linear Time"
   Lv Xue-qiang, Zhang Le and Hu Junfeng
   IJCNLP-04, Hai Nan island, P.R.China. 2004

For building and installation instructions please see the INSTALL file.

Author:
  Zhang Le <[email protected]>