N-gram counter for large corpora
You can install the package using the following steps:
- Run pip install from an admin prompt:
pip uninstall VLNGramCounter -y
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/VLNGramCounter.git
Or, if you have the code locally:
pip uninstall VLNGramCounter -y
python -OO -m pip install -v c:/repos/TextCorpusLabs/VLNGramCounter
Counts the n-grams contained in a folder of TXT files.
VLNGramCounter -source d:/data/corpus -dest d:/data/corpus.ngrams.csv
The following are required parameters:
- source is the folder containing the TXT files.
- dest is the CSV file used to store the n-gram results.
The following are optional parameters:
- length is the length of the n-gram. The default is 1.
- chunk_size is the number of items held by the control structure before chunking. Higher values use more RAM but compute the overall result faster. The default is 1M.
- include counts only the values in this CSV list. The default is to count everything.
- exclude ignores the values in this CSV list. The default is to exclude nothing. Note: due to the order of operations, it only makes sense to exclude single tokens.
- cutoff is the minimum count an n-gram needs to be kept. The default is 2.
- top is the number of n-grams to save. The default is 10K.
- keep_case (flag) keeps the casing as-is before converting to tokens for counting. The default is to upper-case everything.
- keep_punct (flag) keeps all punctuation as-is before converting to tokens for counting. The default is to remove all tokens that are only punctuation.
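To make the chunk_size trade-off concrete, here is a minimal sketch of one way chunked counting can work. It is an assumption for illustration, not the package's actual implementation: the function names (count_in_chunks, _spill, _read) and the temp-file layout are hypothetical. Partial counts are kept in memory until chunk_size distinct n-grams accumulate, then written to a sorted temp file; at the end all files are merged. A larger chunk_size means fewer spills and merges, at the cost of more RAM.

```python
import heapq
import os
import tempfile
from collections import Counter
from itertools import groupby


def _spill(counts, paths):
    """Write one chunk to a temp file, sorted by n-gram, then empty the counter."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for gram in sorted(counts):
            fh.write(f"{gram}\t{counts[gram]}\n")
    paths.append(path)
    counts.clear()


def _read(path):
    """Yield (n-gram, count) pairs from one spilled chunk, in sorted order."""
    with open(path, encoding="utf-8") as fh:
        for row in fh:
            gram, n = row.rstrip("\n").split("\t")
            yield gram, int(n)


def count_in_chunks(ngrams, chunk_size=1_000_000):
    """Hypothetical helper: count n-grams while holding at most chunk_size keys.

    Assumes n-grams contain no tab or newline characters.
    """
    counts, paths = Counter(), []
    for gram in ngrams:
        counts[gram] += 1
        if len(counts) >= chunk_size:
            _spill(counts, paths)
    if counts:
        _spill(counts, paths)
    try:
        # Each file is sorted, so a k-way merge puts equal n-grams side by side.
        merged = heapq.merge(*(_read(p) for p in paths))
        # A dict is returned for brevity; streaming rows to a CSV would use less RAM.
        return {
            gram: sum(n for _, n in group)
            for gram, group in groupby(merged, key=lambda pair: pair[0])
        }
    finally:
        for p in paths:
            os.remove(p)
```

For example, count_in_chunks(["A", "B", "A"], chunk_size=2) spills twice and still returns the same totals as a single in-memory count.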
NOTE: The order of operations for complex counting is as follows:
- Transformation (keep_case)
- Exclusion (keep_punct > exclude)
- Inclusion (include)
- Filter (cutoff > top)
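The order of operations above can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's code: count_ngrams is a hypothetical function, whitespace tokenization is an assumption, and so is the choice to apply exclude to single tokens and include to whole n-grams. It also assumes the include and exclude values are already in the case the transformation step produces.

```python
import string
from collections import Counter


def count_ngrams(lines, length=1, include=None, exclude=None,
                 cutoff=2, top=10_000, keep_case=False, keep_punct=False):
    """Hypothetical helper mirroring the documented order of operations."""
    counts = Counter()
    for line in lines:
        # 1. Transformation: upper-case everything unless keep_case is set.
        if not keep_case:
            line = line.upper()
        tokens = line.split()  # assumption: whitespace tokenization
        # 2. Exclusion: punctuation-only tokens first, then the exclude list.
        if not keep_punct:
            tokens = [t for t in tokens
                      if not all(c in string.punctuation for c in t)]
        if exclude:
            tokens = [t for t in tokens if t not in exclude]
        # Build n-grams with a sliding window over the remaining tokens.
        ngrams = [" ".join(tokens[i:i + length])
                  for i in range(len(tokens) - length + 1)]
        # 3. Inclusion: keep only listed n-grams when an include list is given.
        if include is not None:
            ngrams = [g for g in ngrams if g in include]
        counts.update(ngrams)
    # 4. Filter: apply cutoff first, then keep the top entries by count.
    kept = [(g, n) for g, n in counts.most_common() if n >= cutoff]
    return kept[:top]
```

Because exclusion runs on tokens before the n-grams are assembled, an exclude entry such as "THE CAT" would never match anything in this sketch, which is one way to read the note that it only makes sense to exclude single tokens.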
The code in this repo is set up as a module. Debugging and testing assume that the module is already installed. To debug (F5) or run the tests (Ctrl + ; Ctrl + A), install the module as editable (see below).
pip uninstall VLNGramCounter -y
python -m pip install -e c:/repos/TextCorpusLabs/VLNGramCounter
When debugging in VSCode for the first time, consider adding the config below to the launch.json file.
"args": [
  "-source", "d:/data/corpus",
  "-dest", "d:/data/corpus.ngrams.csv",
  "-length", "1"
]