Below is a list of the corpus tools we use at Text Corpus Labs. They are intended to be general-purpose building blocks that allow for conversion between our different processes.
NOTE: This project is currently undergoing a retrofit. The checklist below shows the conversion status. While the retrofit is in progress, the old code will still work; it is just nested in a subfolder.
You can install the package using the following steps:
- pip install using an admin prompt.
      pip uninstall buildingblocks
      python -OO -m pip install -v git+https://github.com/TextCorpusLabs/building-blocks.git
You can run the package in the following ways:
- Pull fields from every JSON object in a JSONL file into a CSV file.
      buildingblocks extract jsonl_to_csv `
         -source d:/data/corpus `
         -dest d:/data/corpus.csv
  The following are optional parameters:
  - fields: the names of the fields to extract. It defaults to "id".
- Count the n-grams in a JSONL file.
      buildingblocks transform ngram `
         -source d:/data/corpus `
         -dest d:/data/corpus.ngrams.csv
  The following are optional parameters:
  - fields: the names of the fields to process. It defaults to "text".
  - size: the length of the n-gram. It defaults to 1.
  - top: the number of n-grams to save. It defaults to 10K.
  - chunk: controls how many n-grams are chunked to disk at a time to prevent out-of-memory (OOM) errors. Higher values use more RAM, but compute the overall value faster. It defaults to 10M.
  - keep_case (flag): keeps the casing of fields as-is before converting to tokens for counting.
  - keep_punct (flag): keeps all punctuation of fields as-is before converting to tokens for counting.
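To make the two commands above more concrete, here is a minimal Python sketch of what they describe. It is not the package's implementation. The function names are hypothetical, a missing field is assumed to become an empty cell, tokens are assumed to be whitespace-separated, and everything is counted in memory, so the chunk parameter is not modeled.

```python
import csv
import json
import re
from collections import Counter
from typing import Iterable, Iterator, List, Sequence, Tuple


def jsonl_to_rows(lines: Iterable[str], fields: Sequence[str]) -> Iterator[List[str]]:
    """Yield one row per JSON object, holding only the requested fields."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        # Assumption: a field that is absent becomes an empty cell.
        yield [str(obj.get(name, "")) for name in fields]


def write_csv(lines: Iterable[str], fields: Sequence[str], dest_path: str) -> None:
    """Write a header row followed by the extracted rows."""
    with open(dest_path, "w", encoding="utf-8", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(fields)
        writer.writerows(jsonl_to_rows(lines, fields))


def tokenize(text: str, keep_case: bool = False, keep_punct: bool = False) -> List[str]:
    """Assumed tokenizer: optional lowercasing and punctuation removal, then split."""
    if not keep_case:
        text = text.lower()
    if not keep_punct:
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def count_ngrams(
    lines: Iterable[str],
    fields: Sequence[str] = ("text",),
    size: int = 1,
    top: int = 10_000,
    keep_case: bool = False,
    keep_punct: bool = False,
) -> List[Tuple[str, int]]:
    """Return the `top` most frequent n-grams of length `size`."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        for name in fields:
            tokens = tokenize(str(obj.get(name, "")), keep_case, keep_punct)
            # Slide a window of `size` tokens across the field.
            for i in range(len(tokens) - size + 1):
                counts[" ".join(tokens[i:i + size])] += 1
    return counts.most_common(top)
```

Both functions take any iterable of JSONL lines, so an open file handle can be passed directly.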
All script commands are presented in PowerShell syntax. If you use a different shell, your syntax will be different.
Adding -O to the front of any script runs it in "optimized" mode. This can give as much as a 50% performance boost in some cases, but it makes errors harder to interpret. If there is an error in a run, remove the -O, capture the error, and submit an issue.
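One reason optimized mode can obscure failures is that Python's -O flag removes assert statements and sets __debug__ to False (-OO additionally drops docstrings). The sketch below illustrates that effect. The parse_size helper is hypothetical, and whether buildingblocks itself relies on assertions is an assumption.

```python
import sys


def parse_size(value: str) -> int:
    """Hypothetical helper: parse an n-gram size given on the command line."""
    size = int(value)
    # Under `python -O` this assert is removed entirely, so a bad value
    # passes through here and only fails later, far from the real cause.
    assert size >= 1, f"size must be >= 1, got {size}"
    return size


def optimization_level() -> int:
    """Return 0 for a normal run, 1 for -O, and 2 for -OO."""
    return sys.flags.optimize


if __name__ == "__main__":
    print("optimization level:", optimization_level())
    print("asserts active:", __debug__)
    print("parsed size:", parse_size("3"))
```

Running the file with and without -O shows the difference in the first two printed lines.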
- Combine a folder of JSON files into a single JSONL file.
- Combine a folder of TXT files into a single JSONL file.
- Convert a JSONL file into a smaller JSONL file by keeping only some elements.
- Convert a folder of TXT files into a folder of bigger TXT files.
- Convert a JSONL file into a JSONT file.
- Convert a JSONT file into a JSONL file.
- Extract a folder of interleaved TXT files from a JSONL file.
- Extract a folder of JSON files from a JSONL file.
- Extract a folder of TXT files from a JSONL file.
- Merge several folders of JSON files into a single folder of JSON files based on their file name.
- Merge several folders of TXT files into a single folder of TXT files based on their file name.
- Tokenize a JSONL file using the NLTK defaults (Punkt + Penn Treebank).
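As an illustration of the first tool in the list, combining a folder of JSON files into a single JSONL file can be sketched as follows. This is a minimal sketch, not the package's code. The function name combine_json_folder is hypothetical, and sorting the files by name is an assumption made only to get a reproducible line order.

```python
import json
from pathlib import Path


def combine_json_folder(source: Path, dest: Path) -> int:
    """Write every *.json file in `source` as one line of `dest`; return the count."""
    count = 0
    with dest.open("w", encoding="utf-8", newline="\n") as out:
        # Assumption: sort by file name so the output order is reproducible.
        for path in sorted(source.glob("*.json")):
            with path.open("r", encoding="utf-8") as fp:
                obj = json.load(fp)
            # json.dumps escapes newlines inside strings, so each object
            # occupies exactly one line of the JSONL file.
            out.write(json.dumps(obj, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call such as combine_json_folder(Path("d:/data/corpus"), Path("d:/data/corpus.jsonl")) would produce one line per input file.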
Use the instructions below to set up the module for local development.
- Clone this repository, then open an Admin shell to the ~/ directory.
- Install the required modules.
      pip uninstall buildingblocks
      pip install -e c:/repos/TextCorpusLabs/building-blocks
- Set up the ~/.vscode/launch.json file (VS Code only).
  - Click the "Run and Debug Charm"
  - Click the "create a launch.json file" link
  - Select "Python"
  - Select "module" and enter buildingblocks
- Select one of the following modes and add the below args to the launch.json file. The args node should be a sibling of the module node. You will need to change your pathing and arguments. The first two arguments determine the command; the other arguments are the command's parameters.
      "args": ["extract", "jsonl_to_csv", "-source", "d:/data/corpus", "-dest", "d:/data/corpus.csv", "-fields", "id,text"]
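The way the first two arguments select a command while the rest become its parameters can be sketched with nested argparse subparsers. How buildingblocks actually parses its command line is not stated here, so treat this as an assumption-laden illustration. The build_parser function is hypothetical, and splitting -fields on commas is inferred from the "id,text" value above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: <group> <command> followed by the command's options."""
    parser = argparse.ArgumentParser(prog="buildingblocks")
    groups = parser.add_subparsers(dest="group", required=True)

    # First argument: the group, e.g. "extract".
    extract = groups.add_parser("extract")
    commands = extract.add_subparsers(dest="command", required=True)

    # Second argument: the command within that group.
    jsonl_to_csv = commands.add_parser("jsonl_to_csv")
    jsonl_to_csv.add_argument("-source", required=True)
    jsonl_to_csv.add_argument("-dest", required=True)
    jsonl_to_csv.add_argument("-fields", default="id")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "extract", "jsonl_to_csv",
        "-source", "d:/data/corpus",
        "-dest", "d:/data/corpus.csv",
        "-fields", "id,text",
    ])
    # Assumption: field names are passed as one comma-separated value.
    print(args.group, args.command, args.fields.split(","))
```

The list passed to parse_args mirrors the args node in launch.json, which is why the debugger run behaves like the equivalent shell command.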