General Information

We know you're a professional, but here is some information you might find useful.

Use a virtual environment (venv) and the packages specified in the requirements.txt file.
Use the lm-dataformat library to create the resulting jsonl.zst file.
Don't forget the metadata for each document (characters, sentences, words, verbs, nouns, punctuations, symbols) and the manifest file (the most important source of information about the data and its rights). If in doubt, ask on the SpeakLeash Discord.
The data must be shuffled.
In the README.md file, always add a Usage section and a few sentences on how to use the tool.
For processing large files, we recommend using tqdm for progress reporting, threading, and saving state so that an interrupted run can be resumed.
Have fun!
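
The tips on metadata, shuffling, and lm-dataformat can be sketched roughly as follows. This is only an illustration, not the project's actual main.py: the helper names (basic_metadata, shuffled, write_archive), the output directory "data", and the fixed seed are assumptions made for the example. Only the counts that need no NLP model are computed here; verbs, nouns, and symbols would require a part-of-speech tagger, which this sketch does not cover.

```python
import random
import re
import string


def basic_metadata(text):
    """Count the surface statistics that need no NLP model."""
    return {
        "characters": len(text),
        "words": len(text.split()),
        # Rough heuristic: each run of '.', '!' or '?' ends one sentence.
        "sentences": len(re.findall(r"[.!?]+", text)),
        # string.punctuation covers ASCII only; typographic quotes are missed.
        "punctuations": sum(ch in string.punctuation for ch in text),
    }


def shuffled(documents, seed=42):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    result = list(documents)
    random.Random(seed).shuffle(result)
    return result


def write_archive(documents, out_dir="data"):
    """Write shuffled documents with their metadata to a jsonl.zst archive.

    Assumes lm-dataformat is installed (for example via requirements.txt).
    """
    from lm_dataformat import Archive  # third-party import kept local

    archive = Archive(out_dir)
    for text in shuffled(documents):
        archive.add_data(text, meta=basic_metadata(text))
    archive.commit()  # finalizes the .jsonl.zst file inside out_dir


if __name__ == "__main__":
    print(basic_metadata("Hello world. How are you?"))
```

For a real dataset, write_archive would be called once with the full list of document texts; wrapping the loop in tqdm would add a progress bar for large inputs.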

Usage

Run example:

python main.py
