We know you're a professional, but here is some information you might find useful.
- Use a virtual environment (venv) and the packages specified in the requirements.txt file.
- Use the lm-dataformat library to create the resulting jsonl.zst file.
- Don't forget the metadata for each document (characters, sentences, words, verbs, nouns, punctuations, symbols) and the manifest file (most importantly, the source of the data and the rights to it). If in doubt, ask on the SpeakLeash Discord.
- The data must be shuffled.
- In the README.md file, always add a Usage section and a few sentences on how to use the tool.
- For processing large files, we recommend using tqdm, threading, and saving state (for the resume function).
- Have fun!
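
The advice on large files (tqdm, threading, saved state) can be sketched roughly as below. This is a minimal illustration, not a required layout: the `state.json` file name and the helpers `load_done`, `save_done` and `process_all` are hypothetical, and item ids are assumed to be strings. If tqdm is not installed, the sketch falls back to a plain loop.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

try:
    from tqdm import tqdm  # progress bar
except ImportError:
    # Fallback so the sketch still runs without tqdm installed.
    def tqdm(iterable, **kwargs):
        return iterable

STATE_FILE = Path("state.json")  # hypothetical name for the resume file


def load_done(state_file=STATE_FILE):
    """Return the set of item ids finished in a previous run."""
    state_file = Path(state_file)
    if state_file.exists():
        return set(json.loads(state_file.read_text(encoding="utf-8")))
    return set()


def save_done(done, state_file=STATE_FILE):
    """Write finished ids to a temp file, then swap it in, so a crash cannot corrupt the state."""
    state_file = Path(state_file)
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    tmp.replace(state_file)


def process_all(items, worker, state_file=STATE_FILE, max_workers=4, save_every=100):
    """Run worker(payload) in threads for each (item_id, payload), skipping ids already done."""
    done = load_done(state_file)
    todo = [(item_id, payload) for item_id, payload in items if item_id not in done]
    results = []
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            mapped = pool.map(lambda pair: (pair[0], worker(pair[1])), todo)
            for n, (item_id, result) in enumerate(tqdm(mapped, total=len(todo)), start=1):
                results.append(result)
                done.add(item_id)
                if n % save_every == 0:
                    save_done(done, state_file)  # periodic checkpoint
    finally:
        save_done(done, state_file)  # keep progress even if a worker raised
    return results
```

Running `process_all` a second time with the same state file skips everything already recorded, which is the resume behaviour. Threads help mostly with I/O-bound work such as downloads; for CPU-heavy processing a process pool may be a better fit.
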
Run example:
python main.py
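
As a rough idea of what such a `main.py` might contain, here is a minimal sketch that computes per-document metadata, shuffles the documents and writes them with lm-dataformat's `Archive`. The helper names (`basic_meta`, `shuffled`, `write_archive`) and the way sentences and symbols are counted are assumptions made for illustration. The `verbs` and `nouns` counts need a part-of-speech tagger and are left out here.

```python
import random
import re
import string


def basic_meta(text):
    """Count simple per-document statistics using only the standard library."""
    # Rough sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    punctuations = sum(ch in string.punctuation for ch in text)
    # Assumption: a "symbol" is anything that is not a letter, digit,
    # whitespace or ASCII punctuation (for example a currency sign).
    symbols = sum(
        not (ch.isalnum() or ch.isspace() or ch in string.punctuation)
        for ch in text
    )
    return {
        "characters": len(text),
        "sentences": len(sentences),
        "words": len(text.split()),
        "punctuations": punctuations,
        "symbols": symbols,
        # "verbs" and "nouns" require a POS tagger and are omitted in this sketch.
    }


def shuffled(documents, seed=None):
    """Return a shuffled copy of the documents; a fixed seed makes the order reproducible."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    return docs


def write_archive(texts, out_dir, seed=None):
    """Shuffle the texts and write them, with metadata, to a jsonl.zst file in out_dir."""
    # Third-party import kept local so the helpers above work without it.
    from lm_dataformat import Archive

    archive = Archive(out_dir)
    for text in shuffled(texts, seed):
        archive.add_data(text, meta=basic_meta(text))
    archive.commit()  # flushes the documents to a .jsonl.zst file
```

Calling `write_archive(texts, "data")` with a list of strings would leave one compressed file in the `data` directory. The manifest file is not covered by this sketch.
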