diff --git a/tutorials/tinystories/README.md b/tutorials/tinystories/README.md
new file mode 100644
index 000000000..47074cb3f
--- /dev/null
+++ b/tutorials/tinystories/README.md
@@ -0,0 +1,13 @@
+# TinyStories
+
+This tutorial demonstrates how to use NeMo Curator's Python API to curate the [TinyStories](https://arxiv.org/abs/2305.07759) dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, featuring words that are understood by 3- to 4-year-olds. The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine.
+
+For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.
+
+## Usage
+After installing the NeMo Curator package, run the following command:
+```
+python tutorials/tinystories/main.py
+```
+
+This will download the validation split of the TinyStories dataset and begin the data curation pipeline.