diff --git a/.gitignore b/.gitignore index c4957c3..34e36b0 100644 --- a/.gitignore +++ b/.gitignore @@ -2,6 +2,7 @@ __pycache__/ *.py[cod] *$py.class +_debug_data/ # C extensions *.so diff --git a/README.md b/README.md index 69ae97a..1d81528 100644 --- a/README.md +++ b/README.md @@ -95,6 +95,12 @@ Directory listing of 'axo-2024-04-13-19-13-05-0fb0/' in 'example-runs-vol' The LorA adapters are stored in `lora-out`. The merged weights are stored in `lora-out/merged `. Many inference frameworks can only load the merged weights, so it is handy to know where they are stored. +## Inspecting Flattened Data + +One of the key features of axolotl is that it flattens your data from a JSONL file into a prompt template format you specify in the config. These are where most mistakes are made when fine-tuning. + +See the [nbs/inspect_data.ipynb](nbs/inspect_data.ipynb) notebook for guide on how to inspect your data and ensure it is being flattened correctly. We recommend that you always inspect your data the first time you fine-tune a model on a new dataset. + ## Development ### Code overview diff --git a/nbs/inspect_data.ipynb b/nbs/inspect_data.ipynb new file mode 100644 index 0000000..855c85a --- /dev/null +++ b/nbs/inspect_data.ipynb @@ -0,0 +1,300 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a0cdf879-4e31-4c20-85dc-c7aea1f5eb66", + "metadata": {}, + "source": [ + "# Inspecting Flattened Data\n", + "\n", + "One of the key features of axolotl is that it flattens your data from a JSONL file into a prompt template format you specify in the config. In the case of the text to SQL dataset defined in [mistral.yml](config/mistral.yml), the prompt template is defined as:\n", + "\n", + "\n", + "```yaml\n", + "datasets:\n", + " # This will be the path used for the data when it is saved to the Volume in the cloud.\n", + " - path: data.jsonl\n", + " ds_type: json\n", + " type:\n", + " # JSONL file contains question, context, answer fields per line.\n", + " # This gets mapped to instruction, input, output axolotl tags.\n", + " field_instruction: question\n", + " field_input: context\n", + " field_output: answer\n", + " # Format is used by axolotl to generate the prompt.\n", + " format: |-\n", + " [INST] Using the schema context below, generate a SQL query that answers the question.\n", + " {input}\n", + " {instruction} [/INST]\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "c21d4907-e56d-40ab-aa0e-42a551e46c52", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Make sure you install the following dependencies first:\n", + "\n", + "```bash\n", + "pip install -U transformers datasets\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4e06c2c-edc0-4251-a479-bfeb6b0b7175", + "metadata": {}, + "outputs": [], + "source": [ + "import yaml, os\n", + "from pathlib import Path\n", + "from datasets import load_from_disk\n", + "from transformers import AutoTokenizer" + ] + }, + { + "cell_type": "markdown", + "id": "5202f6e0-db36-445b-84bc-b9cda090f5da", + "metadata": {}, + "source": [ + "## Step 1: Preprocess Data" + ] + }, + { + "cell_type": "markdown", + "id": "649b5567-bc68-4c5f-ac5e-a7579b5ab648", + "metadata": {}, + "source": [ + "It is often useful to just preprocess the data and inspect it before training. You can do this by passing the `--preproc-only` flag to the `train` command. This will preprocess the data and write it to the `datasets.path` specified in your config file. You can then inspect the data and make sure it is formatted correctly before training.\n", + "\n", + "For example, to preprocess the data for the `mistral.yml` config file, you would run:\n", + "\n", + "```bash\n", + "# run this from the root of the repo\n", + "modal run --detach src.train \\\n", + " --config=config/mistral.yml\\\n", + " --data=data/sqlqa.jsonl\\\n", + " --preproc-only \n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "625d37b1-1073-4129-bdb2-abd574c67925", + "metadata": {}, + "source": [ + "Modal will give you a run-id, which allows you to get the preprocessed data. For example, you will see something like this in the logs:\n", + "\n", + "```\n", + "Training complete. Run tag: axo-2024-05-09-19-04-56-90c0\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "e272bea2-b198-470b-ad81-694b260fdeea", + "metadata": {}, + "source": [ + "### Step 2: Download Data\n", + "\n", + "The Run tag can be used to download and inspect the preprocessed data with [modal volume](https://modal.com/docs/reference/cli/volume):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f88c7d7a-b2a6-4f09-8392-81fb3b4cf9a8", + "metadata": {}, + "outputs": [], + "source": [ + "# change this to your run tag\n", + "RUN_TAG='axo-2024-05-09-19-04-56-90c0'" + ] + }, + { + "cell_type": "markdown", + "id": "a2588366-3788-4f06-9158-0d96dbfeebff", + "metadata": {}, + "source": [ + "inspect the directory structure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d3c7ddb-8ab5-4e9b-9916-47a8c4226719", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Directory listing of \u001b[32m'axo-2024-05-09-19-04-56-90c0'\u001b[0m in \u001b[32m'example-runs-vol'\u001b[0m\n", + "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓\n", + "┃\u001b[1m \u001b[0m\u001b[1mfilename \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mtype\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mcreated/modified \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1msize \u001b[0m\u001b[1m \u001b[0m┃\n", + "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩\n", + "│ axo-2024-05-09-19-04-56-90c0/last_r… │ dir │ 2024-05-09 12:05 PDT │ 32 B │\n", + "│ axo-2024-05-09-19-04-56-90c0/logs.t… │ file │ 2024-05-09 12:04 PDT │ 66 B │\n", + "│ axo-2024-05-09-19-04-56-90c0/data.j… │ file │ 2024-05-09 12:04 PDT │ 1.3 MiB │\n", + "│ axo-2024-05-09-19-04-56-90c0/config… │ file │ 2024-05-09 12:04 PDT │ 1.7 KiB │\n", + "└──────────────────────────────────────┴──────┴──────────────────────┴─────────┘\n" + ] + } + ], + "source": [ + "! modal volume ls example-runs-vol {RUN_TAG}" + ] + }, + { + "cell_type": "markdown", + "id": "093f3f73-9bc4-43eb-8693-f47a0352dfff", + "metadata": {}, + "source": [ + "download the preprocessed data locally into a directory called _debug_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb72f0d2-2a67-4713-8980-bce404044261", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wrote 278 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/state.json\n", + "Wrote 355 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/dataset_info.json\n", + "Wrote 5512792 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/data-00000-of-00001.arrow\n" + ] + } + ], + "source": [ + "!rm -rf _debug_data\n", + "!modal volume get example-runs-vol {RUN_TAG}/last_run_prepared _debug_data" + ] + }, + { + "cell_type": "markdown", + "id": "b87bb506-c7a1-4c88-8c14-2bec0d0f7c11", + "metadata": {}, + "source": [ + "### Step 3: Analyze Data" + ] + }, + { + "cell_type": "markdown", + "id": "4daadf16-01a8-4e99-a76a-ce061430e301", + "metadata": {}, + "source": [ + "Get the right tokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17399b9e-2d9c-4b98-84f6-c096c43a447a", + "metadata": {}, + "outputs": [], + "source": [ + "with open('../config/mistral.yml', 'r') as f:\n", + " cfg = yaml.safe_load(f)\n", + "model_id = cfg['base_model']\n", + "tok = AutoTokenizer.from_pretrained(model_id)" + ] + }, + { + "cell_type": "markdown", + "id": "f263bc35-6ff8-4b8e-8198-5bf71caaf6fc", + "metadata": {}, + "source": [ + "Load the dataset into a HF Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df92646f-1069-4375-9166-27dd89d03433", + "metadata": {}, + "outputs": [], + "source": [ + "ds_dir = Path(f'_debug_data/{RUN_TAG}/last_run_prepared')\n", + "ds_path = [p for p in ds_dir.iterdir() if p.is_dir()][0]\n", + "ds = load_from_disk(str(ds_path))" + ] + }, + { + "cell_type": "markdown", + "id": "485f7b69-bb69-454e-ad1a-f4b6ca730b8e", + "metadata": {}, + "source": [ + "Verify that the data looks good" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a51a5ecd-4b8f-4e8b-a44b-3d4b5abdfa65", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " [INST] Using the schema context below, generate a SQL query that answers the question.\n", + "CREATE TABLE head (age INTEGER)\n", + "How many heads of the departments are older than 56 ? [/INST] [SQL] SELECT COUNT(*) FROM head WHERE age > 56 [/SQL]\n" + ] + } + ], + "source": [ + "print(tok.decode(ds['input_ids'][0]))" + ] + }, + { + "cell_type": "markdown", + "id": "74864064-1a00-4c14-b13c-ab6ac727a943", + "metadata": {}, + "source": [ + "## Resume A Training Run" + ] + }, + { + "cell_type": "markdown", + "id": "423bf5e0-67c3-445e-9162-6e2b126a7b65", + "metadata": {}, + "source": [ + "After you have inspected the data and you are satisified with the results, you can resume training the model without having to preprocess the data again. This is made possible with the `--run-to-resume` flag. For example, to resume the training run on this example, I can run this command:\n", + "\n", + "```bash\n", + "# run this from the root of the repo\n", + "modal run --detach src.train \\\n", + " --config=config/mistral.yml\\\n", + " --data=data/sqlqa.jsonl\\\n", + " --run-to-resume {RUN_TAG} \n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27476073-273e-46b0-bbb9-6760ff87afbb", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "python3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}