Showing 3 changed files with 307 additions and 0 deletions.
.gitignore
@@ -2,6 +2,7 @@
 __pycache__/
 *.py[cod]
 *$py.class
+_debug_data/
 
 # C extensions
 *.so
@@ -0,0 +1,300 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a0cdf879-4e31-4c20-85dc-c7aea1f5eb66",
   "metadata": {},
   "source": [
    "# Inspecting Flattened Data\n",
    "\n",
    "One of the key features of axolotl is that it flattens your data from a JSONL file into the prompt template format you specify in the config. In the case of the text-to-SQL dataset defined in [mistral.yml](config/mistral.yml), the prompt template is defined as:\n",
    "\n",
    "```yaml\n",
    "datasets:\n",
    "  # This will be the path used for the data when it is saved to the Volume in the cloud.\n",
    "  - path: data.jsonl\n",
    "    ds_type: json\n",
    "    type:\n",
    "      # JSONL file contains question, context, answer fields per line.\n",
    "      # This gets mapped to instruction, input, output axolotl tags.\n",
    "      field_instruction: question\n",
    "      field_input: context\n",
    "      field_output: answer\n",
    "      # Format is used by axolotl to generate the prompt.\n",
    "      format: |-\n",
    "        [INST] Using the schema context below, generate a SQL query that answers the question.\n",
    "        {input}\n",
    "        {instruction} [/INST]\n",
    "```"
   ]
  },
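  {
   "cell_type": "markdown",
   "id": "1f2e3d4c-1111-4a11-8a11-0a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "To make the flattening concrete, the next cell is a small hand-written sketch (not part of the repo's pipeline): it applies the same template to one record. The `TEMPLATE` string and `record` dict below are illustrative assumptions; axolotl performs this step for you during preprocessing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a3b4c5d-2222-4b22-8b22-1b2c3d4e5f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only (assumption, not axolotl internals): flatten one record by hand.\n",
    "TEMPLATE = (\n",
    "    '[INST] Using the schema context below, generate a SQL query that answers the question.\\n'\n",
    "    '{input}\\n'\n",
    "    '{instruction} [/INST]'\n",
    ")\n",
    "\n",
    "# One record with the question/context/answer fields mapped in the config.\n",
    "record = {\n",
    "    'question': 'How many heads of the departments are older than 56 ?',\n",
    "    'context': 'CREATE TABLE head (age INTEGER)',\n",
    "    'answer': 'SELECT COUNT(*) FROM head WHERE age > 56',\n",
    "}\n",
    "\n",
    "prompt = TEMPLATE.format(input=record['context'], instruction=record['question'])\n",
    "print(prompt)\n",
    "print(record['answer'])  # the completion the model is trained to produce"
   ]
  },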
  {
   "cell_type": "markdown",
   "id": "c21d4907-e56d-40ab-aa0e-42a551e46c52",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Make sure you install the following dependencies first:\n",
    "\n",
    "```bash\n",
    "pip install -U transformers datasets\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4e06c2c-edc0-4251-a479-bfeb6b0b7175",
   "metadata": {},
   "outputs": [],
   "source": [
    "import yaml, os\n",
    "from pathlib import Path\n",
    "from datasets import load_from_disk\n",
    "from transformers import AutoTokenizer"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5202f6e0-db36-445b-84bc-b9cda090f5da",
   "metadata": {},
   "source": [
    "## Step 1: Preprocess Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "649b5567-bc68-4c5f-ac5e-a7579b5ab648",
   "metadata": {},
   "source": [
    "It is often useful to just preprocess the data and inspect it before training. You can do this by passing the `--preproc-only` flag to the `train` command. This will preprocess the data and write it to the `datasets.path` specified in your config file. You can then inspect the data and make sure it is formatted correctly before training.\n",
    "\n",
    "For example, to preprocess the data for the `mistral.yml` config file, you would run:\n",
    "\n",
    "```bash\n",
    "# run this from the root of the repo\n",
    "modal run --detach src.train \\\n",
    "    --config=config/mistral.yml \\\n",
    "    --data=data/sqlqa.jsonl \\\n",
    "    --preproc-only\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "625d37b1-1073-4129-bdb2-abd574c67925",
   "metadata": {},
   "source": [
    "Modal will give you a run tag, which you can use to retrieve the preprocessed data. For example, you will see something like this in the logs:\n",
    "\n",
    "```\n",
    "Training complete. Run tag: axo-2024-05-09-19-04-56-90c0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e272bea2-b198-470b-ad81-694b260fdeea",
   "metadata": {},
   "source": [
    "## Step 2: Download Data\n",
    "\n",
    "The run tag can be used to download and inspect the preprocessed data with [modal volume](https://modal.com/docs/reference/cli/volume):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f88c7d7a-b2a6-4f09-8392-81fb3b4cf9a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# change this to your run tag\n",
    "RUN_TAG = 'axo-2024-05-09-19-04-56-90c0'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2588366-3788-4f06-9158-0d96dbfeebff",
   "metadata": {},
   "source": [
    "Inspect the directory structure of the run on the volume:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d3c7ddb-8ab5-4e9b-9916-47a8c4226719",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Directory listing of \u001b[32m'axo-2024-05-09-19-04-56-90c0'\u001b[0m in \u001b[32m'example-runs-vol'\u001b[0m\n",
      "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓\n",
      "┃\u001b[1m \u001b[0m\u001b[1mfilename                            \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mtype\u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mcreated/modified    \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1msize   \u001b[0m\u001b[1m \u001b[0m┃\n",
      "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩\n",
      "│ axo-2024-05-09-19-04-56-90c0/last_r… │ dir  │ 2024-05-09 12:05 PDT │ 32 B    │\n",
      "│ axo-2024-05-09-19-04-56-90c0/logs.t… │ file │ 2024-05-09 12:04 PDT │ 66 B    │\n",
      "│ axo-2024-05-09-19-04-56-90c0/data.j… │ file │ 2024-05-09 12:04 PDT │ 1.3 MiB │\n",
      "│ axo-2024-05-09-19-04-56-90c0/config… │ file │ 2024-05-09 12:04 PDT │ 1.7 KiB │\n",
      "└──────────────────────────────────────┴──────┴──────────────────────┴─────────┘\n"
     ]
    }
   ],
   "source": [
    "!modal volume ls example-runs-vol {RUN_TAG}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "093f3f73-9bc4-43eb-8693-f47a0352dfff",
   "metadata": {},
   "source": [
    "Download the preprocessed data locally into a directory called `_debug_data`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb72f0d2-2a67-4713-8980-bce404044261",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wrote 278 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/state.json\n",
      "Wrote 355 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/dataset_info.json\n",
      "Wrote 5512792 bytes to _debug_data/axo-2024-05-09-19-04-56-90c0/last_run_prepared/f296ca80661a80bf05a90dcbd89b0525/data-00000-of-00001.arrow\n"
     ]
    }
   ],
   "source": [
    "!rm -rf _debug_data\n",
    "!modal volume get example-runs-vol {RUN_TAG}/last_run_prepared _debug_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b87bb506-c7a1-4c88-8c14-2bec0d0f7c11",
   "metadata": {},
   "source": [
    "## Step 3: Analyze Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4daadf16-01a8-4e99-a76a-ce061430e301",
   "metadata": {},
   "source": [
    "Load the tokenizer for the base model specified in the config:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17399b9e-2d9c-4b98-84f6-c096c43a447a",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('../config/mistral.yml', 'r') as f:\n",
    "    cfg = yaml.safe_load(f)\n",
    "model_id = cfg['base_model']\n",
    "tok = AutoTokenizer.from_pretrained(model_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f263bc35-6ff8-4b8e-8198-5bf71caaf6fc",
   "metadata": {},
   "source": [
    "Load the preprocessed data into a Hugging Face `Dataset`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df92646f-1069-4375-9166-27dd89d03433",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_dir = Path(f'_debug_data/{RUN_TAG}/last_run_prepared')\n",
    "# the prepared dataset is written to a hash-named subdirectory\n",
    "ds_path = [p for p in ds_dir.iterdir() if p.is_dir()][0]\n",
    "ds = load_from_disk(str(ds_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "485f7b69-bb69-454e-ad1a-f4b6ca730b8e",
   "metadata": {},
   "source": [
    "Verify that the data looks good by decoding the first flattened example back into text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a51a5ecd-4b8f-4e8b-a44b-3d4b5abdfa65",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<s> [INST] Using the schema context below, generate a SQL query that answers the question.\n",
      "CREATE TABLE head (age INTEGER)\n",
      "How many heads of the departments are older than 56 ? [/INST] [SQL] SELECT COUNT(*) FROM head WHERE age > 56 [/SQL]</s>\n"
     ]
    }
   ],
   "source": [
    "print(tok.decode(ds['input_ids'][0]))"
   ]
  },
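  {
   "cell_type": "markdown",
   "id": "3c4d5e6f-3333-4c33-8c33-2c3d4e5f6071",
   "metadata": {},
   "source": [
    "Beyond eyeballing a single example, a quick structural check can catch formatting problems early. The next cell is a small optional sketch (not part of the original walkthrough) that uses the standard `datasets` API to report the columns, row count, and token-length spread of the prepared dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d5e6f70-4444-4d44-8d44-3d4e5f607182",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity checks (sketch): column names, row count, token-length spread.\n",
    "print('columns:', ds.column_names)\n",
    "print('rows:', ds.num_rows)\n",
    "\n",
    "lengths = [len(ids) for ids in ds['input_ids']]\n",
    "print('shortest example:', min(lengths), 'tokens')\n",
    "print('longest example:', max(lengths), 'tokens')"
   ]
  },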
  {
   "cell_type": "markdown",
   "id": "74864064-1a00-4c14-b13c-ab6ac727a943",
   "metadata": {},
   "source": [
    "## Resume a Training Run"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "423bf5e0-67c3-445e-9162-6e2b126a7b65",
   "metadata": {},
   "source": [
    "After you have inspected the data and are satisfied with the results, you can resume training without having to preprocess the data again. This is made possible with the `--run-to-resume` flag. For example, to resume the training run from this walkthrough, you would run:\n",
    "\n",
    "```bash\n",
    "# run this from the root of the repo\n",
    "modal run --detach src.train \\\n",
    "    --config=config/mistral.yml \\\n",
    "    --data=data/sqlqa.jsonl \\\n",
    "    --run-to-resume {RUN_TAG}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27476073-273e-46b0-bbb9-6760ff87afbb",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}