
Update Zephyr configs to account for UltraFeedback & TRL fixes #88

Merged
lewtun merged 29 commits into main from zephyr-repro on Jan 10, 2024

Conversation

lewtun (Member) commented on Jan 4, 2024:

Since the release of zephyr-7b-beta, there have been several important developments in the code and data used to train this model:

  • UltraFeedback was fixed to correct a few thousand incorrect labels, and the community pointed out that it is better to filter out the TruthfulQA subset to avoid contamination with the Open LLM Leaderboard
  • The learning rate scheduler in trl was not working correctly with packing, which affects the way we train the initial SFT model for downstream optimisation
  • Gradient checkpointing was not available for DPO when we trained zephyr-7b-beta, but is now available via the use_reentrant:True arg (see the sketch after this list)
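
For illustration, here is a minimal sketch of how that gradient checkpointing setting can be passed through transformers (the use_reentrant value mirrors the description above; the output path is a placeholder):

```python
from transformers import TrainingArguments

# Sketch: enable gradient checkpointing for the DPO step, setting the
# reentrant behaviour explicitly (gradient_checkpointing_kwargs requires
# transformers >= 4.35).
training_args = TrainingArguments(
    output_dir="data/zephyr-7b-dpo",  # placeholder path
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)
```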

Given these changes, we've decided to do a full re-run of the Zephyr recipe to reconstruct a new set of hyperparameters that "just work" for full training and QLoRA.

The most notable changes include (a consolidated sketch follows this list):

  • Promoting QLoRA as the main alternative to full training. One issue with LoRA + ZeRO-3 is that it's not possible to load adapters on sharded models (e.g. you can't easily load the SFT and DPO adapters first and then shard the resulting model). Given that Zephyr is just a 7B model, QLoRA works great with DDP and is the simpler alternative to promote.
  • beta=0.01 gave better perf than beta=0.1
  • The global batch size was tuned for best perf (it turns out smaller batch sizes tend to work better for QLoRA). tl;dr we use GBS=128 for SFT/DPO with full-training and GBS=64/32 for SFT/DPO with QLoRA
  • The lora_r and lora_alpha hparams were tuned for best perf - it turns out lora_r=lora_alpha=16 is good for DPO irrespective of what values were used for SFT
  • Reducing the number of DPO epochs to 1 gave similar perf to the original Zephyr model, while being more compute efficient
  • Using AdamW and a cosine scheduler gave better perf in DPO
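
As a rough consolidation of the settings above, the QLoRA DPO run might be wired up like this (a sketch, not the recipe's exact config; the batch-size arithmetic assumes 8 GPUs, and the output path and LoRA dropout are illustrative):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# lora_r = lora_alpha = 16 for DPO, irrespective of the SFT values;
# dropout and task_type are illustrative defaults.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

training_args = TrainingArguments(
    output_dir="data/zephyr-7b-dpo-qlora",  # placeholder
    num_train_epochs=1,                     # 1 DPO epoch
    per_device_train_batch_size=4,          # 4 per device x 8 GPUs x 1 step = GBS 32
    gradient_accumulation_steps=1,
    optim="adamw_torch",                    # AdamW ...
    lr_scheduler_type="cosine",             # ... with a cosine scheduler
)
```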

MT Bench Scores

There's some variability in MT-Bench, so treat these scores with a +/- 0.1 uncertainty:

| Model | MT-Bench Score |
| --- | --- |
| alignment-handbook/zephyr-7b-sft-full | 6.350 |
| alignment-handbook/zephyr-7b-dpo-full | 7.403 |
| alignment-handbook/zephyr-7b-sft-qlora | 6.484 |
| alignment-handbook/zephyr-7b-dpo-qlora | 7.544 |

Codebase changes

  • The formatting of dialogues for DPO has been extended to multi-turn contexts
  • Added a helper function to enable intermediate checkpoint loading for failed runs
  • Added the DPO loss_type to the config (see the sketch after this list)
  • Fixed upload hanging with DeepSpeed when pushing checkpoints to the Hub
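
Since the loss_type question comes up often, here is a minimal sketch of where that config value ends up: it is passed through to trl's DPOTrainer. The wrapper function and its arguments are illustrative, not the handbook's exact code:

```python
from trl import DPOTrainer

def build_dpo_trainer(model, tokenizer, train_dataset, training_args, peft_config):
    # Sketch: the loss_type key added to the config maps to DPOTrainer's
    # loss_type argument ("sigmoid" is TRL's default DPO loss; at the time
    # of this PR, "hinge", "ipo", and "kto_pair" were the other options).
    return DPOTrainer(
        model,
        ref_model=None,            # with peft_config set, TRL uses an implicit reference model
        args=training_args,
        beta=0.01,                 # beta=0.01 gave better perf than beta=0.1 (see above)
        loss_type="sigmoid",
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
    )
```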

TODO

Closes #87 #85 #68 #61 #45 #72 #44 #24 #59

@lewtun lewtun requested a review from edbeeching January 4, 2024 22:24
@@ -4,34 +4,35 @@ model_name_or_path: alignment-handbook/zephyr-7b-sft-full
  # Data training arguments
  # For definitions, see: src/h4/training/config.py
  dataset_mixer:
-   HuggingFaceH4/ultrafeedback_binarized: 1.0
+   HuggingFaceH4/ultrafeedback_binarized_fixed: 1.0
lewtun (Member, Author) commented:
Replace with the original source once we fix the dataset:

Suggested change:
-   HuggingFaceH4/ultrafeedback_binarized_fixed: 1.0
+   HuggingFaceH4/ultrafeedback_binarized: 1.0


  # Data training arguments
  dataset_mixer:
-   HuggingFaceH4/ultrafeedback_binarized: 1.0
+   HuggingFaceH4/ultrafeedback_binarized_fixed: 1.0
lewtun (Member, Author) commented:

Suggested change:
-   HuggingFaceH4/ultrafeedback_binarized_fixed: 1.0
+   HuggingFaceH4/ultrafeedback_binarized: 1.0

- torch_dtype: auto
  use_flash_attention_2: true
  model_revision: main
+ torch_dtype: float16
nathan-az (Contributor) commented:

Is it correct that training was done with float16 for the QLoRA training but bfloat16 for full-parameter training? (And if so, any reason for this?)

lewtun (Member, Author) replied:
Yes, that's correct. The main reason for these dtypes is that with 4-bit quantization the other modules will be cast to float16 by default, and I prefer to be explicit about this, while bfloat16 is needed for compatibility with FlashAttention2.

nathan-az (Contributor) commented on Jan 9, 2024:

Thanks @lewtun!

> main reason for these dtypes is that with 4-bit quantization, the other modules will be cast to float16 by default

I assume you are referring to bnb_4bit_compute_dtype being set to bfloat16 in get_quantization_config. Is there merit in making this configurable?

Since the Mistral 7B base is bfloat16 by default, would having a consistent type by also setting compute_dtype to bfloat16 (and torch_dtype to bfloat16) have any benefit?

I'm no expert here (on the memory representation or on how BNB/PEFT work) - my assumption is just that since the two dtypes technically have different dynamic ranges, there may be some benefit to keeping the same compute dtype across training stages (the pretrained base, SFT, and DPO).

lewtun (Member, Author) replied:

> I assume you are referring to bnb_4bit_compute_dtype being set to bfloat16 in get_quantization_config. Is there merit in making this configurable?

Ah yes, I'm referring to this line, and I think it would be good to set this as whatever the torch_dtype is in model_args (with float16 the default).

> Since the Mistral 7B base is bfloat16 by default, would having a consistent type by also setting compute_dtype to bfloat16 (and torch_dtype to bfloat16) have any benefit?

We haven't tested the effect of bfloat16 vs float16 with QLoRA, so once this PR is merged I can run a few experiments to test :)
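
A minimal sketch of what is being proposed here, assuming a get_quantization_config helper that reads a model_args object (the helper's exact shape in the handbook may differ):

```python
import torch
from transformers import BitsAndBytesConfig

def get_quantization_config(model_args):
    """Sketch: derive the 4-bit compute dtype from model_args.torch_dtype,
    falling back to float16 when torch_dtype is unset or "auto"."""
    if not model_args.load_in_4bit:
        return None
    compute_dtype = torch.float16  # the proposed default
    if model_args.torch_dtype not in (None, "auto"):
        compute_dtype = getattr(torch, model_args.torch_dtype)
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_quant_type="nf4",  # illustrative; quant type is not discussed here
    )
```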

  )
- example["text_chosen"] = _strip_prefix(example["text_chosen"], assistant_prefix)
- example["text_rejected"] = _strip_prefix(example["text_rejected"], assistant_prefix)
+ example["text_prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
A commenter asked:
Hi, I noticed some inconsistencies here. The _strip_prefix function and the add_generation_prompt=True option seem to be missing. Is this intended, and if so, why?

lewtun (Member, Author) replied:

Yes, this function was refactored to support multi-turn preference datasets, and in the process I realised we could simplify the logic considerably by extracting the prompt and chosen/rejected responses directly from the list of dicts instead of formatting the string with the chat template.
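
A minimal sketch of that idea, not the handbook's exact implementation (field names follow the ultrafeedback_binarized convention, where "chosen" and "rejected" are full conversations as lists of {"role", "content"} dicts sharing the same prompt turns):

```python
def format_preference_example(example, tokenizer):
    # All turns before the final assistant message form the (multi-turn) prompt;
    # the final assistant message is the chosen / rejected response.
    prompt_messages = example["chosen"][:-1]
    chosen_messages = example["chosen"][-1:]
    rejected_messages = example["rejected"][-1:]
    example["text_prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
    example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
    example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
    return example
```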

Another commenter asked:

Do we have any example of using multi-turn data?

@lewtun lewtun marked this pull request as ready for review January 8, 2024 06:53
@lewtun lewtun changed the title from "[WIP] Update Zephyr configs to account for UltraFeedback & TRL fixes" to "Update Zephyr configs to account for UltraFeedback & TRL fixes" on Jan 8, 2024
edbeeching (Contributor) left a review comment:
Thanks @lewtun, no comments from my side. LGTM

@lewtun lewtun merged commit f0ffa0d into main Jan 10, 2024
3 checks passed
@lewtun lewtun deleted the zephyr-repro branch January 10, 2024 06:42