
Training Interruptions and Epoch Skipping with 6 Billion Parameter Model on 8 A100 GPUs #24

Open
apt-team-018 opened this issue Nov 14, 2023 · 1 comment

@apt-team-018

I attempted to fine-tune a 6-billion-parameter model using 8 A100 GPUs, but the training process encountered interruptions. On the first attempt it stopped at 0.15 epochs; on the second attempt, where I started from 2 epochs, it oddly skipped some epochs, jumping from 0.15 directly to 1, and then stopped at 2.25. For more detailed information, see the WandB run: https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd/

Configs -

Model arguments

model_name_or_path: 01-ai/Yi-6B
model_revision: main
torch_dtype: bfloat16
use_flash_attention_2: false
trust_remote_code: true

Data training arguments

dataset_mixer:
  communityai/apt-chat-micro-dataset-llm-v2-714k: 0.4
dataset_splits:
  - train
  - test
preprocessing_num_workers: 12

SFT trainer config

bf16: true
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 4
gradient_checkpointing: false
hub_model_id: apt-chat-yi-6B-sft-full
hub_strategy: every_save
learning_rate: 0.00002
log_level: info
logging_steps: 50
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 4096
max_steps: -1
num_train_epochs: 2
output_dir: data/apt-chat-yi-6B-sft-full
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: true
remove_unused_columns: true
report_to:
  - wandb
save_strategy: "no"
save_total_limit: null
seed: 42
tf32: true
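
For reference, here is a quick sanity check of the step/epoch accounting this config implies (a rough Python sketch; the dataset size of 285,436 train examples and the step numbers are taken from the training logs below):

# Rough sanity check of the step/epoch accounting implied by the config above.
# The dataset size comes from the "Num examples = 285,436" line in the logs below.

world_size = 8                       # 8 x A100, one process per GPU
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 2
num_train_examples = 285_436

effective_batch = world_size * per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = -(-num_train_examples // effective_batch)   # ceiling division
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch)    # 32     -> matches "Total train batch size ... = 32"
print(steps_per_epoch)    # 8920
print(total_steps)        # 17840  -> matches "Total optimization steps = 17,840"

# Epoch a given optimizer step should correspond to:
for step in (1368, 2736):
    print(step, round(step / steps_per_epoch, 2))
# 1368 -> 0.15  (matches the first evaluation)
# 2736 -> 0.31  (yet the logs below report epoch 1.15 there, and training then stops)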

LOGS -

INFO:root:Using nproc_per_node=8.
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING]
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] *****************************************
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] *****************************************
[2023-11-14 02:09:45,328] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,584] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
[2023-11-14 02:09:45,607] [INFO] [comm.py:637:init_distributed] cdb=None
2023-11-14 02:09:45 - WARNING - main - Process rank: 7, device: cuda:7, n_gpu: 1 distributed training: True, 16-bits training: False
[2023-11-14 02:09:45,646] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,793] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,832] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,834] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-14 02:09:45,835] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
[2023-11-14 02:09:45,864] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-14 02:09:45,908] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead.
warnings.warn(
2023-11-14 02:09:45 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1 distributed training: True, 16-bits training: False
[2023-11-14 02:09:45,939] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-14 02:09:45,939] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
2023-11-14 02:09:45 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
2023-11-14 02:09:45 - INFO - main - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='01-ai/Yi-6B', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', trust_remote_code=True, use_flash_attention_2=False, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False)
2023-11-14 02:09:45 - INFO - main - Data parameters DataArguments(chat_template=None, dataset_mixer={'communityai/apt-chat-micro-dataset-llm-v2-714k': 0.4}, dataset_splits=['train', 'test'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None)
2023-11-14 02:09:45 - INFO - main - Training/evaluation parameters SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=apt-chat-yi-6B-sft-full,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=50,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_seq_length=4096,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=2,
optim=adamw_torch,
optim_args=None,
output_dir=data/apt-chat-yi-6B-sft-full,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=data/apt-chat-yi-6B-sft-full,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2023-11-14 02:09:48 - INFO - main - Training on the following datasets and their proportions: ['train : 285436', 'test : 500']
++++++++++++++++++++++++++++++++++++++
YiTokenizer(name_or_path='01-ai/Yi-6B', vocab_size=64000, model_max_length=4096, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '', 'pad_token': ''}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
2023-11-14 02:09:53 - INFO - main - *** Load pretrained model ***
neftune_noise_alpha - 5.0
training_args - 2023-11-14 02:09:53 - INFO - main - *** Model loaded! ***
neftune_noise_alpha - 5.0
training_args - SFTConfig(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=apt-chat-yi-6B-sft-full,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=5,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa,
logging_first_step=True,
logging_nan_inf_filter=True,
logging_steps=50,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=1.0,
max_seq_length=4096,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=2,
optim=adamw_torch,
optim_args=None,
output_dir=data/apt-chat-yi-6B-sft-full,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=True,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=data/apt-chat-yi-6B-sft-full,
save_on_each_node=False,
save_safetensors=True,
save_steps=500,
save_strategy=no,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=False,
tf32=True,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-14 02:09:53,295 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json
[INFO|configuration_utils.py:717] 2023-11-14 02:09:53,384 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json
[INFO|configuration_utils.py:777] 2023-11-14 02:09:53,386 >> Model config YiConfig {
"_name_or_path": "01-ai/Yi-6B",
"architectures": [
"YiForCausalLM"
],
"auto_map": {
"AutoConfig": "01-ai/Yi-6B--configuration_yi.YiConfig",
"AutoModel": "01-ai/Yi-6B--modeling_yi.YiModel",
"AutoModelForCausalLM": "01-ai/Yi-6B--modeling_yi.YiForCausalLM"
},
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "Yi",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 4,
"pad_token_id": 0,
"rms_norm_eps": 1e-05,
"rope_theta": 5000000.0,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.35.0",
"use_cache": true,
"vocab_size": 64000
}

[INFO|modeling_utils.py:3121] 2023-11-14 02:09:53,499 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/model.safetensors.index.json
[INFO|modeling_utils.py:1222] 2023-11-14 02:09:53,501 >> Instantiating YiForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-14 02:09:53,503 >> Generate config GenerationConfig {
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 0
}

[2023-11-14 02:10:02,797] [INFO] [config.py:972:print] DeepSpeedEngine configuration:
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_enabled .................. False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_params ................... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] bfloat16_enabled ............. True
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3e8d053e50>
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] communication_data_type ...... None
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_params_legacy ..... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_enabled ...... False
[2023-11-14 02:10:02,797] [INFO] [config.py:976:print] dataloader_drop_last ......... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] disable_allgather ............ False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dump_state ................... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_enabled ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_verbose ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] elasticity_enabled ........... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_auto_cast ............... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_enabled ................. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] global_rank .................. 0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] grad_accum_dtype ............. None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_accumulation_steps .. 4
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_clipping ............ 0.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] load_universal_checkpoint .... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] loss_scale ................... 1.0
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] memory_breakdown ............. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_hierarchial_params_gather False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_shard_size .............. -1
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_name ............... None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_params ............. None
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_enabled .................. False
[2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_params ................... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] prescale_gradients ........... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_name ............... None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_params ............. None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_attention ............. None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] steps_per_print .............. inf
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_batch_size ............. 32
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 1
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] use_node_local_storage ....... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] wall_clock_breakdown ......... False
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] weight_quantization_config ... None
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] world_size ................... 8
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_allow_untested_optimizer True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_enabled ................. True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True
[2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_optimization_stage ...... 3
[2023-11-14 02:10:02,799] [INFO] [config.py:962:print_user_config] json = {
"train_batch_size": 32,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 4,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"nvme_path": null
},
"offload_param": {
"device": "none",
"nvme_path": null
},
"stage3_gather_16bit_weights_on_model_save": true
},
"steps_per_print": inf,
"bf16": {
"enabled": true
},
"fp16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
[INFO|trainer.py:1723] 2023-11-14 02:10:02,799 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-14 02:10:02,799 >> Num examples = 285,436
[INFO|trainer.py:1725] 2023-11-14 02:10:02,799 >> Num Epochs = 2
[INFO|trainer.py:1726] 2023-11-14 02:10:02,799 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-14 02:10:02,799 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1730] 2023-11-14 02:10:02,799 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1731] 2023-11-14 02:10:02,799 >> Total optimization steps = 17,840
[INFO|trainer.py:1732] 2023-11-14 02:10:02,801 >> Number of trainable parameters = 6,061,035,520
[INFO|integration_utils.py:718] 2023-11-14 02:10:02,802 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: developer-team018 (neural-network-018). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.16.0
wandb: Run data is saved locally in /workspace/alignment-handbook/wandb/run-20231114_021003-8xmy6gtd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run robust-plasma-26
wandb: ⭐️ View project at https://wandb.ai/neural-network-018/huggingface
wandb: 🚀 View run at https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd
0%| | 0/17840 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-14 02:10:23,512 >> Token indices sequence length is longer than the specified maximum sequence length for this model (6114 > 4096). Running this sequence through the model will result in indexing errors
{'loss': 1.7024, 'learning_rate': 1.9999999844947046e-05, 'epoch': 0.0}
{'loss': 1.1507, 'learning_rate': 1.999961237011484e-05, 'epoch': 0.01}
{'loss': 1.0928, 'learning_rate': 1.9998449510510744e-05, 'epoch': 0.01}
{'loss': 1.0793, 'learning_rate': 1.999651151133954e-05, 'epoch': 0.02}
{'loss': 1.0867, 'learning_rate': 1.999379852284651e-05, 'epoch': 0.02}
{'loss': 1.0857, 'learning_rate': 1.999031075535873e-05, 'epoch': 0.03}
{'loss': 1.0721, 'learning_rate': 1.9986048479268788e-05, 'epoch': 0.03}
{'loss': 1.0923, 'learning_rate': 1.99810120250138e-05, 'epoch': 0.04}
{'loss': 1.0836, 'learning_rate': 1.9975201783049804e-05, 'epoch': 0.04}
{'loss': 1.0769, 'learning_rate': 1.9968618203821487e-05, 'epoch': 0.05}
{'loss': 1.0574, 'learning_rate': 1.9961261797727256e-05, 'epoch': 0.06}
{'loss': 1.042, 'learning_rate': 1.9953133135079686e-05, 'epoch': 0.06}
{'loss': 1.0554, 'learning_rate': 1.9944232846061284e-05, 'epoch': 0.07}
{'loss': 1.0735, 'learning_rate': 1.993456162067566e-05, 'epoch': 0.07}
{'loss': 1.0785, 'learning_rate': 1.992412020869401e-05, 'epoch': 0.08}
{'loss': 1.0654, 'learning_rate': 1.9912909419596993e-05, 'epoch': 0.08}
{'loss': 1.0606, 'learning_rate': 1.9900930122511993e-05, 'epoch': 0.09}
{'loss': 1.0664, 'learning_rate': 1.988818324614572e-05, 'epoch': 0.1}
{'loss': 1.0604, 'learning_rate': 1.9874669778712215e-05, 'epoch': 0.1}
{'loss': 1.0674, 'learning_rate': 1.9860390767856244e-05, 'epoch': 0.11}
{'loss': 1.042, 'learning_rate': 1.984534732057208e-05, 'epoch': 0.11}
{'loss': 1.0452, 'learning_rate': 1.9829540603117667e-05, 'epoch': 0.12}
{'loss': 1.0577, 'learning_rate': 1.9812971840924222e-05, 'epoch': 0.12}
{'loss': 1.0471, 'learning_rate': 1.979564231850122e-05, 'epoch': 0.13}
{'loss': 1.0704, 'learning_rate': 1.977755337933682e-05, 'epoch': 0.13}
{'loss': 1.0282, 'learning_rate': 1.9758706425793702e-05, 'epoch': 0.14}
{'loss': 1.0515, 'learning_rate': 1.973910291900036e-05, 'epoch': 0.15}
{'loss': 1.0548, 'learning_rate': 1.97187443787378e-05, 'epoch': 0.15}
8%|██▌ | 1368/17840 [1:50:57<19:16:41, 4.21s/it][INFO|trainer.py:3158] 2023-11-14 04:01:02,181 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 04:01:02,182 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 04:01:02,182 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s]
3%|█▍ | 2/63 [00:00<00:12, 5.02it/s]
5%|██ | 3/63 [00:00<00:12, 4.76it/s]
6%|██▊ | 4/63 [00:00<00:15, 3.89it/s]
8%|███▍ | 5/63 [00:01<00:16, 3.50it/s]
10%|████▏ | 6/63 [00:01<00:17, 3.30it/s]
11%|████▉ | 7/63 [00:01<00:17, 3.20it/s]
13%|█████▌ | 8/63 [00:02<00:17, 3.12it/s]

                                                                         {'eval_loss': 1.0247304439544678, 'eval_runtime': 4.5889, 'eval_samples_per_second': 108.959, 'eval_steps_per_second': 13.729, 'epoch': 0.15}

8%|██▌ | 1368/17840 [1:51:02<19:16:41, 4.21s/it]
14%|██████▎ | 9/63 [00:02<00:17, 3.14it/s]
{'loss': 0.9636, 'learning_rate': 1.9697632383321755e-05, 'epoch': 1.0}
{'loss': 0.9026, 'learning_rate': 1.96757685694803e-05, 'epoch': 1.01}
{'loss': 0.8808, 'learning_rate': 1.965315463222695e-05, 'epoch': 1.01}
{'loss': 0.8712, 'learning_rate': 1.9629792324729302e-05, 'epoch': 1.02}
{'loss': 0.8967, 'learning_rate': 1.960568345817306e-05, 'epoch': 1.03}
{'loss': 0.8676, 'learning_rate': 1.9580829901621666e-05, 'epoch': 1.03}
{'loss': 0.8723, 'learning_rate': 1.9555233581871366e-05, 'epoch': 1.04}
{'loss': 0.9122, 'learning_rate': 1.9528896483301866e-05, 'epoch': 1.04}
{'loss': 0.8687, 'learning_rate': 1.9501820647722458e-05, 'epoch': 1.05}
{'loss': 0.8726, 'learning_rate': 1.947400817421375e-05, 'epoch': 1.05}
{'loss': 0.8505, 'learning_rate': 1.944546121896493e-05, 'epoch': 1.06}
{'loss': 0.8458, 'learning_rate': 1.9416181995106585e-05, 'epoch': 1.07}
{'loss': 0.8721, 'learning_rate': 1.9386172772539162e-05, 'epoch': 1.07}
{'loss': 0.8676, 'learning_rate': 1.9355435877756957e-05, 'epoch': 1.08}
{'loss': 0.8826, 'learning_rate': 1.9323973693667762e-05, 'epoch': 1.08}
{'loss': 0.8607, 'learning_rate': 1.929178865940815e-05, 'epoch': 1.09}
{'loss': 0.8561, 'learning_rate': 1.925888327015434e-05, 'epoch': 1.09}
{'loss': 0.8687, 'learning_rate': 1.9225260076928783e-05, 'epoch': 1.1}
{'loss': 0.874, 'learning_rate': 1.919092168640239e-05, 'epoch': 1.1}
{'loss': 0.8563, 'learning_rate': 1.915587076069243e-05, 'epoch': 1.11}
{'loss': 0.8445, 'learning_rate': 1.9120110017156172e-05, 'epoch': 1.12}
{'loss': 0.8646, 'learning_rate': 1.908364222818019e-05, 'epoch': 1.12}
{'loss': 0.8479, 'learning_rate': 1.9046470220965457e-05, 'epoch': 1.13}
{'loss': 0.8788, 'learning_rate': 1.9008596877308157e-05, 'epoch': 1.13}
{'loss': 0.9, 'learning_rate': 1.8970025133376252e-05, 'epoch': 1.14}
{'loss': 0.8791, 'learning_rate': 1.893075797948188e-05, 'epoch': 1.14}
{'loss': 0.9254, 'learning_rate': 1.889079845984951e-05, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:25<17:42:31, 4.22s/it][INFO|trainer.py:3158] 2023-11-14 05:52:30,316 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 05:52:30,317 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 05:52:30,317 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s]
3%|█▍ | 2/63 [00:00<00:10, 6.07it/s]
5%|██ | 3/63 [00:00<00:14, 4.20it/s]
6%|██▊ | 4/63 [00:01<00:16, 3.63it/s]
8%|███▍ | 5/63 [00:01<00:17, 3.37it/s]
10%|████▏ | 6/63 [00:01<00:17, 3.23it/s]
11%|████▉ | 7/63 [00:02<00:17, 3.16it/s]
13%|█████▌ | 8/63 [00:02<00:17, 3.06it/s]

{'eval_loss': 1.0676991939544678, 'eval_runtime': 4.5191, 'eval_samples_per_second': 110.641, 'eval_steps_per_second': 13.941, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:30<17:42:31, 4.22s/it]
14%|██████▎ | 9/63 [00:02<00:17, 3.09it/s]
[INFO|trainer.py:1955] 2023-11-14 05:52:34,837 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 13352.0365, 'train_samples_per_second': 42.755, 'train_steps_per_second': 1.336, 'train_loss': 0.9719247023264567, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:30<20:28:20, 4.88s/it]
***** train metrics *****
epoch = 1.15
train_loss = 0.9719
train_runtime = 3:42:32.03
train_samples = 285436
train_samples_per_second = 42.755
train_steps_per_second = 1.336
2023-11-14 05:52:34 - INFO - main - *** Evaluate ***
[INFO|trainer.py:3158] 2023-11-14 05:52:34,843 >> ***** Running Evaluation *****
[INFO|trainer.py:3160] 2023-11-14 05:52:34,843 >> Num examples = 500
[INFO|trainer.py:3163] 2023-11-14 05:52:34,844 >> Batch size = 1
14%|██████▎ | 9/63 [00:02<00:16, 3.23it/s]
***** eval metrics *****
epoch = 1.15
eval_loss = 1.0677
eval_runtime = 0:00:04.48
eval_samples = 500
eval_samples_per_second = 111.451
eval_steps_per_second = 14.043
2023-11-14 05:52:39 - INFO - main - *** Save model ***
[INFO|trainer.py:2881] 2023-11-14 05:52:43,590 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full
[INFO|configuration_utils.py:461] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
[INFO|configuration_utils.py:564] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json
[INFO|modeling_utils.py:2201] 2023-11-14 05:52:51,334 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2428] 2023-11-14 05:52:51,336 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-11-14 05:52:51,337 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json
[INFO|trainer.py:2881] 2023-11-14 05:52:55,599 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full
[INFO|configuration_utils.py:461] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
[INFO|configuration_utils.py:564] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json
[INFO|modeling_utils.py:2201] 2023-11-14 05:53:06,302 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2428] 2023-11-14 05:53:06,303 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-11-14 05:53:06,304 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json
2023-11-14 05:55:20 - INFO - main - Model saved to data/apt-chat-yi-6B-sft-full
[INFO|modelcard.py:452] 2023-11-14 05:55:21,054 >> Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'communityai/apt-chat-micro-dataset-llm-v2-714k', 'type': 'communityai/apt-chat-micro-dataset-llm-v2-714k'}}
[INFO|configuration_utils.py:461] 2023-11-14 05:55:21,057 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json
2023-11-14 05:55:21 - INFO - main - Pushing to hub...

@edbeeching
Contributor

I think the epoch skipping is related to this issue in trl: huggingface/trl#943
I will aim to fix this next week. cc @lewtun
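
If the root cause is indeed a mismatch between the Trainer's step estimate (derived from the raw example count) and the number of packed sequences the dataloader actually yields, the logs above are consistent with it: evaluation fires at step 1,368 and training stops at 2 × 1,368 = 2,736, which looks like two full passes over a much smaller packed dataset. A back-of-the-envelope sketch (illustrative only, an assumed mechanism, not the trl implementation):

# Illustrative only (assumed mechanism, not the trl code): with packing, several raw
# examples are folded into each 4,096-token training sample, so one pass over the data
# takes far fewer optimizer steps than the Trainer planned for.

raw_examples = 285_436            # "Num examples" in the logs
effective_batch = 32              # 8 GPUs * 1 per device * 4 grad accum
planned_steps_per_epoch = -(-raw_examples // effective_batch)        # 8920

observed_steps_per_pass = 1_368   # eval fires at step 1,368; training stops at 2,736
packed_samples_per_pass = observed_steps_per_pass * effective_batch  # 43,776

print(planned_steps_per_epoch)                            # 8920
print(observed_steps_per_pass)                            # 1368
print(round(raw_examples / packed_samples_per_pass, 1))   # ~6.5 raw examples per packed sample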
