
ilab train - stopped but not seeing error and cause #327

Open
acsankar opened this issue Nov 10, 2024 · 0 comments
I'm getting the issue below when training, on an M3 MacBook Pro with 18 GB of RAM. The run stops with `zsh: killed` partway through the first epoch, and I don't see any error message or cause.

(granite1) sankar@Sankars-MacBook-Pro node_datasets_2024-11-10T18_03_53 % ilab train --iters 50
INFO 2024-11-10 19:40:26,446 numexpr.utils:161: NumExpr defaulting to 11 threads.
INFO 2024-11-10 19:40:26,546 datasets:59: PyTorch version 2.4.1 available.
data arguments are:
{"data_path":"/Users/sankar/.local/share/instructlab/datasets/messages_merlinite-7b-lab-Q4_K_M_2024-11-10T18_03_53.jsonl","data_output_path":"/Users/sankar/.local/share/instructlab/internal","max_seq_len":1500,"model_path":"instructlab/granite-7b-lab","chat_tmpl_path":"/Users/sankar/Desktop/Code/granite/test1/granite1/lib/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py","num_cpu_procs":16}
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 2024-11-10 19:40:27,575 root:617: Special tokens: eos: [32000], pad: [32001], bos: [32005], system: [32004], user: [32002], assistant: [32003]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
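As an aside, the tokenizers warning above is harmless and can be silenced by setting the variable before any tokenizer code is imported. A minimal sketch (setting it from Python rather than the shell; `"false"` just disables the Rust-side thread pool):

```python
import os

# Must be set before `tokenizers`/`transformers` are imported, otherwise the
# forked worker processes still see the initialized thread pool and warn.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Equivalently, `export TOKENIZERS_PARALLELISM=false` in the shell before running `ilab train` should have the same effect.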
tokenizing the dataset with instructlab/granite-7b-lab tokenizer...
ten largest length percentiles:
quantile 90th: 79.0
quantile 91th: 81.0
quantile 92th: 82.12
quantile 93th: 86.46000000000004
quantile 94th: 89.0
quantile 95th: 93.94999999999993
quantile 96th: 97.0
quantile 97th: 103.16999999999996
quantile 98th: 110.0
quantile 99th: 124.38999999999999
quantile 100th: 171.0

at 1500 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 19.0
quantile 1th: 21.0
quantile 2th: 21.22
quantile 3th: 22.0
quantile 4th: 22.0
quantile 5th: 23.0
quantile 6th: 23.0
quantile 7th: 23.0
quantile 8th: 23.0
quantile 9th: 24.0
quantile 10th: 24.0
at 20 min sequence length, the number of samples to be dropped is 1
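For context, the percentile report and drop counts above come from a simple length filter over the tokenized samples. A rough sketch of that logic (the sample lengths here are illustrative values taken from the log, not InstructLab's actual code):

```python
from statistics import quantiles

# Hypothetical token lengths for a handful of tokenized samples,
# loosely matching the percentiles printed in the log above.
lengths = [19, 21, 24, 49, 79, 93, 124, 171]

MIN_LEN, MAX_LEN = 20, 1500

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pct = quantiles(lengths, n=100)
print(f"90th percentile length: {pct[89]:.1f}")

# Samples shorter than the minimum or longer than the maximum are dropped.
dropped = [length for length in lengths if length < MIN_LEN or length > MAX_LEN]
print(f"dropped {len(dropped)} of {len(lengths)} samples")
```

With a 1500-token maximum nothing is dropped from the high end, which matches the "0 samples dropped" line above; only the one sample under the 20-token minimum goes.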
checking the validity of the samples...
INFO 2024-11-10 19:40:28,470 root:617: number of dropped samples: 1 -- out of 562
Categorizing training data type...
unmasking the appropriate message content...
Samples Previews...

Original Input: <|user|>
How many key features does InfyBILL have?
<|assistant|>
' 'InfyBILL has six key features, which are Client Management, Policy Management, Billing Operations, Payment Processing, Reports and Analytics, and Administration.'<|endoftext|>

Instruction ex sample 1:
' 'InfyBILL has six key features, which are Client Management, Policy Management, Billing Operations, Payment Processing, Reports and Analytics, and Administration.'<|endoftext|>
Original Input: <|user|>
What is the name of the database used for storing client-related tables in InfyBill?
<|assistant|>
'<|endoftext|>

Instruction ex sample 2:
'<|endoftext|>
Original Input: <|user|>
What is the command to navigate to the Payment Processing module in InfyBill?
<|assistant|>
'<|endoftext|>

Instruction ex sample 3:
'<|endoftext|>
Creating json from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 192.59ba/s]
Generating train split: 561 examples [00:00, 58493.16 examples/s]
Map (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 561/561 [00:00<00:00, 5296.45 examples/s]
INFO 2024-11-10 19:40:32,029 instructlab.model.full_train:102: avg_sample_len: 49.08912655971479
effective_batch_size: 10
max_batch_len: 5000
packing_max_batch_len: 490
grad_accum: 1
num_batches: 103
avg_samples_per_batch: 5.446601941747573
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:34<00:00, 11.49s/it]
WARNING 2024-11-10 19:41:07,307 instructlab.model.full_train:133: There is a mismatch between bos token id of model (1) and tokenizer (32005). These tokens denote the start of a sequence of data. Fixing model bos token id to be same as tokenizer's bos token id.
WARNING 2024-11-10 19:41:07,307 instructlab.model.full_train:142: There is a mismatch between eos token id of model (2) and tokenizer (32000). These tokens denote the end of a sequence of data. Fixing model eos token id to be same as tokenizer's eos token id.
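Those two warnings look benign: the trainer reconciles the model config with the tokenizer, roughly equivalent to the following sketch (the `bos_token_id`/`eos_token_id` names are standard `transformers` config fields, but the stand-in classes here are mine, not InstructLab's code):

```python
# Minimal stand-ins for a HF model config and tokenizer, to show the fix.
class ModelConfig:
    bos_token_id = 1      # model's original BOS (as in the warning)
    eos_token_id = 2      # model's original EOS

class Tokenizer:
    bos_token_id = 32005  # tokenizer's BOS (from the special-tokens log line)
    eos_token_id = 32000  # tokenizer's EOS

config, tok = ModelConfig(), Tokenizer()

# Align the model with the tokenizer so sequences start and stop on the
# same token ids the training data was encoded with.
if config.bos_token_id != tok.bos_token_id:
    config.bos_token_id = tok.bos_token_id
if config.eos_token_id != tok.eos_token_id:
    config.eos_token_id = tok.eos_token_id

print(config.bos_token_id, config.eos_token_id)
```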
INFO 2024-11-10 19:41:07,309 instructlab.model.full_train:155: Total RAM: 18.00 GB
Epoch 0: 0%| | 0/103 [00:00<?, ?it/s]INFO 2024-11-10 19:43:02,459 instructlab.model.full_train:214:
Epoch: 0, Step: 1, Rank: 0, loss = 3.0850918292999268
/Users/sankar/Desktop/Code/granite/test1/granite1/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
zsh: killed ilab train --iters 50
(granite1) sankar@Sankars-MacBook-Pro node_datasets_2024-11-10T18_03_53 % /opt/homebrew/Cellar/[email protected]/3.11.10/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
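`zsh: killed` with no Python traceback usually means the process received SIGKILL from macOS under memory pressure: full fine-tuning a 7B-parameter model typically needs far more than 18 GB of unified memory, and the leaked-semaphore warning is a common side effect of workers being killed abruptly. One way to confirm memory is the culprit is to log peak RSS right before the heavy step. A sketch using only the standard library (note `ru_maxrss` is reported in bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_gib() -> float:
    """Peak resident set size of this process, in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    if sys.platform == "darwin":
        return rss / 1024**3
    return rss / 1024**2

print(f"peak RSS so far: {peak_rss_gib():.2f} GiB")
```

If memory is indeed the cause, the usual mitigations are a smaller model, a parameter-efficient (LoRA/QLoRA-style) training path, or a machine with more RAM.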
