
ilab train - stopped but not seeing error and cause #327

Open
acsankar opened this issue Nov 10, 2024 · 0 comments
I'm getting the issue below when training, on an M3 MacBook Pro with 18 GB of RAM. The run stops with `zsh: killed` partway through the first epoch, and I don't see any error message or cause.

(granite1) sankar@Sankars-MacBook-Pro node_datasets_2024-11-10T18_03_53 % ilab train --iters 50
INFO 2024-11-10 19:40:26,446 numexpr.utils:161: NumExpr defaulting to 11 threads.
INFO 2024-11-10 19:40:26,546 datasets:59: PyTorch version 2.4.1 available.
data arguments are:
{"data_path":"/Users/sankar/.local/share/instructlab/datasets/messages_merlinite-7b-lab-Q4_K_M_2024-11-10T18_03_53.jsonl","data_output_path":"/Users/sankar/.local/share/instructlab/internal","max_seq_len":1500,"model_path":"instructlab/granite-7b-lab","chat_tmpl_path":"/Users/sankar/Desktop/Code/granite/test1/granite1/lib/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py","num_cpu_procs":16}
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 2024-11-10 19:40:27,575 root:617: Special tokens: eos: [32000], pad: [32001], bos: [32005], system: [32004], user: [32002], assistant: [32003]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
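As an aside, the tokenizers warning above is harmless and can be silenced by setting the variable before any tokenizer code is imported. A minimal sketch (setting it from Python rather than the shell; `"false"` just disables the Rust-side thread pool):

```python
import os

# Must be set before `tokenizers`/`transformers` are imported, otherwise the
# forked worker processes still see the initialized thread pool and warn.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Equivalently, `export TOKENIZERS_PARALLELISM=false` in the shell before running `ilab train` should have the same effect.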
tokenizing the dataset with instructlab/granite-7b-lab tokenizer...
ten largest length percentiles:
quantile 90th: 79.0
quantile 91th: 81.0
quantile 92th: 82.12
quantile 93th: 86.46000000000004
quantile 94th: 89.0
quantile 95th: 93.94999999999993
quantile 96th: 97.0
quantile 97th: 103.16999999999996
quantile 98th: 110.0
quantile 99th: 124.38999999999999
quantile 100th: 171.0

at 1500 max sequence length, the number of samples to be dropped is 0
(0.00% of total)
quantile 0th: 19.0
quantile 1th: 21.0
quantile 2th: 21.22
quantile 3th: 22.0
quantile 4th: 22.0
quantile 5th: 23.0
quantile 6th: 23.0
quantile 7th: 23.0
quantile 8th: 23.0
quantile 9th: 24.0
quantile 10th: 24.0
at 20 min sequence length, the number of samples to be dropped is 1
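For context, the percentile report and drop counts above come from a simple length filter over the tokenized samples. A rough sketch of that logic (the sample lengths here are illustrative values taken from the log, not InstructLab's actual code):

```python
from statistics import quantiles

# Hypothetical token lengths for a handful of tokenized samples,
# loosely matching the percentiles printed in the log above.
lengths = [19, 21, 24, 49, 79, 93, 124, 171]

MIN_LEN, MAX_LEN = 20, 1500

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pct = quantiles(lengths, n=100)
print(f"90th percentile length: {pct[89]:.1f}")

# Samples shorter than the minimum or longer than the maximum are dropped.
dropped = [length for length in lengths if length < MIN_LEN or length > MAX_LEN]
print(f"dropped {len(dropped)} of {len(lengths)} samples")
```

With a 1500-token maximum nothing is dropped from the high end, which matches the "0 samples dropped" line above; only the one sample under the 20-token minimum goes.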
checking the validity of the samples...
INFO 2024-11-10 19:40:28,470 root:617: number of dropped samples: 1 -- out of 562
Categorizing training data type...
unmasking the appropriate message content...
Samples Previews...

Original Input: <|user|>
How many key features does InfyBILL have?
<|assistant|>
' 'InfyBILL has six key features, which are Client Management, Policy Management, Billing Operations, Payment Processing, Reports and Analytics, and Administration.'<|endoftext|>

Instruction ex sample 1:
' 'InfyBILL has six key features, which are Client Management, Policy Management, Billing Operations, Payment Processing, Reports and Analytics, and Administration.'<|endoftext|>
Original Input: <|user|>
What is the name of the database used for storing client-related tables in InfyBill?
<|assistant|>
'<|endoftext|>

Instruction ex sample 2:
'<|endoftext|>
Original Input: <|user|>
What is the command to navigate to the Payment Processing module in InfyBill?
<|assistant|>
'<|endoftext|>

Instruction ex sample 3:
'<|endoftext|>
Creating json from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 192.59ba/s]
Generating train split: 561 examples [00:00, 58493.16 examples/s]
Map (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 561/561 [00:00<00:00, 5296.45 examples/s]
INFO 2024-11-10 19:40:32,029 instructlab.model.full_train:102: avg_sample_len: 49.08912655971479
effective_batch_size: 10
max_batch_len: 5000
packing_max_batch_len: 490
grad_accum: 1
num_batches: 103
avg_samples_per_batch: 5.446601941747573
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:34<00:00, 11.49s/it]
WARNING 2024-11-10 19:41:07,307 instructlab.model.full_train:133: There is a mismatch between bos token id of model (1) and tokenizer (32005). These tokens denote the start of a sequence of data. Fixing model bos token id to be same as tokenizer's bos token id.
WARNING 2024-11-10 19:41:07,307 instructlab.model.full_train:142: There is a mismatch between eos token id of model (2) and tokenizer (32000). These tokens denote the end of a sequence of data. Fixing model eos token id to be same as tokenizer's eos token id.
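Those two warnings look benign: the trainer reconciles the model config with the tokenizer, roughly equivalent to the following sketch (the `bos_token_id`/`eos_token_id` names are standard `transformers` config fields, but the stand-in classes here are mine, not InstructLab's code):

```python
# Minimal stand-ins for a HF model config and tokenizer, to show the fix.
class ModelConfig:
    bos_token_id = 1      # model's original BOS (as in the warning)
    eos_token_id = 2      # model's original EOS

class Tokenizer:
    bos_token_id = 32005  # tokenizer's BOS (from the special-tokens log line)
    eos_token_id = 32000  # tokenizer's EOS

config, tok = ModelConfig(), Tokenizer()

# Align the model with the tokenizer so sequences start and stop on the
# same token ids the training data was encoded with.
if config.bos_token_id != tok.bos_token_id:
    config.bos_token_id = tok.bos_token_id
if config.eos_token_id != tok.eos_token_id:
    config.eos_token_id = tok.eos_token_id

print(config.bos_token_id, config.eos_token_id)
```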
INFO 2024-11-10 19:41:07,309 instructlab.model.full_train:155: Total RAM: 18.00 GB
Epoch 0: 0%| | 0/103 [00:00<?, ?it/s]INFO 2024-11-10 19:43:02,459 instructlab.model.full_train:214:
Epoch: 0, Step: 1, Rank: 0, loss = 3.0850918292999268
/Users/sankar/Desktop/Code/granite/test1/granite1/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(ctx.cpu_autocast_kwargs): # type: ignore[attr-defined]
zsh: killed ilab train --iters 50
(granite1) sankar@Sankars-MacBook-Pro node_datasets_2024-11-10T18_03_53 % /opt/homebrew/Cellar/[email protected]/3.11.10/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 21 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
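`zsh: killed` with no Python traceback usually means the process received SIGKILL from macOS under memory pressure: full fine-tuning a 7B-parameter model typically needs far more than 18 GB of unified memory, and the leaked-semaphore warning is a common side effect of workers being killed abruptly. One way to confirm memory is the culprit is to log peak RSS right before the heavy step. A sketch using only the standard library (note `ru_maxrss` is reported in bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_gib() -> float:
    """Peak resident set size of this process, in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    if sys.platform == "darwin":
        return rss / 1024**3
    return rss / 1024**2

print(f"peak RSS so far: {peak_rss_gib():.2f} GiB")
```

If memory is indeed the cause, the usual mitigations are a smaller model, a parameter-efficient (LoRA/QLoRA-style) training path, or a machine with more RAM.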
