k2SSL: a Faster and Better Framework for Self-Supervised Speech Representation Learning #1500
For LibriSpeech, we directly use the k-means labels from hubert_base_ls960.
Yes, you can replace the ConvFeatureExtractionModel with the Conv2dSubsampling. |
@yfyeung What is the approximate increase in WER, training time, and inference time if k2SSL is used with, say, HuBERT base? |
checkpoint convert script
Guys, I just noticed this, it seems like a great contribution. |
Hi there @yfyeung, first of all thank you for creating this SSL recipe! I tried running your zipformer/ code, but my model diverged at epoch 33 and pretraining ended with a
Throughout pretraining before the divergence, I noticed my
Did you face the same issues?
EDIT: I was also wondering if you tried toggling the loss reduction to
My commands. I adapted the batch size to my setup, maintaining the same
# pretraining
python zipformer/pretrain.py \
--world-size 4 \
--use-fp16 1 \
--num-epochs 50 \
--manifest-dir data/raw \
--max-duration 350 \
--accum-grad 2 \
--exp-dir zipformer/exp2/pretrain
As per your explanation, I used the same 500 k-means labels from |
Hi, hope this message finds you well. My training command is as follows:
./zipformer/pretrain.py \
--world-size 8 \
--num-epochs 291 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp_pretrain \
--full-libri 1 \
--max-duration 600 \
--accum-grad 1 \
--do-normalize 0 \
--mask-prob 0.8 \
--dropout-input 0.1 \
--dropout-features 0.1 \
--feature-grad-mult 0.1 \
--untie-final-proj 1 \
--num-encoder-layers 2,2,3,4,3,2 \
--feedforward-dim 512,768,1024,1536,1024,768 \
--encoder-dim 192,256,448,768,448,192 \
--encoder-unmasked-dim 192,192,256,256,256,192 \
--base-lr 0.045
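On the sum-versus-mean loss reduction question raised earlier in the thread, here is a minimal pure-Python sketch of what the two reductions do to the gradient. It uses a toy squared-error loss on a scalar weight, not the recipe's actual pre-training loss, and the function and variable names are made up for illustration. The only point is that with sum reduction the gradient grows with the number of frames in the batch, while with mean reduction it is divided by that count.

```python
# Toy illustration (not the recipe's loss): squared error of a 1-D linear model.


def grad_wrt_weight(w, xs, ys, reduction):
    """Gradient of the squared-error loss with respect to the scalar weight w."""
    per_frame = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]
    total = sum(per_frame)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(per_frame)
    raise ValueError(f"unknown reduction: {reduction}")


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

g_sum = grad_wrt_weight(w, xs, ys, "sum")
g_mean = grad_wrt_weight(w, xs, ys, "mean")

# The ratio equals the number of frames in the batch.
print(g_sum, g_mean, g_sum / g_mean)
```

Under this toy setup, changing the batch size (for example via --max-duration) changes the gradient magnitude under sum reduction but not under mean reduction, so the learning rate may need retuning when switching between them.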
Regarding your question about toggling the loss reduction to mean instead of sum to stabilize training: the mean reduction is typically used for multi-GPU simulations to ensure uniform scaling, while sum reduction is preferred for larger batch sizes as it helps stabilize the gradient estimate. It’s not a good way to optimize for both large batch sizes and multi-GPU setups simultaneously.
Fine-tuning command is:
./zipformer/finetune.py \
--world-size 8 \
--num-epochs 222 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp_finetune \
--pretrained-dir zipformer/exp_pretrain/epoch-291.pt \
--full-libri 0 \
--max-duration 600 \
--accum-grad 1 \
--do-normalize 0 \
--mask-prob 0.65 \
--mask-channel-prob 0.5 \
--mask-channel-length 64 \
--feature-grad-mult 0.0 \
--num-encoder-layers 2,2,3,4,3,2 \
--feedforward-dim 512,768,1024,1536,1024,768 \
--encoder-dim 192,256,448,768,448,192 \
--encoder-unmasked-dim 192,192,256,256,256,192 \
--base-lr 0.002
Decoding uses greedy search to identify the top K candidates based on two key parameters, --epoch and --avg:
for ((epoch=100; epoch<=222; epoch+=1)); do
for ((avg=1; avg<=$epoch-1; avg+=1)); do
./zipformer/decode.py \
--epoch $epoch \
--avg $avg \
--exp-dir ./zipformer/exp_finetune \
--do-normalize 0 \
--max-duration 1000 \
--decoding-method greedy_search \
--num-encoder-layers 2,2,3,4,3,2 \
--feedforward-dim 512,768,1024,1536,1024,768 \
--encoder-dim 192,256,448,768,448,192 \
--encoder-unmasked-dim 192,192,256,256,256,192
done
done
Then use modified beam search on these top K candidates:
epoch=
avg=
./zipformer/decode.py \
--epoch $epoch \
--avg $avg \
--exp-dir ./zipformer/exp_finetune \
--do-normalize 0 \
--max-duration 1000 \
--decoding-method modified_beam_search \
--beam-size 8 \
--num-encoder-layers 2,2,3,4,3,2 \
--feedforward-dim 512,768,1024,1536,1024,768 \
--encoder-dim 192,256,448,768,448,192 \
--encoder-unmasked-dim 192,192,256,256,256,192 |
I see! Thanks for the explanation! Meanwhile, can you share your finetuning and decoding commands as well? |
Sure, I updated my comment. You can perform pruning in the process of searching the decoding space. |
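As a rough sketch of the two-stage selection described above: the greedy-search loop covers every (epoch, avg) pair with epoch in 100..222 and avg in 1..epoch-1, and only the best few pairs are then re-decoded with modified beam search. The WER values, the helper name `top_k_candidates`, and the choice of K below are hypothetical and only illustrate the bookkeeping. They are not results from this recipe.

```python
import heapq

# Size of the greedy-search grid: epoch runs 100..222, avg runs 1..epoch-1.
grid_size = sum(epoch - 1 for epoch in range(100, 223))


def top_k_candidates(wer_by_setting, k):
    """Return the k (epoch, avg) pairs with the lowest WER, best first."""
    return heapq.nsmallest(k, wer_by_setting, key=wer_by_setting.get)


# Hypothetical greedy-search WERs (percent), made up for illustration only.
greedy_wer = {
    (200, 50): 5.3,
    (210, 80): 5.1,
    (222, 100): 5.0,
    (150, 30): 5.9,
}

# These pairs would then be passed to decode.py with modified_beam_search.
shortlist = top_k_candidates(greedy_wer, k=2)
print(grid_size, shortlist)
```

The grid is large, which is why pruning it, for instance by skipping avg values that are clearly too small or too large, can save a lot of decoding time.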
@teowenshen is there any chance you can run from --start-epoch=33 with the --inf-check=True option, assuming pretrain.py supports these options like train.py, and show us the log? If the options are not there we should add them. I want to see where the inf grad is coming from; maybe we can fix it with more info. |
Also, @yfyeung we normally have a README.md and/or RESULTS.md that show typical sequences of training and testing commands, and associated results. Is there any chance of adding those? |
Yes, please find the logs for epoch 33 as attached. librispeech_SSL_zipformer_pretrain_ep33_infcheck.txt I couldn't run
|
For diagnostics, you need to disable fp16 and halve the batch size.
…On Friday, April 12, 2024, Teo Wen Shen ***@***.***> wrote:
I want to see where the inf grad is coming from, maybe we can fix it with
more info.
Yes, please find the logs for epoch 33 as attached.
librispeech_SSL_zipformer_pretrain_ep33_infcheck.txt
<https://github.com/k2-fsa/icefall/files/14959129/librispeech_SSL_zipformer_pretrain_ep33_infcheck.txt>
I couldn't run --print-diagnostics 1 due to this error:
Error getting eigenvalues, trying another method.
Error getting eigenvalues, trying another method.
Error getting eigenvalues, trying another method.
Error getting eigenvalues, trying another method.
/workspace/icefall/icefall/diagnostics.py:255: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/EmptyTensor.cpp:31.)
eigs, _ = torch.linalg.eig(stats)
/workspace/icefall/icefall/diagnostics.py:255: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/EmptyTensor.cpp:31.)
eigs, _ = torch.linalg.eig(stats)
/workspace/icefall/icefall/diagnostics.py:255: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/EmptyTensor.cpp:31.)
eigs, _ = torch.linalg.eig(stats)
/workspace/icefall/icefall/diagnostics.py:255: UserWarning: ComplexHalf support is experimental and many operators don't support it yet. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/EmptyTensor.cpp:31.)
eigs, _ = torch.linalg.eig(stats)
Traceback (most recent call last):
File "/mnt/host/icefall-k2ssl/egs/librispeech/SSL/zipformer/pretrain.py", line 1380, in <module>
main()
File "/mnt/host/icefall-k2ssl/egs/librispeech/SSL/zipformer/pretrain.py", line 1371, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/workspace/icefall/icefall/diagnostics.py", line 248, in print_diagnostics
eigs, _ = torch.linalg.eigh(stats)
RuntimeError: "linalg_eigh_cuda" not implemented for 'Half'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/mnt/host/icefall-k2ssl/egs/librispeech/SSL/zipformer/pretrain.py", line 1276, in run
diagnostic.print_diagnostics()
File "/workspace/icefall/icefall/diagnostics.py", line 517, in print_diagnostics
self.diagnostics[k].print_diagnostics()
File "/workspace/icefall/icefall/diagnostics.py", line 255, in print_diagnostics
eigs, _ = torch.linalg.eig(stats)
RuntimeError: torch.linalg.eig: input tensor should not contain infs or NaNs.
|
The error was unusual: it was an infinity in the forward pass. This is because you used the wav2vec2 frontend, which doesn't have any balancers or similar code to stop large values from appearing. ScaledAdam can make large values appear faster than Adam would, although even with Adam they'll appear eventually unless steps are taken to stop them.
Anyway, this PR |
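To make the "infinity in the forward pass" concrete, here is a small stdlib-only sketch. It assumes the infinity comes from half-precision overflow, since the run used --use-fp16 1, which the thread does not state explicitly. The helper name `fits_in_fp16` is made up for illustration. It relies on the `struct` module's `e` format (IEEE binary16), which raises `OverflowError` for values too large to represent.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value


def fits_in_fp16(value):
    """Return True if value can be stored as a finite half-precision float."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<e", value)  # "e" is the binary16 format code
    except OverflowError:
        return False
    return True


# An activation that drifts past FP16_MAX can no longer be represented.
for activation in (1000.0, FP16_MAX, 70000.0):
    print(activation, fits_in_fp16(activation))
```

Anything that keeps activations well below this limit would avoid this particular failure mode. The balancers mentioned above serve that purpose in the Zipformer code, but they are not reproduced here.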
Sure, I will add those after the anonymity period ends, including the model checkpoint/tensorboard/pre-training logs/fine-tuning logs/decoding logs, and RESULTS.md. And if things go well, also a link to the paper. |
How did you prepare the input data i.e. the manifest dir for zipformer pretrain? |
I add the |
Hi, check out the code in dataset https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/SSL/hubert/dataset.py#L80-L85 for the specific format. It only has one more field compared with the wav CutSet. |
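A minimal sketch of what "one more field" could look like, using a plain dict in place of a real cut. The field names (`custom`, `kmeans`), the space-separated encoding, and the helper `parse_kmeans_labels` are assumptions for illustration; the linked dataset.py lines are the authoritative reference for the actual format. The cluster count of 500 is taken from the thread.

```python
# Hypothetical cut record: a plain dict standing in for one CutSet entry.
# The "custom"/"kmeans" names and the string encoding are assumptions.
cut = {
    "id": "example-utt-0",
    "custom": {"kmeans": "17 17 204 204 499"},
}


def parse_kmeans_labels(cut, num_clusters=500):
    """Turn the space-separated label string into a list of cluster ids."""
    labels = [int(token) for token in cut["custom"]["kmeans"].split()]
    for label in labels:
        if not 0 <= label < num_clusters:
            raise ValueError(f"label {label} outside [0, {num_clusters})")
    return labels


labels = parse_kmeans_labels(cut)
print(labels)
```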
Hi @yfyeung
How did you obtain the kmeans? Is there any codebase available? |
The same way as in fairseq: simple kmeans. |
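For readers unfamiliar with that pipeline: it typically extracts frame-level features, fits k-means on them, and then labels every frame with its nearest centroid. The sketch below covers only the last step in pure Python. The centroids, frames, and function names are made up for illustration, and real features would have many more dimensions and clusters.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid closest to frame in squared Euclidean distance."""

    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def label_frames(frames, centroids):
    """Assign each feature frame the id of its nearest centroid."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy 2-D data; a real setup would use e.g. 500 centroids over acoustic features.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3), (0.8, 0.7)]

labels = label_frames(frames, centroids)
print(labels)
```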
In this PR, we decoupled HuBERT from fairseq, making it independent from the fairseq library while maintaining full equivalence with the original pre-training logic (model architecture, data normalization, masking strategy, loss computation...). We conducted comparisons on the outputs of some layers to ensure this equivalence. Additionally, we support the checkpoints from fairseq (hubert_base_ls960, hubert_large_ll60k, hubert_xtralarge_ll60k).
Then, we optimized the pre-training loss, significantly reducing peak memory usage and even slightly enhancing performance. Unfortunately, this improvement made half-precision training of the original HuBERT unstable. We adopted ScaledAdam as the optimizer and Eden as the scheduler, and replaced the Transformer encoder with the Zipformer encoder. This approach further reduced peak memory usage and enhanced performance while maintaining stability in half-precision.