
[Help Wanted] Training from scratch on 1000 hours of Spanish does not work #565

Closed
rlenain opened this issue Dec 2, 2024 · 8 comments
Labels
help wanted Extra attention is needed

Comments

@rlenain

rlenain commented Dec 2, 2024

Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

Linux, Python=3.10

Steps to Reproduce

I ran finetune_cli.py with --finetune False (i.e. training from scratch) on 1000 hours of Spanish data, and even after 500k steps I am still not getting intelligible speech out. The output sometimes sounds like the original speaker from the prompt, but the words being uttered are complete gibberish.

Any help on this?

✔️ Expected Behavior

Intelligible Spanish speech.

❌ Actual Behavior

Gibberish

@rlenain rlenain added the help wanted Extra attention is needed label Dec 2, 2024
@SWivid
Owner

SWivid commented Dec 2, 2024

Need more info, e.g. the detailed configuration of the training setup.

@rlenain
Author

rlenain commented Dec 2, 2024

The command I run (I have made a few changes to the repo around how experiment names are passed, etc., nothing to do with the actual training) is:

CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --main_process_port 29501 finetune-cli.py \
    --model_name F5TTS_Base --exp_name F5TTS_Base-FromScratch_1khrs_esLA --learning_rate 1e-05 \
    --batch_size_per_gpu 20000 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 \
    --max_grad_norm 1 --epochs 500 --num_warmup_updates 10000 --save_per_updates 100000 \
    --last_per_steps 5000 --dataset_name 1000hours_esLA_fromL --finetune False --tokenizer char

I run on 8*A100 GPUs

Audio sounds like this after 500k steps: https://whyp.it/tracks/231435/gibberish?token=TQ1fK

When I run the exact same setup but with --finetune True, it works fairly well and I get good Spanish speech out.

@rlenain rlenain changed the title [Help Wanted] [Help Wanted] Training from scratch on 1000 hours of Spanish does not work Dec 2, 2024
@SWivid
Owner

SWivid commented Dec 2, 2024

So it is actually 4*A100 (CUDA_VISIBLE_DEVICES=4,5,6,7).
Though I think the learning rate is too small for training from scratch.
Would recommend the same settings as in our paper or train.py (if you have pulled the latest repo, the config of the base model is under the config/ directory).

Also you may refer to #548, as you are using a 1000-hour dataset.
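For reference, a minimal sketch of the optimizer-related fields such a base-model config might contain. The field names and values below are illustrative assumptions, not copied from the repo; verify against the actual YAML under the config/ directory in a current checkout:

    # hypothetical config excerpt for a from-scratch run; values are assumptions
    optim:
      learning_rate: 7.5e-5      # far above the 1e-05 used in the command above
      num_warmup_updates: 20000  # longer warmup than the 10000 used above
      grad_accumulation_steps: 1
      max_grad_norm: 1.0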

@rlenain
Author

rlenain commented Dec 3, 2024

Thank you, changing the learning rate and increasing the number of warmup updates have helped.
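For readers landing here with the same symptom, a hedged sketch of what the adjusted invocation might look like: it reuses rlenain's command above with only the learning rate and warmup changed, and those two new values are illustrative assumptions, not the exact ones used.

    # same command as above; only --learning_rate and --num_warmup_updates differ (assumed values)
    CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch --main_process_port 29501 finetune-cli.py \
        --model_name F5TTS_Base --exp_name F5TTS_Base-FromScratch_1khrs_esLA \
        --learning_rate 7.5e-5 --num_warmup_updates 20000 \
        --batch_size_per_gpu 20000 --batch_size_type frame --max_samples 64 \
        --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 500 \
        --save_per_updates 100000 --last_per_steps 5000 \
        --dataset_name 1000hours_esLA_fromL --finetune False --tokenizer char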

@tuanh123789

Hi @rlenain, can you share the dataset?

@rlenain
Author

rlenain commented Dec 4, 2024

Unfortunately I cannot

@Federico1666

Unfortunately I cannot

Can you share the final model? I need a good Spanish model, and the one that is available was only trained on 250 hours.

@SWivid SWivid closed this as completed Jan 5, 2025
@ukemamaster

@rlenain Is your data in LJSpeech style? Which recipe exactly did you use for your training? Can you share your training configuration?
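For context, an LJSpeech-style dataset is conventionally a wavs/ folder plus a pipe-separated metadata.csv. A minimal sketch follows; the file names and transcripts are made up, and column conventions vary between recipes (LJSpeech itself uses id|raw_text|normalized_text):

    dataset_root/
        metadata.csv            # one line per clip: file_id|transcription
        wavs/
            clip_0001.wav
            clip_0002.wav

    # metadata.csv
    clip_0001|Hola, ¿cómo estás?
    clip_0002|Buenos días a todos.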
