I have generated pretraining data using the steps given in this repo.

I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data alone took one month!

Each tf.record file has a meta_data file associated with it. Since run_pretraining.py requires a single meta_data file, I summed the train_data_size values from all the per-shard meta_data files into one final meta_data file (a sketch of how I merged them follows).
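For reference, a minimal sketch of the merge. This assumes each per-shard meta_data file is plain JSON with a train_data_size key; the glob pattern is hypothetical, so adjust paths and keys to match what create_pretraining_data actually wrote for you:

```python
import glob
import json

# Assumption: each shard's meta_data is plain JSON containing
# train_data_size; the path pattern below is illustrative only.
shard_paths = glob.glob("pretraining_output/*.meta_data")

with open(shard_paths[0]) as f:
    merged = json.load(f)  # keep the other fields from one shard

merged["train_data_size"] = 0
for path in shard_paths:
    with open(path) as f:
        merged["train_data_size"] += json.load(f)["train_data_size"]

with open("meta_data", "w") as f:
    json.dump(merged, f)

print(merged["train_data_size"])  # 596972848 for my 22 GB Hindi corpus
```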
In run_pretraining.py, the number of training steps is calculated as:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

My total_train_examples is 596972848, so with a batch size of 64 and only 1 epoch I get num_train_steps = 9327700. But the README here uses num_train_steps=125000, and I don't understand what went wrong. With that many steps it will take forever to train ALBERT; even at a batch size of 512 with 1 epoch, it is still 1165962 steps!
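To make the scale concrete, here is the same arithmetic in plain Python. The batch size of 4096 used for comparison is the one reported in the ALBERT paper, not something taken from this repo's README:

```python
total_train_examples = 596_972_848  # summed train_data_size from above

def num_train_steps(train_batch_size, num_train_epochs=1):
    # Same formula run_pretraining.py applies.
    return int(total_train_examples / train_batch_size) * num_train_epochs

print(num_train_steps(64))   # 9327700
print(num_train_steps(512))  # 1165962

# The ALBERT paper trains for 125000 steps at batch size 4096,
# i.e. 125000 * 4096 = 512M examples seen. Against this corpus:
print(125_000 * 4096 / total_train_examples)  # ~0.86 "epochs"
```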
Since ALBERT was trained on a very large corpus, why are there only 125000 steps?

I would also like to know how many epochs the English ALBERT training corresponds to.

Can anyone suggest what went wrong and what I should do now?