
The train loss cannot converge #7

Open
fengshikun opened this issue Jul 16, 2024 · 14 comments
Labels: question (Further information is requested)

Comments

fengshikun commented Jul 16, 2024

Hello, I've been attempting to train the score model using the command from the README file. However, I've noticed that the loss doesn't seem to converge. Could you please help me investigate which part might be going wrong?

[attached: two screenshots of the training loss curves]
plainerman (Owner)

We experienced similar behavior when training the model, which is why we optimized for rmsd_lt2 instead. You should see that metric increase. Do you?

plainerman added the "question" label on Jul 16, 2024
fengshikun (Author)

Sorry, I just checked the logs and noticed that Val inference rmsds_lt2 consistently remains zero. Additionally, the validation loss shows fluctuating and abnormal values such as:

Epoch 19: Validation loss 42793415.1111  tr 171172360.1270   rot 1286.8611   tor 0.9821   sc_tor 0.9899
Epoch 20: Validation loss 330176498.0680  tr 1320697075.8095   rot 8896.0845   tor 0.9705   sc_tor 0.9875

I followed the command exactly as specified in the README file, so I suspect there might be a configuration issue or perhaps a bug in the code.

plainerman (Owner)

It was common for us to see epochs with outliers and very large losses, but the values should not be consistently this large.

In our run, valinf_rmsds_lt2 only started to look promising after ~50 epochs. How long did you train for?

fengshikun (Author)

I have trained for approximately 100 epochs, and the latest results for Val inference rmsds_lt2 are consistently zero, as shown below:

Epoch 89: Val inference rmsds_lt2 0.000 rmsds_lt5 0.000 sc_rmsds_lt2 3.000 sc_rmsds_lt1 0.000, sc_rmsds_lt0.5 0.000 avg_improve 16.225 avg_worse 17.347  sc_rmsds_lt2_from_holo 3.000 sc_rmsds_lt1_from_holo 0.000, sc_rmsds_lt05_from_holo.5 0.000 sc_rmsds_avg_improvement_from_holo 15.128 sc_rmsds_avg_worsening_from_holo 19.187  
Storing best sc_rmsds_lt05_from_holo model
Run name:  big_score_model

Epoch 94: Val inference rmsds_lt2 0.000 rmsds_lt5 0.000 sc_rmsds_lt2 3.000 sc_rmsds_lt1 0.000, sc_rmsds_lt0.5 0.000 avg_improve 16.219 avg_worse 24.843  sc_rmsds_lt2_from_holo 2.000 sc_rmsds_lt1_from_holo 0.000, sc_rmsds_lt05_from_holo.5 0.000 sc_rmsds_avg_improvement_from_holo 16.306 sc_rmsds_avg_worsening_from_holo 18.885  
Storing best sc_rmsds_lt05_from_holo model
Run name:  big_score_model

fengshikun (Author)

Below is the complete training log file for your reference.
big_score_model_resume.log

plainerman (Owner)

Could you try --limit_complexes 100 and setting the train split to the validation split?

i.e.
--split_train data/splits/timesplit_no_lig_overlap_val_aligned --split_val data/splits/timesplit_no_lig_overlap_val_aligned --limit_complexes 100

Does the problem persist then? With this we can see whether the model can overfit on a small subset.

fengshikun (Author)

> Could you try --limit_complexes 100 and setting the train split to the validation split?
>
> i.e. --split_train data/splits/timesplit_no_lig_overlap_val_aligned --split_val data/splits/timesplit_no_lig_overlap_val_aligned --limit_complexes 100
>
> Does the problem persist then? With this we can see whether the model can overfit on a small subset.

Thank you. I'll try it out and will share the results later.

plainerman (Owner)

You have to comment out this line for it to work:

assert not bool(set(complexes_train) & set(complexes_val)), "Train and val splits have overlapping complexes"
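
For reference, a minimal sketch of what that change could look like (the exact file and surrounding variable names depend on your checkout, and the optional warning below is my own addition, not something from the repository):

```python
# Disable the train/val overlap check so both splits can point at the same file.
# assert not bool(set(complexes_train) & set(complexes_val)), \
#     "Train and val splits have overlapping complexes"

# Optional: warn instead of asserting, so the intentional overlap stays visible in the log.
overlap = set(complexes_train) & set(complexes_val)
if overlap:
    print(f"Warning: train and val share {len(overlap)} complexes (intentional for this overfitting test)")
```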

plainerman (Owner) commented Jul 16, 2024

Maybe related: #6 (which has been fixed), so it may be worth pulling the latest version again.

fengshikun (Author)

> Maybe related: #6 (which has been fixed), so it may be worth pulling the latest version again.

Got it, thanks for the reminder.

fengshikun (Author) commented Jul 17, 2024

> Maybe related: #6 (which has been fixed), so it may be worth pulling the latest version again.

I have pulled the newest version of the codebase and trained the score model using only 100 complex structures. However, the loss continues to fluctuate and has not converged. The training command used is as follows:

python -u train.py --run_name big_score_model --test_sigma_intervals --log_dir workdir --lr 1e-3 --tr_sigma_min 0.1 --tr_sigma_max 5 --rot_sigma_min 0.03 --rot_sigma_max 1.55 --tor_sigma_min 0.03 --sidechain_tor_sigma_min 0.03 --batch_size 32 --ns 60 --nv 10 --num_conv_layers 6 --distance_embed_dim 64 --cross_distance_embed_dim 64 --sigma_embed_dim 64 --dynamic_max_cross --scheduler plateau --scale_by_sigma --dropout 0.1 --sampling_alpha 1 --sampling_beta 1 --remove_hs --c_alpha_max_neighbors 24 --atom_max_neighbors 8 --receptor_radius 15 --num_dataloader_workers 1 --cudnn_benchmark --rot_alpha 1 --rot_beta 1 --tor_alpha 1 --tor_beta 1 --val_inference_freq 5 --use_ema --scheduler_patience 30 --n_epochs 750 --all_atom --sh_lmax 1 --sh_lmax 1 --split_train data/splits/timesplit_no_lig_overlap_val_aligned --split_val data/splits/timesplit_no_lig_overlap_val_aligned --limit_complexes 100 --pocket_reduction --pocket_buffer 10 --flexible_sidechains --flexdist 3.5 --flexdist_distance_metric prism --protein_file protein_esmfold_aligned_tr_fix --compare_true_protein --conformer_match_sidechains --conformer_match_score exp --match_max_rmsd 2 --use_original_conformer_fallback --use_original_conformer

The complete training log is provided below:
big_score_model.log

glukhove commented Aug 7, 2024

@fengshikun hi, were you able to train the model?

fengshikun (Author)

> @fengshikun hi, were you able to train the model?

The loss still does not converge.

plainerman (Owner)

Sorry for not getting back to you sooner. I don't have any concrete results yet, but I think an issue may have been introduced when I ported parts of our code base and changed things for CUDA.
On CPU, I am able to overfit on individual samples.
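
For example, hiding the GPUs from the process forces a CPU run (this assumes train.py falls back to CPU when torch.cuda.is_available() is False; the run name below is just a placeholder):

```sh
# Hide all GPUs so PyTorch reports no CUDA devices and the script runs on CPU.
# Only the overfitting-related flags are shown; keep the remaining flags from the README.
CUDA_VISIBLE_DEVICES="" python -u train.py --run_name cpu_overfit_check \
    --split_train data/splits/timesplit_no_lig_overlap_val_aligned \
    --split_val data/splits/timesplit_no_lig_overlap_val_aligned \
    --limit_complexes 100
```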

I will see if I can pinpoint this issue. Any help is much appreciated, as I don't have much time for this project nowadays.
