Out of memory issue with ~1200 AA protein on NVIDIA A100 #104

bh0085 · 2024-12-15T02:50:53Z

Hello! I am really enjoying experimenting with Boltz, but I run into out-of-memory errors on my 40GB A100 (NVIDIA A100-SXM4-40GB) for any single protein longer than about 1200 AA. I am using the most recent version on pip, 0.3.2, and am wondering whether there are any Boltz parameter options (reducing sampling steps, etc.) or PyTorch settings (such as reducing float precision) worth trying, or really anything else.

At first, raising the shared memory available to my Docker container worked, but now, even with 128GB allowed, the error persists, so it seems very likely to be a GPU memory issue. On a slightly smaller protein, Boltz takes 36GB of GPU memory.
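For anyone wanting to confirm that GPU memory (rather than container shared memory) is the limit, here is a minimal sketch that polls `nvidia-smi` from Python. The function names are hypothetical and not part of Boltz; the only external call is the standard `nvidia-smi --query-gpu` interface, and it assumes `nvidia-smi` is on the PATH inside the container.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' pairs (MiB), one GPU per line, into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory per GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` periodically while a prediction runs should show whether usage approaches the card's total before the "ran out of memory" warning appears.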

Super excited to keep using this awesome tool, and great work to the team!

Relevant logs for context:

boltz-1  | Running predictions for 1 structure
boltz-1  | Processing input data.
boltz-1  | Generating MSA for /data/jobs/DATA-6/af_out_complexes/temp_MOT_HUM_RPP_B07_FAS1_FC_STD_pos1_MOT_HUM_RPP_B07_FAS1_FC_STD_pos2_MOT_HUM_RPP_B07_FAS1_FC_STD_pos3/MOT_HUM_RPP_B07_FAS1_FC_STD_pos1_MOT_HUM_RPP_B07_FAS1_FC_STD_pos2_MOT_HUM_RPP_B07_FAS1_FC_STD_pos3.fa with 1 protein entities.
boltz-1  |
boltz-1  | Predicting: |          | 0/? [00:00<?, ?it/s]
boltz-1  | Predicting:   0%|          | 0/1 [00:00<?, ?it/s]
boltz-1  | Predicting DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]| WARNING: ran out of memory, skipping batch
boltz-1  |
boltz-1  | Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]Number of failed examples: 1
boltz-1  |
boltz-1  | Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]
boltz-1  |
boltz-1  | 2024-12-15 02:49:50,877 - boltz_watcher - WARNING - Command stderr:
boltz-1  |   0%|          | 0/1 [00:00<?, ?it/s]
boltz-1  | 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
boltz-1  | 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
boltz-1  | GPU available: True (cuda), used: True
boltz-1  | TPU available: False, using: 0 TPU cores
boltz-1  | HPU available: False, using: 0 HPUs

And parameter settings:
2024-12-15 03:02:54,901 - boltz_watcher - INFO - Running command: boltz predict /data/jobs/DATA-1/af_out_complexes/temp_MOT_HUM_RPP_B07_FAS1_MON_STD_pos1_MOT_HUM_RPP_B07_FAS1_MON_STD_pos2/MOT_HUM_RPP_B07_FAS1_MON_STD_pos1_MOT_HUM_RPP_B07_FAS1_MON_STD_pos2.fa --use_msa_server --out_dir /data/jobs/DATA-1/af_out_complexes --accelerator gpu --recycling_steps 3 --sampling_steps 200 --diffusion_samples 1

jwohlwend · 2024-12-15T18:06:14Z

Could you please run this in a new output directory and let me know if it happens again? I suspect that you may be using an MSA that was preprocessed prior to the latest version. We've run tests and are able to run 1800 AA on a 40GB GPU.
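One way to make sure each run gets a new output directory is to generate it per invocation. The sketch below is only an illustration: the helper name and the timestamped directory naming are assumptions, while the `boltz predict` flags are copied from the command in the original post.

```python
import shlex
from datetime import datetime
from pathlib import Path


def build_fresh_predict_command(input_path, out_root):
    """Return (command, out_dir), where out_dir is a newly created, empty directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = Path(out_root) / f"boltz_run_{stamp}"  # hypothetical naming scheme
    # exist_ok=False: fail loudly rather than reuse a directory with old results.
    out_dir.mkdir(parents=True, exist_ok=False)
    command = [
        "boltz", "predict", str(input_path),
        "--use_msa_server",
        "--out_dir", str(out_dir),
        "--accelerator", "gpu",
        "--recycling_steps", "3",
        "--sampling_steps", "200",
        "--diffusion_samples", "1",
    ]
    return command, out_dir
```

The returned list can be passed to `subprocess.run(command, check=True)`, or printed with `shlex.join(command)` to inspect it first. Because the directory is created empty, nothing preprocessed by an earlier version can be picked up from it.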

bh0085 changed the title ~~Out of memory issue with A100~~ Out of memory issue with ~1200 AA protein on NVIDIA A100 Dec 15, 2024
