Out of memory issue with ~1200 AA protein on NVIDIA A100 #104

Open
bh0085 opened this issue Dec 15, 2024 · 1 comment

@bh0085

bh0085 commented Dec 15, 2024

Hello! I am really enjoying experimenting with Boltz, but I run into out-of-memory errors on my 40 GB A100 (NVIDIA A100-SXM4-40GB) with any single protein longer than about 1200 AA. I am using the most recent version from pip, 0.3.2, and am wondering whether there are any Boltz parameter options (reducing sampling steps, etc.) or PyTorch settings (such as reducing float sizes) worth trying... or really anything else.

At first, increasing the shared memory available to my Docker container worked, but now, even with 128 GB allowed, the run still fails, so it seems very likely to be a GPU memory issue. On a slightly smaller protein, Boltz takes 36 GB of GPU memory.
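For anyone wanting to reproduce the GPU memory reading, here is a minimal sketch that polls `nvidia-smi` and reports how much headroom is left. It assumes `nvidia-smi` is on the PATH inside the container; the helper names (`parse_memory_csv`, `query_gpu_memory`, `headroom_fraction`) are hypothetical and not part of Boltz.

```python
import subprocess

# Standard nvidia-smi query: used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def headroom_fraction(used_mib, total_mib):
    """Fraction of GPU memory that is still free."""
    return (total_mib - used_mib) / total_mib
```

Calling `query_gpu_memory()` in a loop while a prediction runs gives a rough picture of peak usage; the sampling interval is a trade-off between overhead and the chance of missing a short spike.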

Super excited to keep using this awesome tool, and great work to the team!


Relevant logs for context:

boltz-1  | Running predictions for 1 structure
boltz-1  | Processing input data.
boltz-1  | Generating MSA for /data/jobs/DATA-6/af_out_complexes/temp_MOT_HUM_RPP_B07_FAS1_FC_STD_pos1_MOT_HUM_RPP_B07_FAS1_FC_STD_pos2_MOT_HUM_RPP_B07_FAS1_FC_STD_pos3/MOT_HUM_RPP_B07_FAS1_FC_STD_pos1_MOT_HUM_RPP_B07_FAS1_FC_STD_pos2_MOT_HUM_RPP_B07_FAS1_FC_STD_pos3.fa with 1 protein entities.
boltz-1  |
boltz-1  | Predicting: |          | 0/? [00:00<?, ?it/s]
boltz-1  | Predicting:   0%|          | 0/1 [00:00<?, ?it/s]
boltz-1  | Predicting DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]| WARNING: ran out of memory, skipping batch
boltz-1  |
boltz-1  | Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]Number of failed examples: 1
boltz-1  |
boltz-1  | Predicting DataLoader 0: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]
boltz-1  |
boltz-1  | 2024-12-15 02:49:50,877 - boltz_watcher - WARNING - Command stderr:
boltz-1  |   0%|          | 0/1 [00:00<?, ?it/s]
boltz-1  | 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
boltz-1  | 100%|██████████| 1/1 [00:01<00:00,  1.10s/it]
boltz-1  | GPU available: True (cuda), used: True
boltz-1  | TPU available: False, using: 0 TPU cores
boltz-1  | HPU available: False, using: 0 HPUs
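Because the run above still reaches 100% and exits normally, the failure is easy to miss in a wrapper script. Here is a minimal sketch that scans captured output for the two messages visible in this log. The function name `summarize_boltz_log` is hypothetical, and the message strings are copied from the log above, so they may differ in other Boltz versions.

```python
import re

# Strings taken verbatim from the log in this issue (assumed stable only for this version).
OOM_MARKER = "ran out of memory, skipping batch"
FAILED_RE = re.compile(r"Number of failed examples:\s*(\d+)")


def summarize_boltz_log(log_text):
    """Return whether an OOM warning appeared and the last reported failure count."""
    counts = [int(value) for value in FAILED_RE.findall(log_text)]
    return {
        "oom": OOM_MARKER in log_text,
        "failed_examples": counts[-1] if counts else 0,
    }
```

A watcher process could call this on the combined stdout and stderr and mark the job as failed whenever `oom` is true, rather than relying on the exit code.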

And parameter settings:
2024-12-15 03:02:54,901 - boltz_watcher - INFO - Running command: boltz predict /data/jobs/DATA-1/af_out_complexes/temp_MOT_HUM_RPP_B07_FAS1_MON_STD_pos1_MOT_HUM_RPP_B07_FAS1_MON_STD_pos2/MOT_HUM_RPP_B07_FAS1_MON_STD_pos1_MOT_HUM_RPP_B07_FAS1_MON_STD_pos2.fa --use_msa_server --out_dir /data/jobs/DATA-1/af_out_complexes --accelerator gpu --recycling_steps 3 --sampling_steps 200 --diffusion_samples 1

@bh0085 bh0085 changed the title from "Out of memory issue with A100" to "Out of memory issue with ~1200 AA protein on NVIDIA A100" on Dec 15, 2024
@jwohlwend
Owner

Could you please run this in a new output directory and let me know if it happens again? I suspect that you may be using an MSA that was preprocessed before the latest version. We've run tests and are able to process 1800 AA on a 40 GB GPU.
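One way to guarantee a new output directory on every run is to build the command with a timestamped `--out_dir`. This is only a sketch: the flags are the ones from the command quoted in the issue, while `build_predict_command` and the `boltz_run_` prefix are hypothetical names chosen for illustration.

```python
from datetime import datetime
from pathlib import Path


def build_predict_command(fasta_path, out_root):
    """Build a `boltz predict` command that writes to a fresh, timestamped directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = Path(out_root) / f"boltz_run_{stamp}"
    if out_dir.exists():
        # Refuse to reuse a directory that may hold previously processed MSAs.
        raise FileExistsError(f"{out_dir} already exists")
    return [
        "boltz", "predict", str(fasta_path),
        "--use_msa_server",
        "--out_dir", str(out_dir),
        "--accelerator", "gpu",
        "--recycling_steps", "3",
        "--sampling_steps", "200",
        "--diffusion_samples", "1",
    ]
```

The returned list can be passed straight to `subprocess.run(cmd, check=True)`. Nothing cached from an earlier run can be picked up, because the directory name changes with each invocation.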
