Hemichannel Test: Memory Issues (After the Chunking Update) #71

Open
amelie-iska opened this issue Nov 29, 2024 · 10 comments

Comments

@amelie-iska

amelie-iska commented Nov 29, 2024

Hi all, I just ran into an out-of-memory error on a hemichannel system (6 connexins), the same system as before. I can run this prediction with ColabFold, but not with Boltz-1.
YAML Input:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDGIKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI
  - protein:
      id: [G,H,I]
      sequence: FSLESERP
  - ligand:
      id: [J,K,L]
      smiles: CC(C)C[C@H](NC(=O)[C@H](CO)NC(=O)[C@@H](N)Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)O
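
For a rough sense of how large this input is, the sketch below estimates a token count from the YAML above. It assumes one token per protein residue and one token per ligand heavy atom, which may not match Boltz-1's tokenizer exactly. The helper names and the regex-based SMILES atom counter are hypothetical and only handle simple SMILES like the one above. The connexin length of 379 is the figure stated later in this thread.

```python
import re

# Matches bracket atoms, two-letter halogens, then single-letter organic-subset
# atoms (aliphatic and aromatic). A heuristic, not a full SMILES parser.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles: str) -> int:
    """Count heavy atoms in a simple SMILES string."""
    return len(_ATOM.findall(smiles))


def estimate_tokens(entities) -> int:
    """entities: iterable of (copies, kind, value), kind is 'protein' or 'ligand'.

    Assumption: 1 token per residue, 1 token per ligand heavy atom.
    """
    total = 0
    for copies, kind, value in entities:
        per_copy = count_heavy_atoms(value) if kind == "ligand" else len(value)
        total += copies * per_copy
    return total


if __name__ == "__main__":
    ligand = (
        "CC(C)C[C@H](NC(=O)[C@H](CO)NC(=O)[C@@H](N)Cc1ccccc1)C(=O)N[C@@H](CCC(=O)O)"
        "C(=O)N[C@@H](CO)C(=O)N[C@@H](CCC(=O)O)C(=O)N[C@@H](CCCNC(=N)N)C(=O)N1CCC[C@H]1C(=O)O"
    )
    entities = [
        (6, "protein", "X" * 379),   # stand-in for the 379-residue connexin
        (3, "protein", "FSLESERP"),  # peptide chains G, H, I
        (3, "ligand", ligand),       # ligand chains J, K, L
    ]
    print("estimated tokens:", estimate_tokens(entities))
```

Because chain copies multiply the count, six copies of a long chain dominate the total far more than the peptides or ligands do.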

Run Command:

boltz predict examples/connexin-peptide.yaml --recycling_steps 20 --diffusion_samples 5 --use_msa_server

Output:

(boltz-1) lily@il-gpu04:~/amelie/Workspace/boltz$ boltz predict examples/connexin-peptide.yaml --recycling_steps 20 --diffusion_samples 5 --use_msa_server
Downloading the model weights to /home/lily/.boltz/boltz1_conf.ckpt. You may change the cache directory with the --cache flag.
Checking input data.
Running predictions for 1 structure
Processing input data.
  0%|          | 0/1 [00:00<?, ?it/s]
Generating MSA for examples/connexin-peptide.yaml with 2 protein entities.
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:01 remaining: 00:00]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:00 remaining: 00:00]
Sleeping for 8s. Reason: PENDING | 0/300 [elapsed: 00:00 remaining: ?]
Sleeping for 7s. Reason: RUNNING | 8/300 [elapsed: 00:09 remaining: 05:33]
Sleeping for 9s. Reason: RUNNING | 15/300 [elapsed: 00:16 remaining: 05:15]
Sleeping for 9s. Reason: RUNNING | 24/300 [elapsed: 00:26 remaining: 04:59]
Sleeping for 8s. Reason: RUNNING | 33/300 [elapsed: 00:35 remaining: 04:47]
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:45 remaining: 00:00]
100%|██████████| 1/1 [00:48<00:00, 48.80s/it]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/lily/mambaforge/envs/boltz-1/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Predicting DataLoader 0:   0%|          | 0/1 [00:00<?, ?it/s]
| WARNING: ran out of memory, skipping batch
Predicting DataLoader 0: 100%|██████████| 1/1 [2:34:20<00:00,  0.00it/s]
Number of failed examples: 1
Predicting DataLoader 0: 100%|██████████| 1/1 [2:34:20<00:00,  0.00it/s]
(boltz-1) lily@il-gpu04:~/amelie/Workspace/boltz$ 
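
Note that in the output above the run does not crash: the batch is skipped and the command returns to the prompt after about two and a half hours. For anyone scripting these runs, a minimal sketch of scanning captured output for that condition is shown below. The function name is hypothetical, and the two message strings are copied from the log above, so they may differ in other Boltz versions.

```python
import re

# Strings taken verbatim from the log in this issue.
OOM_MARKER = "ran out of memory, skipping batch"
FAILED_RE = re.compile(r"Number of failed examples:\s*(\d+)")


def summarize_log(text: str) -> dict:
    """Return how many OOM warnings and failed examples appear in captured output."""
    failed = [int(n) for n in FAILED_RE.findall(text)]
    return {
        "oom_batches": text.count(OOM_MARKER),
        "failed_examples": max(failed, default=0),
    }


if __name__ == "__main__":
    sample = (
        "| WARNING: ran out of memory, skipping batch\n"
        "Number of failed examples: 1\n"
    )
    print(summarize_log(sample))
```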
@xinyu-dev

Same here

@YogBN

YogBN commented Dec 3, 2024

Same issue here with OOM.

@YaoYinYing

Same issue here: OOM with a 2-chain protein complex (<500 aa in total) on an A100 (40 GB).

@jwohlwend
Owner

We just released v0.3.2, which should address some of these issues. You can update by running pip install boltz -U. When testing, please remove any existing output folder for your input and run again, then let us know how it goes.
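
A minimal sketch of the "remove any existing output folder" step, for anyone automating re-runs. The helper name is hypothetical, and the folder naming (boltz_results_ followed by the input file stem, inside the output directory) is an assumption, so check what your own run actually created before deleting anything.

```python
from pathlib import Path
import shutil


def remove_stale_output(input_path: str, out_dir: str = ".") -> bool:
    """Delete a previous results folder for this input, if one exists.

    Assumption: results live in <out_dir>/boltz_results_<input stem>.
    Returns True if a folder was removed, False otherwise.
    """
    target = Path(out_dir) / f"boltz_results_{Path(input_path).stem}"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


if __name__ == "__main__":
    removed = remove_stale_output("examples/connexin-peptide.yaml")
    print("removed old results" if removed else "nothing to remove")
```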

@YaoYinYing

v0.3.2 works for my case!!!

@amelie-iska
Author

IT WORKED!!! 🔥 🔥 🔥

@amelie-iska
Author

amelie-iska commented Dec 5, 2024

I still had to truncate the last ~140 residues from the C-terminus of the connexins, though, so I ran with this YAML:

version: 1  # Optional, defaults to 1
sequences:
  - protein:
      id: [A,B,C,D,E,F]
      sequence: MGDWSALGRLLDKVQAYSTAGGKVWLSVLFIFRILLLGTAVESAWGDEQSAFVCNTQQPGCENVCYDKSFPISHVRFWVLQIIFVSTPTLLYLAHVFYLMRKEEKLNRKEEELKMVQNEGGNVDMHLKQIEIKKFKYGLEEHGKVKMRGGLLRTYIISILFKSVFEVGFIIIQWYMYGFSLSAIYTCKRDPCPHQVDCFLSRPTEKTIFIWFMLIVSIVSLALNIIELFYVTYKSIKDG

# Long disordered C-terminal tail of connexin
# IKGKKDPFSATNDAVISGKECGSPKYAYFNGCSSPTAPMSPPGYKLVTGERNPSSCRNYNKQASEQNWANYSAEQNRMGQAGSTISNTHAQPFDFSDEHQNTKKMAPGHEMQPLTILDQRPSSRASSHASSRPRPDDLEI

# Run command: 
# boltz predict examples/connexin-peptide.yaml --recycling_steps 20  --diffusion_samples 10 --use_msa_server

Also, I am trying to alleviate memory issues by adding the code below to src/boltz/main.py. Will this help?

import torch
torch.set_float32_matmul_precision('medium')

I'm rerunning with the full 379-residue connexins now and will report back with an update once the run either finishes or fails.
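
On the question above: torch.set_float32_matmul_precision controls the internal precision used for float32 matrix multiplications, which mainly affects speed. It does not change the dtype in which tensors are stored, so it is unlikely to reduce peak memory much. Truncation helps more because pairwise features grow with the square of the token count. The sketch below illustrates that scaling using the numbers from this thread (379 residues, 140 truncated, 6 connexin chains, 3 peptides of 8 residues). Ligand atoms are ignored, and the channel width of 128 and 4 bytes per value are assumptions, not values read from Boltz.

```python
def pair_memory_ratio(n_small: int, n_large: int) -> float:
    """Relative size of an N x N pair tensor for two token counts."""
    return (n_small / n_large) ** 2


def pair_tensor_gib(n_tokens: int, channels: int = 128, bytes_per_value: int = 4) -> float:
    """Size in GiB of ONE N x N x channels tensor (assumed width and dtype)."""
    return n_tokens ** 2 * channels * bytes_per_value / 2 ** 30


if __name__ == "__main__":
    full = 6 * 379 + 3 * 8               # protein tokens, full-length connexins
    truncated = 6 * (379 - 140) + 3 * 8  # after removing the C-terminal tail
    # Roughly 0.40: each pair tensor shrinks to about 40% of its full-length size.
    print("ratio:", round(pair_memory_ratio(truncated, full), 2))
    print("one pair tensor, full (GiB):", round(pair_tensor_gib(full), 2))
```

A real model holds many such activations at once, so this is only a lower bound on the trend, not a prediction of peak usage.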

@amelie-iska
Author

😔

@zongmingchua

Hi! I'm curious about your reason for using --recycling_steps 20 --diffusion_samples 10. Are the results better than with the default parameters?

@amelie-iska
Author

amelie-iska commented Dec 18, 2024

Hi @zongmingchua
In general, you can expect that increasing the number of recycling steps will improve prediction quality, and increasing the number of seeds/samples raises your chances of getting a good prediction. So, for larger, more complex systems, I generally do not use the default settings. Another thing you might try is increasing the number of timesteps used in the diffusion process, which should also improve quality. All of these changes increase run time, though, so keep that in mind.
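
As a rough illustration of those knobs, here is a small sketch that assembles the command line without running it. The helper name is hypothetical. --recycling_steps, --diffusion_samples, and --use_msa_server appear in the commands earlier in this thread; --sampling_steps is assumed to be the flag for the number of diffusion timesteps, so confirm it with boltz predict --help before relying on it.

```python
import shlex


def build_predict_command(input_yaml, recycling_steps=None, diffusion_samples=None,
                          sampling_steps=None, use_msa_server=False):
    """Build an argv list for `boltz predict`; unset options keep the CLI defaults."""
    cmd = ["boltz", "predict", input_yaml]
    options = (
        ("--recycling_steps", recycling_steps),
        ("--diffusion_samples", diffusion_samples),
        ("--sampling_steps", sampling_steps),  # assumed flag name, see note above
    )
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if use_msa_server:
        cmd.append("--use_msa_server")
    return cmd


if __name__ == "__main__":
    cmd = build_predict_command(
        "examples/connexin-peptide.yaml",
        recycling_steps=20,
        diffusion_samples=10,
        use_msa_server=True,
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```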
