boltz-1

jwohlwend · Nov 17, 2024 · 2deeafa
commit 2deeafa
Showing 90 changed files with 18,517 additions and 0 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Jeremy Wohlwend, Gabriele Corso, Saro Passaro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,65 @@
+<h1 align="center">Boltz-1:
+
+Democratizing Biomolecular Interaction Modeling
+</h1>
+
+![](docs/boltz1_pred_figure.png)
+
+Boltz-1 is an open-source model that predicts the 3D structure of proteins, RNA, DNA and small molecules. It handles modified residues, covalent ligands and glycans, and can condition the generation on pocket residues.
+
+For more information about the model, see our [technical report](https://gcorso.github.io/assets/boltz1.pdf).
+
+## Installation
+Install boltz from PyPI (recommended):
+
+```
+pip install boltz
+```
+
+or directly from GitHub for daily updates:
+
+```
+git clone https://github.com/jwohlwend/boltz.git
+cd boltz; pip install -e .
+```
+> Note: we recommend installing boltz in a fresh python environment
+
+## Inference
+
+You can run inference using Boltz-1 with:
+
+```
+boltz predict input_path
+```
+
+Boltz currently accepts three input formats:
+
+1. Fasta file, for most use cases
+
+2. A comprehensive YAML schema, for more complex use cases
+
+3. A directory containing files of the above formats, for batched processing
+
+To see all available options, run `boltz predict --help`. For more information on these input formats, see our [prediction instructions](docs/prediction.md).
+
+## Training
+
+If you're interested in retraining the model, see our [training instructions](docs/training.md).
+
+## Contributing
+
+We welcome external contributions and are eager to engage with the community. Connect with us on our [Slack channel](https://boltz-community.slack.com/archives/C0818M6DWH2) to discuss advancements, share insights, and foster collaboration around Boltz-1.
+
+## Coming very soon
+
+- [ ] Pocket conditioning support
+- [ ] More examples
+- [ ] Full data processing pipeline
+- [ ] Colab notebook for inference
+- [ ] Confidence model checkpoint
+- [ ] Support for custom paired MSA
+- [ ] Kernel integration
+
+## License
+
+Our model and code are released under the MIT License and can be freely used for both academic and commercial purposes.
diff --git a/docs/boltz1_pred_figure.png b/docs/boltz1_pred_figure.png
diff --git a/docs/prediction.md b/docs/prediction.md
@@ -0,0 +1,140 @@
+# Prediction
+
+Once you have installed `boltz`, you can start making predictions by simply running:
+
+`boltz predict <INPUT_PATH>`
+
+where `<INPUT_PATH>` is a path to the input file or a directory. The input file can be in either fasta format (enough for most use cases) or YAML format (for more complex inputs). If you specify a directory, `boltz` will run predictions on each `.yaml` or `.fasta` file in the directory.
+
+Before diving into more details about the input formats, here are the key differences in what they each support:
+
+| Feature  | Fasta              | YAML    |
+| -------- |--------------------| ------- |
+| Polymers | :white_check_mark: | :white_check_mark:   |
+| Smiles   | :white_check_mark: | :white_check_mark:   |
+| CCD code | :white_check_mark: | :white_check_mark:   |
+| Custom MSA | :white_check_mark: | :white_check_mark:   |
+| Modified Residues | :x:                |  :white_check_mark: |
+| Covalent bonds | :x:                | :white_check_mark:   |
+| Pocket conditioning | :x:                | :white_check_mark:   |
+
+
+
+## Fasta format
+
+The fasta format should contain entries as follows:
+
+```
+>CHAIN_ID|ENTITY_TYPE|MSA_PATH
+SEQUENCE
+```
+
+Here `CHAIN_ID` is a unique identifier for each input chain, and `ENTITY_TYPE` is one of `protein`, `dna`, `rna`, `smiles` or `ccd`. `MSA_PATH` is only specified for protein entities and is the path to the `.a3m` file containing a computed MSA for the sequence of the protein. Note that we support both smiles and CCD codes for ligands.
+
+For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity.
+
+As an example:
+
+```
+>A|protein|./examples/msa/seq1.a3m
+MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+>B|protein|./examples/msa/seq1.a3m
+MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+>C|ccd
+SAH
+>D|ccd
+SAH
+>E|smiles
+N[C@@H](Cc1ccc(O)cc1)C(=O)O
+>F|smiles
+N[C@@H](Cc1ccc(O)cc1)C(=O)O
+```
+
+
+## YAML format
+
+The YAML format is more flexible and allows for more complex inputs, particularly around covalent bonds. The schema of the YAML is the following:
+
+```yaml
+sequences:
+    - ENTITY_TYPE:
+        id: CHAIN_ID 
+        sequence: SEQUENCE    # only for protein, dna, rna
+        smiles: SMILES        # only for ligand, exclusive with ccd
+        ccd: CCD              # only for ligand, exclusive with smiles
+        msa: MSA_PATH         # only for protein
+        modifications:
+          - position: RES_IDX   # index of residue, starting from 1
+            ccd: CCD            # CCD code of the modified residue
+
+    - ENTITY_TYPE:
+        id: [CHAIN_ID, CHAIN_ID]    # multiple ids in case of multiple identical entities
+        ...
+constraints:
+    - bond:
+        atom1: [CHAIN_ID, RES_IDX, ATOM_NAME]
+        atom2: [CHAIN_ID, RES_IDX, ATOM_NAME]
+    - pocket:
+        binder: CHAIN_ID
+        contacts: [[CHAIN_ID, RES_IDX], [CHAIN_ID, RES_IDX]]
+```
+`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity has an `ENTITY_TYPE` of either `protein`, `dna` or `rna` and has a `sequence` attribute. Non-polymer entities are indicated by an `ENTITY_TYPE` equal to `ligand` and have a `smiles` or `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule, and it should be set as a list when the structure contains multiple identical entities. Protein entities should also contain an `msa` attribute, with `MSA_PATH` giving the path to the `.a3m` file containing a computed MSA for the sequence of the protein.
+
+The `modifications` field is an optional field that allows you to specify modified residues in the polymer (`protein`, `dna` or `rna`). The `position` field specifies the index (starting from 1) of the residue, and `ccd` is the CCD code of the modified residue. This field is currently only supported for CCD ligands.
+
+`constraints` is an optional field that allows you to specify additional information about the input structure. Currently, we support just `bond`. The `bond` constraint specifies a covalent bond between two atoms (`atom1` and `atom2`). It is currently only supported for CCD ligands and canonical residues. `CHAIN_ID` refers to the id of the chain set above, `RES_IDX` is the index (starting from 1) of the residue (1 for ligands), and `ATOM_NAME` is the standardized atom name (which can be verified in the CIF file of that component on the RCSB website).
+
+As an example:
+
+```yaml
+version: 1
+sequences:
+  - protein:
+      id: [A, B]
+      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+      msa: ./examples/msa/seq1.a3m
+  - ligand:
+      id: [C, D]
+      ccd: SAH
+  - ligand:
+      id: [E, F]
+      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O
+```
+
+
+## Options
+
+The following options are available for the `predict` command:
+
+    boltz predict [OPTIONS] input_path
+
+| **Option**                  | **Type**        | **Default**        | **Description**                                                                 |
+|-----------------------------|-----------------|--------------------|---------------------------------------------------------------------------------|
+| `--out_dir PATH`            | `PATH`          | `./`             | The path where to save the predictions.                                         |
+| `--cache PATH`              | `PATH`          | `~/.boltz`         | The directory where to download the data and model.                             |
+| `--checkpoint PATH`         | `PATH`          | None      | An optional checkpoint. Uses the provided Boltz-1 model by default.             |
+| `--devices INTEGER`         | `INTEGER`       | `1`                | The number of devices to use for prediction.                                    |
+| `--accelerator`             | `[gpu,cpu,tpu]` | `gpu`              | The accelerator to use for prediction.                                          |
+| `--recycling_steps INTEGER` | `INTEGER`       | `3`                | The number of recycling steps to use for prediction.                            |
+| `--sampling_steps INTEGER`  | `INTEGER`       | `200`              | The number of sampling steps to use for prediction.                             |
+| `--diffusion_samples INTEGER` | `INTEGER`       | `1`                | The number of diffusion samples to use for prediction.                          |
+| `--output_format`           | `[pdb,mmcif]`   | `mmcif`            | The output format to use for the predictions.                                   |
+| `--num_workers INTEGER`     | `INTEGER`       | `2`                | The number of dataloader workers to use for prediction.                         |
+| `--override`                | `FLAG`          | `False`            | Whether to override existing predictions if found.                              |
+
+## Output
+
+After running the model, the generated outputs are organized into the output directory following the structure below:
+```
+out_dir/
+├── lightning_logs/                                            # Logs generated during training or evaluation
+├── predictions/                                               # Contains the model's predictions
+    ├── [input_file1]/
+        ├── [input_file1]_model_0.cif                          # The predicted structure in CIF format
+        ...
+        └── [input_file1]_model_[diffusion_samples-1].cif      # The predicted structure in CIF format
+    └── [input_file2]/
+        ...
+└── processed/                                                 # Processed data used during execution 
+```
+The `predictions` folder contains one folder per input file. Each of these folders contains `diffusion_samples` predictions saved in the chosen `output_format`. The `processed` folder contains the processed input files that are used by the model during inference.
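The options table and output tree in `docs/prediction.md` can be sketched in Python as follows. This is a minimal sketch and not part of the commit: the helper names `predict_command` and `expected_outputs` are hypothetical, only flags listed in the options table are used, and the file names assume the default `mmcif` output shown in the tree. It also assumes that the per-input folder is named after the input file's stem, which the docs only imply.

```python
from pathlib import Path


def predict_command(input_path, out_dir="./", diffusion_samples=1,
                    recycling_steps=3, sampling_steps=200, override=False):
    """Build the argv list for `boltz predict` from flags in the options table."""
    cmd = [
        "boltz", "predict", str(input_path),
        "--out_dir", str(out_dir),
        "--recycling_steps", str(recycling_steps),
        "--sampling_steps", str(sampling_steps),
        "--diffusion_samples", str(diffusion_samples),
    ]
    if override:
        # --override is a flag, so it takes no value
        cmd.append("--override")
    return cmd


def expected_outputs(input_path, out_dir="./", diffusion_samples=1):
    """List the CIF files the documented output tree describes for one input.

    Assumption: the per-input folder and file prefix use the input file's stem.
    """
    name = Path(input_path).stem
    folder = Path(out_dir) / "predictions" / name
    return [folder / f"{name}_model_{i}.cif" for i in range(diffusion_samples)]


cmd = predict_command("examples/ligand.yaml", out_dir="out", diffusion_samples=2)
outputs = expected_outputs("examples/ligand.yaml", out_dir="out", diffusion_samples=2)
print(" ".join(cmd))
for path in outputs:
    print(path.as_posix())
```

Once `boltz` is installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; building the command as a list avoids shell quoting issues with paths.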
diff --git a/docs/training.md b/docs/training.md
@@ -0,0 +1,47 @@
+# Training
+
+## Download processed data
+
+Instructions on how to download the processed dataset for training are coming soon. We are currently uploading the data to sharable storage and will update this page when it is ready.
+
+## Modify the configuration file
+
+The training script requires a configuration file to run. This file specifies the paths to the data, the output directory, and other parameters of the data, model and training process. 
+
+Under `scripts/train/configs`, we provide template configuration files analogous to the ones we used for training the structure model (`structure.yaml`) and the confidence model (`confidence.yaml`).
+
+The following are the main parameters that you should modify in the configuration file to get the structure model to train:
+
+```yaml
+trainer:
+  devices: 1
+
+output: SET_PATH_HERE                 # Path to the output directory  
+resume: PATH_TO_CHECKPOINT_FILE       # Path to a checkpoint file to resume training from, if any; null otherwise
+
+data:
+  datasets:
+    - _target_: boltz.data.module.training.DatasetConfig
+      target_dir: PATH_TO_TARGETS_DIR       # Path to the directory containing the processed structure files
+      msa_dir: PATH_TO_MSA_DIR              # Path to the directory containing the processed MSA files
+
+  symmetries: PATH_TO_SYMMETRY_FILE      # Path to the file containing the molecule symmetry information
+  max_tokens: 512                        # Maximum number of tokens in the input sequence
+  max_atoms: 4608                        # Maximum number of atoms in the input structure
+```
+
+`max_tokens` and `max_atoms` are the maximum number of tokens and atoms in the crop. Depending on the size of the GPUs you are using (as well as the training speed desired), you may want to adjust these values. Other recommended values are 256 and 2304, or 384 and 3456 respectively.
+
+## Run the training script
+
+Before running the full training, we recommend using the debug flag. This turns off DDP (sets a single device), sets `num_workers` to 0 so everything runs in a single process, and disables wandb:
+
+    python scripts/train/train.py scripts/train/configs/structure.yaml debug=1
+
+Once that seems to run okay, you can kill it and launch the training run:
+
+    python scripts/train/train.py scripts/train/configs/structure.yaml
+
+We also provide a different configuration file to train the confidence model:
+
+    python scripts/train/train.py scripts/train/configs/confidence.yaml
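The three `max_tokens` / `max_atoms` pairs quoted in `docs/training.md` all keep the same atoms-per-token ratio, which the short check below makes explicit. This is only an arithmetic sketch: the helper `max_atoms_for` is hypothetical, and extending the ratio to crop sizes other than the three listed is an assumption, not something the training instructions state.

```python
# (max_tokens, max_atoms) pairs quoted in docs/training.md
CROP_SIZES = [(256, 2304), (384, 3456), (512, 4608)]

# Atoms-per-token ratio for each documented pair
ratios = {tokens: atoms / tokens for tokens, atoms in CROP_SIZES}
print(ratios)


def max_atoms_for(max_tokens, atoms_per_token=9):
    """Hypothetical helper: scale max_atoms with max_tokens at a fixed ratio.

    Assumption: the ratio seen in the documented pairs also suits other sizes.
    """
    return max_tokens * atoms_per_token


print(max_atoms_for(384))
```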
diff --git a/examples/ligand.fasta b/examples/ligand.fasta
@@ -0,0 +1,12 @@
+>A|protein|./examples/msa/seq1.a3m
+MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+>B|protein|./examples/msa/seq1.a3m
+MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+>C|ccd
+SAH
+>D|ccd
+SAH
+>E|smiles
+N[C@@H](Cc1ccc(O)cc1)C(=O)O
+>F|smiles
+N[C@@H](Cc1ccc(O)cc1)C(=O)O
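The header convention `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` used in `examples/ligand.fasta` can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the parser shipped in the repository: the names `FastaEntry` and `parse_boltz_fasta` are hypothetical, the checks follow only what `docs/prediction.md` describes, and joining multi-line sequences is assumed from ordinary fasta practice. The protein sequence is shortened to the `EFKEAFSLF` example from the docs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Entity types listed in docs/prediction.md for the fasta format
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


@dataclass
class FastaEntry:
    chain_id: str
    entity_type: str
    msa_path: Optional[str]
    sequence: str


def parse_boltz_fasta(text: str) -> List[FastaEntry]:
    """Parse `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` records (hypothetical helper)."""
    records = []  # [header, sequence] pairs in file order
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line  # assumption: sequences may span lines
        else:
            raise ValueError("sequence data found before the first header")

    entries, seen = [], set()
    for header, sequence in records:
        fields = header.split("|")
        if len(fields) not in (2, 3):
            raise ValueError(f"unexpected header: {header!r}")
        chain_id, entity_type = fields[0], fields[1]
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type!r}")
        if chain_id in seen:
            raise ValueError(f"duplicate chain id: {chain_id!r}")
        seen.add(chain_id)
        msa_path = fields[2] if len(fields) == 3 else None
        if msa_path is not None and entity_type != "protein":
            raise ValueError("MSA_PATH is only specified for protein entities")
        entries.append(FastaEntry(chain_id, entity_type, msa_path, sequence))
    return entries


example = """\
>A|protein|./examples/msa/seq1.a3m
EFKEAFSLF
>C|ccd
SAH
>E|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
"""
entries = parse_boltz_fasta(example)
for entry in entries:
    print(entry)
```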
diff --git a/examples/ligand.yaml b/examples/ligand.yaml
@@ -0,0 +1,12 @@
+version: 1  # Optional, defaults to 1
+sequences:
+  - protein:
+      id: [A, B]
+      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
+      msa: ./examples/msa/seq1.a3m
+  - ligand:
+      id: [C, D]
+      ccd: SAH
+  - ligand:
+      id: [E, F]
+      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O
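To show how the `id: [A, B]` shorthand in `examples/ligand.yaml` maps to individual chains, the sketch below expands a schema-shaped dictionary into one entry per chain. It is an illustration under assumptions rather than the repository's own loader: `expand_chains` is a hypothetical helper, the dictionary stands in for what a YAML loader would return for this file, and the protein sequence is shortened to the `EFKEAFSLF` example from the docs.

```python
POLYMER_TYPES = {"protein", "dna", "rna"}


def expand_chains(spec):
    """Return one (chain_id, entity_type, attributes) tuple per chain.

    Hypothetical helper; it checks only the rules stated in docs/prediction.md.
    """
    chains, seen = [], set()
    for item in spec["sequences"]:
        if len(item) != 1:
            raise ValueError("each entry must have exactly one ENTITY_TYPE key")
        (entity_type, attrs), = item.items()
        if entity_type in POLYMER_TYPES:
            if "sequence" not in attrs:
                raise ValueError(f"{entity_type} entries need a sequence")
        elif entity_type == "ligand":
            # smiles and ccd are mutually exclusive, and one is required
            if ("smiles" in attrs) == ("ccd" in attrs):
                raise ValueError("ligand needs exactly one of smiles or ccd")
        else:
            raise ValueError(f"unknown entity type: {entity_type!r}")

        ids = attrs["id"]
        ids = ids if isinstance(ids, list) else [ids]
        payload = {key: value for key, value in attrs.items() if key != "id"}
        for chain_id in ids:
            if chain_id in seen:
                raise ValueError(f"duplicate chain id: {chain_id!r}")
            seen.add(chain_id)
            chains.append((chain_id, entity_type, payload))
    return chains


# Dictionary mirroring examples/ligand.yaml (sequence shortened for brevity)
spec = {
    "version": 1,
    "sequences": [
        {"protein": {"id": ["A", "B"], "sequence": "EFKEAFSLF",
                     "msa": "./examples/msa/seq1.a3m"}},
        {"ligand": {"id": ["C", "D"], "ccd": "SAH"}},
        {"ligand": {"id": ["E", "F"], "smiles": "N[C@@H](Cc1ccc(O)cc1)C(=O)O"}},
    ],
}
chains = expand_chains(spec)
for chain_id, entity_type, payload in chains:
    print(chain_id, entity_type, sorted(payload))
```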