Commit

boltz-1
jwohlwend committed Nov 17, 2024
0 parents commit 2deeafa
Showing 90 changed files with 18,517 additions and 0 deletions.
21 changes: 21 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Jeremy Wohlwend, Gabriele Corso, Saro Passaro

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
65 changes: 65 additions & 0 deletions README.md
@@ -0,0 +1,65 @@
<h1 align="center">Boltz-1:

Democratizing Biomolecular Interaction Modeling
</h1>

![](docs/boltz1_pred_figure.png)

Boltz-1 is an open-source model that predicts the 3D structure of proteins, RNA, DNA and small molecules. It handles modified residues, covalent ligands and glycans, and can condition the generation on pocket residues.

For more information about the model, see our [technical report](https://gcorso.github.io/assets/boltz1.pdf).

## Installation
Install boltz from PyPI (recommended):

```
pip install boltz
```

or directly from GitHub for daily updates:

```
git clone https://github.com/jwohlwend/boltz.git
cd boltz; pip install -e .
```
> Note: we recommend installing boltz in a fresh Python environment.

## Inference

You can run inference using Boltz-1 with:

```
boltz predict input_path
```

Boltz currently accepts three input formats:

1. Fasta file, for most use cases

2. A comprehensive YAML schema, for more complex use cases

3. A directory containing files of the above formats, for batched processing

To see all available options, run `boltz predict --help`. For more information on these input formats, see our [prediction instructions](docs/prediction.md).

## Training

If you're interested in retraining the model, see our [training instructions](docs/training.md).

## Contributing

We welcome external contributions and are eager to engage with the community. Connect with us on our [Slack channel](https://boltz-community.slack.com/archives/C0818M6DWH2) to discuss advancements, share insights, and foster collaboration around Boltz-1.

## Coming very soon

- [ ] Pocket conditioning support
- [ ] More examples
- [ ] Full data processing pipeline
- [ ] Colab notebook for inference
- [ ] Confidence model checkpoint
- [ ] Support for custom paired MSA
- [ ] Kernel integration

## License

Our model and code are released under MIT License, and can be freely used for both academic and commercial purposes.
Binary file added docs/boltz1_pred_figure.png
140 changes: 140 additions & 0 deletions docs/prediction.md
@@ -0,0 +1,140 @@
# Prediction

Once you have installed `boltz`, you can start making predictions by simply running:

`boltz predict <INPUT_PATH>`

where `<INPUT_PATH>` is a path to the input file or a directory. The input file can either be in fasta (enough for most use cases) or YAML format (for more complex inputs). If you specify a directory, `boltz` will run predictions on each `.yaml` or `.fasta` file in the directory.
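The directory behaviour described above can be pictured with a short sketch. This is not the code `boltz` itself uses: the helper name `collect_inputs`, the sorting, and the choice to ignore subdirectories are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical helper, not part of boltz: it only mirrors the rule stated
# above (a single file, or every .fasta/.yaml file in a directory).
SUPPORTED_SUFFIXES = {".fasta", ".yaml"}


def collect_inputs(input_path):
    """Return the list of input files that `input_path` refers to."""
    path = Path(input_path)
    if path.is_dir():
        # Sorting and skipping subdirectories are assumptions of this sketch.
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    return [path]
```

Listing the files this way before launching a batch run is a cheap check that the directory contains what you expect.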

Before diving into more details about the input formats, here are the key differences in what they each support:

| Feature | Fasta | YAML |
| -------- |--------------------| ------- |
| Polymers | :white_check_mark: | :white_check_mark: |
| Smiles | :white_check_mark: | :white_check_mark: |
| CCD code | :white_check_mark: | :white_check_mark: |
| Custom MSA | :white_check_mark: | :white_check_mark: |
| Modified Residues | :x: | :white_check_mark: |
| Covalent bonds | :x: | :white_check_mark: |
| Pocket conditioning | :x: | :white_check_mark: |



## Fasta format

The fasta format should contain entries as follows:

```
>CHAIN_ID|ENTITY_TYPE|MSA_PATH
SEQUENCE
```

Where `CHAIN_ID` is a unique identifier for each input chain, `ENTITY_TYPE` can be one of `protein`, `dna`, `rna`, `smiles`, `ccd` and `MSA_PATH` is only specified for protein entities and is the path to the `.a3m` file containing a computed MSA for the sequence of the protein. Note that we support both smiles and CCD code for ligands.

For each of these cases, the corresponding `SEQUENCE` will contain an amino acid sequence (e.g. `EFKEAFSLF`), a sequence of nucleotide bases (e.g. `ATCG`), a smiles string (e.g. `CC1=CC=CC=C1`), or a CCD code (e.g. `ATP`), depending on the entity.

As an example:

```
>A|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>B|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>C|ccd
SAH
>D|ccd
SAH
>E|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
>F|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
```
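To make the header layout concrete, here is a minimal parsing sketch for the `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` records shown above. It is an illustration only and not the parser shipped with `boltz`; the names `FastaEntry` and `parse_boltz_fasta` are hypothetical, and joining wrapped sequence lines is an assumption borrowed from ordinary FASTA files.

```python
from dataclasses import dataclass
from typing import Optional

# Entity types listed in the format description above.
ENTITY_TYPES = {"protein", "dna", "rna", "smiles", "ccd"}


@dataclass
class FastaEntry:
    chain_id: str
    entity_type: str
    msa_path: Optional[str]  # only expected for protein entries
    sequence: str


def parse_boltz_fasta(text):
    """Parse `>CHAIN_ID|ENTITY_TYPE|MSA_PATH` records into FastaEntry objects."""
    records = []  # each item is [header, list_of_sequence_lines]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], []])
        elif records:
            records[-1][1].append(line)
        else:
            raise ValueError("sequence data found before the first header")

    entries = []
    for header, parts in records:
        fields = [f.strip() for f in header.split("|")]
        if len(fields) < 2:
            raise ValueError(f"malformed header: {header!r}")
        chain_id, entity_type = fields[0], fields[1]
        if entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {entity_type!r}")
        # The third field is optional; ligands in the example omit it.
        msa_path = fields[2] if len(fields) > 2 and fields[2] else None
        entries.append(FastaEntry(chain_id, entity_type, msa_path, "".join(parts)))
    return entries
```

Running this over `examples/ligand.fasta` would be expected to yield six entries (chains A to F), with an MSA path only on the two protein chains.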


## YAML format

The YAML format is more flexible and allows for more complex inputs, particularly around covalent bonds. The schema of the YAML is the following:

```yaml
sequences:
    - ENTITY_TYPE:
        id: CHAIN_ID
        sequence: SEQUENCE    # only for protein, dna, rna
        smiles: SMILES        # only for ligand, exclusive with ccd
        ccd: CCD              # only for ligand, exclusive with smiles
        msa: MSA_PATH         # only for protein
        modifications:
          - position: RES_IDX   # index of residue, starting from 1
            ccd: CCD            # CCD code of the modified residue

    - ENTITY_TYPE:
        id: [CHAIN_ID, CHAIN_ID]    # multiple ids in case of multiple identical entities
        ...
constraints:
    - bond:
        atom1: [CHAIN_ID, RES_IDX, ATOM_NAME]
        atom2: [CHAIN_ID, RES_IDX, ATOM_NAME]
    - pocket:
        binder: CHAIN_ID
        contacts: [[CHAIN_ID, RES_IDX], [CHAIN_ID, RES_IDX]]
```
`sequences` has one entry for every unique chain/molecule in the input. Each polymer entity has an `ENTITY_TYPE` of either `protein`, `dna` or `rna` and has a `sequence` attribute. Non-polymer entities are indicated by an `ENTITY_TYPE` equal to `ligand` and have a `smiles` or `ccd` attribute. `CHAIN_ID` is the unique identifier for each chain/molecule, and it should be set as a list in case of multiple identical entities in the structure. Protein entities should also contain an `msa` attribute with `MSA_PATH` indicating the path to the `.a3m` file containing a computed MSA for the sequence of the protein.

The `modifications` field is optional and allows you to specify modified residues in the polymer (`protein`, `dna` or `rna`). The `position` field specifies the index (starting from 1) of the residue, and `ccd` is the CCD code of the modified residue. This field is currently only supported for CCD ligands.

`constraints` is an optional field that allows you to specify additional information about the input structure. Currently, we support just `bond`. The `bond` constraint specifies a covalent bond between two atoms (`atom1` and `atom2`). It is currently only supported for CCD ligands and canonical residues. `CHAIN_ID` refers to the id of the chain set above, `RES_IDX` is the index (starting from 1) of the residue (1 for ligands), and `ATOM_NAME` is the standardized atom name (which can be verified in the CIF file of that component on the RCSB website).

As an example:

```yaml
version: 1
sequences:
  - protein:
      id: [A, B]
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
      msa: ./examples/msa/seq1.a3m
  - ligand:
      id: [C, D]
      ccd: SAH
  - ligand:
      id: [E, F]
      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O
```
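The rules stated for `sequences` (polymers carry a `sequence`, ligands carry exactly one of `smiles` or `ccd`, chain ids are unique) can be checked before launching a prediction. The sketch below is a hypothetical pre-flight check, not something `boltz` provides. It assumes the YAML has already been loaded into plain Python dicts and lists (for instance with a YAML parser of your choice), and the function name `check_sequences` is made up for this example.

```python
POLYMER_TYPES = {"protein", "dna", "rna"}


def check_sequences(spec):
    """Return a list of human-readable problems found in spec["sequences"]."""
    problems = []
    seen_ids = set()
    for i, entry in enumerate(spec.get("sequences", [])):
        if len(entry) != 1:
            problems.append(f"entry {i}: expected exactly one ENTITY_TYPE key")
            continue
        entity_type, fields = next(iter(entry.items()))

        if entity_type in POLYMER_TYPES:
            if "sequence" not in fields:
                problems.append(f"entry {i}: {entity_type} needs a `sequence`")
        elif entity_type == "ligand":
            # `smiles` and `ccd` are mutually exclusive, and one is required.
            if ("smiles" in fields) == ("ccd" in fields):
                problems.append(f"entry {i}: ligand needs exactly one of `smiles` or `ccd`")
        else:
            problems.append(f"entry {i}: unknown ENTITY_TYPE {entity_type!r}")

        # `id` may be a single CHAIN_ID or a list of them.
        ids = fields.get("id", [])
        ids = ids if isinstance(ids, list) else [ids]
        if not ids:
            problems.append(f"entry {i}: missing `id`")
        for chain_id in ids:
            if chain_id in seen_ids:
                problems.append(f"entry {i}: duplicate CHAIN_ID {chain_id!r}")
            seen_ids.add(chain_id)
    return problems
```

An empty list means none of these particular rules were violated; it says nothing about whether the sequences or ligands themselves are chemically sensible.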


## Options

The following options are available for the `predict` command:

    boltz predict [OPTIONS] input_path

| **Option** | **Type** | **Default** | **Description** |
|-----------------------------|-----------------|--------------------|---------------------------------------------------------------------------------|
| `--out_dir PATH` | `PATH` | `./` | The path where to save the predictions. |
| `--cache PATH` | `PATH` | `~/.boltz` | The directory where to download the data and model. |
| `--checkpoint PATH` | `PATH` | None | An optional checkpoint. Uses the provided Boltz-1 model by default. |
| `--devices INTEGER` | `INTEGER` | `1` | The number of devices to use for prediction. |
| `--accelerator` | `[gpu,cpu,tpu]` | `gpu` | The accelerator to use for prediction. |
| `--recycling_steps INTEGER` | `INTEGER` | `3` | The number of recycling steps to use for prediction. |
| `--sampling_steps INTEGER` | `INTEGER` | `200` | The number of sampling steps to use for prediction. |
| `--diffusion_samples INTEGER` | `INTEGER` | `1` | The number of diffusion samples to use for prediction. |
| `--output_format` | `[pdb,mmcif]` | `mmcif` | The output format to use for the predictions. |
| `--num_workers INTEGER` | `INTEGER` | `2` | The number of dataloader workers to use for prediction. |
| `--override` | `FLAG` | `False` | Whether to override existing predictions if found. |
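
When predictions are launched from a script, it can help to assemble the command from the options in the table rather than by string concatenation. The following is a sketch under that assumption: `build_predict_command` is a hypothetical helper, it only uses flags listed above, and its defaults simply repeat the table.

```python
import shlex


def build_predict_command(input_path, out_dir="./", recycling_steps=3,
                          sampling_steps=200, diffusion_samples=1,
                          output_format="mmcif", override=False):
    """Assemble the argument list for `boltz predict [OPTIONS] input_path`."""
    if output_format not in ("pdb", "mmcif"):
        raise ValueError("output_format must be 'pdb' or 'mmcif'")
    cmd = [
        "boltz", "predict",
        "--out_dir", str(out_dir),
        "--recycling_steps", str(recycling_steps),
        "--sampling_steps", str(sampling_steps),
        "--diffusion_samples", str(diffusion_samples),
        "--output_format", output_format,
    ]
    if override:
        cmd.append("--override")  # a flag: present or absent, takes no value
    cmd.append(str(input_path))   # positional argument goes last, as in the usage line
    return cmd


cmd = build_predict_command("examples/ligand.yaml", out_dir="results", diffusion_samples=5)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once `boltz` is installed.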

## Output

After running the model, the generated outputs are organized into the output directory following the structure below:
```
out_dir/
├── lightning_logs/                                            # Logs generated during training or evaluation
├── predictions/                                               # Contains the model's predictions
    ├── [input_file1]/
        ├── [input_file1]_model_0.cif                          # The predicted structure in CIF format
        ...
        └── [input_file1]_model_[diffusion_samples-1].cif      # The predicted structure in CIF format
    └── [input_file2]/
        ...
└── processed/                                                 # Processed data used during execution
```
The `predictions` folder contains a unique folder for each input file. Each of these folders contains `diffusion_samples` predictions saved in the chosen `output_format`. The `processed` folder contains the processed input files that are used by the model during inference.
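
For downstream scripts, the layout above can be turned into the list of files to expect. This sketch rests on two assumptions that the text does not confirm: that `[input_file1]` is the input file name without its extension, and that `--output_format pdb` produces a `.pdb` suffix (only `.cif` appears in the tree). The helper name is hypothetical.

```python
from pathlib import Path

# Assumed mapping from --output_format to file extension; `.pdb` is a guess.
EXTENSIONS = {"mmcif": ".cif", "pdb": ".pdb"}


def expected_prediction_paths(out_dir, input_file, diffusion_samples=1,
                              output_format="mmcif"):
    """List the structure files one input is expected to produce."""
    stem = Path(input_file).stem  # assumption: folder is named after the file stem
    folder = Path(out_dir) / "predictions" / stem
    ext = EXTENSIONS[output_format]
    # Models are numbered from 0 to diffusion_samples - 1.
    return [folder / f"{stem}_model_{i}{ext}" for i in range(diffusion_samples)]
```

Comparing this list against what is actually on disk is a quick way to spot inputs whose prediction did not finish.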
47 changes: 47 additions & 0 deletions docs/training.md
@@ -0,0 +1,47 @@
# Training

## Download processed data

Instructions on how to download the processed dataset for training are coming soon. We are currently uploading the data to shareable storage and will update this page when ready.

## Modify the configuration file

The training script requires a configuration file to run. This file specifies the paths to the data, the output directory, and other parameters of the data, model and training process.

Under `scripts/train/configs` we provide template configuration files analogous to the ones we used for training the structure model (`structure.yaml`) and the confidence model (`confidence.yaml`).

The following are the main parameters that you should modify in the configuration file to get the structure model to train:

```yaml
trainer:
  devices: 1

output: SET_PATH_HERE                # Path to the output directory
resume: PATH_TO_CHECKPOINT_FILE      # Path to a checkpoint file to resume training from, if any; null otherwise

data:
  datasets:
    - _target_: boltz.data.module.training.DatasetConfig
      target_dir: PATH_TO_TARGETS_DIR    # Path to the directory containing the processed structure files
      msa_dir: PATH_TO_MSA_DIR           # Path to the directory containing the processed MSA files

  symmetries: PATH_TO_SYMMETRY_FILE      # Path to the file containing the molecule symmetry information
  max_tokens: 512                        # Maximum number of tokens in the input sequence
  max_atoms: 4608                        # Maximum number of atoms in the input structure
```
`max_tokens` and `max_atoms` are the maximum number of tokens and atoms in the crop. Depending on the size of the GPUs you are using (as well as the training speed desired), you may want to adjust these values. Other recommended values are 256 and 2304, or 384 and 3456 respectively.
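
The three crop sizes mentioned above share the same ratio of nine atoms per token, which the short check below makes explicit. Whether other crop sizes should keep that ratio is an assumption of this sketch, and `max_atoms_for` is a hypothetical helper rather than anything defined in the training code.

```python
# (max_tokens, max_atoms) pairs quoted in the text above.
CROP_SIZES = [(256, 2304), (384, 3456), (512, 4608)]

for max_tokens, max_atoms in CROP_SIZES:
    atoms_per_token = max_atoms / max_tokens
    print(f"max_tokens={max_tokens}  max_atoms={max_atoms}  atoms/token={atoms_per_token:g}")


def max_atoms_for(max_tokens, atoms_per_token=9):
    """Scale max_atoms with max_tokens at the ratio shared by the quoted pairs."""
    return max_tokens * atoms_per_token
```

If you pick a `max_tokens` value to fit your GPU memory, `max_atoms_for` gives a matching `max_atoms` under that assumed ratio.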

## Run the training script

Before running the full training, we recommend using the debug flag. This turns off DDP (sets a single device), sets `num_workers` to 0 so everything runs in a single process, and disables wandb:

    python scripts/train/train.py scripts/train/configs/structure.yaml debug=1

Once that seems to run okay, you can kill it and launch the training run:

    python scripts/train/train.py scripts/train/configs/structure.yaml

We also provide a different configuration file to train the confidence model:

    python scripts/train/train.py scripts/train/configs/confidence.yaml
12 changes: 12 additions & 0 deletions examples/ligand.fasta
@@ -0,0 +1,12 @@
>A|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>B|protein|./examples/msa/seq1.a3m
MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
>C|ccd
SAH
>D|ccd
SAH
>E|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
>F|smiles
N[C@@H](Cc1ccc(O)cc1)C(=O)O
12 changes: 12 additions & 0 deletions examples/ligand.yaml
@@ -0,0 +1,12 @@
version: 1 # Optional, defaults to 1
sequences:
  - protein:
      id: [A, B]
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVMEAAHELGVFAALAEAPADSGELARRLDCDARAMRVLLDALYAYDVIDRIHDTNGFRYLLSAEARECLLPGTLFSLVGKFMHDINVAWPAWRNLAEVVRHGARDTSGAESPNGIAQEDYESLVGGINFWAPPIVTTLSRKLRASGRSGDATASVLDVGCGTGLYSQLLLREFPRWTATGLDVERIATLANAQALRLGVEERFATRAGDFWRGGWGTGYDLVLFANIFHLQTPASAVRLMRHAAACLAPDGLVAVVDQIVDADREPKTPQDRFALLFAASMTNTGGGDAYTFQEYEEWFTAAGLQRIETLDTPMHRILLARRATEPSAVPEGQASENLYFQ
      msa: ./examples/msa/seq1.a3m
  - ligand:
      id: [C, D]
      ccd: SAH
  - ligand:
      id: [E, F]
      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O