Merge pull request #105 from jwohlwend/pipeline

Preprocessing pipeline

jwohlwend authored Dec 21, 2024
2 parents 2d9d45b + ff681f2 commit e75cbdc

Showing 10 changed files with 2,203 additions and 3 deletions.
190 changes: 188 additions & 2 deletions docs/training.md
@@ -1,8 +1,41 @@
# Training

## Download the pre-processed data

To run training, you will need to download a few pre-processed datasets. Note that you will need ~250G of storage for all the data; a combined download sketch follows the list below. If you would instead like to re-run the preprocessing pipeline or process your own raw data for training, please see the [instructions](#processing-raw-data) at the bottom of this page.

- The pre-processed RCSB (i.e., PDB) structures:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_processed_targets.tar
tar -xf rcsb_processed_targets.tar
rm rcsb_processed_targets.tar
```

- The pre-processed RCSB (i.e., PDB) MSAs:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_processed_msa.tar
tar -xf rcsb_processed_msa.tar
rm rcsb_processed_msa.tar
```

- The pre-processed OpenFold structures:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_processed_targets.tar
tar -xf openfold_processed_targets.tar
rm openfold_processed_targets.tar
```

- The pre-processed OpenFold MSAs:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_processed_msa.tar
tar -xf openfold_processed_msa.tar
rm openfold_processed_msa.tar
```

- The pre-computed symmetry files for ligands:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/symmetries.pkl
```
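
For convenience, here is a minimal sketch that scripts all of the downloads above in one pass (same URLs as the individual commands; assumes `wget` is available and ~250G of free space):

```bash
#!/usr/bin/env bash
set -euo pipefail

BASE=https://boltz1.s3.us-east-2.amazonaws.com

# Fetch and unpack each pre-processed archive, deleting the tar afterwards
# to reclaim space.
for name in rcsb_processed_targets rcsb_processed_msa \
            openfold_processed_targets openfold_processed_msa; do
    wget "${BASE}/${name}.tar"
    tar -xf "${name}.tar"
    rm "${name}.tar"
done

# The ligand symmetry file is a single pickle; no extraction needed.
wget "${BASE}/symmetries.pkl"
```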

## Modify the configuration file

@@ -45,3 +78,156 @@ Once that seems to run okay, you can kill it and launch the training run:
We also provide a different configuration file to train the confidence model:

python scripts/train/train.py scripts/train/configs/confidence.yaml


## Processing raw data

We have already pre-processed the training data for the PDB and the OpenFold self-distillation set. However, if you'd like to replicate the processing pipeline or process your own data for training, you can follow the instructions below.


#### Step 1: Go to the processing folder

```bash
cd scripts/process
```

#### Step 2: Install requirements

Install the few extra dependencies required for processing:

```bash
pip install -r requirements.txt
```

You must also install two external tools, `mmseqs` and `redis`. Installation instructions are linked below:

- `mmseqs`: https://github.com/soedinglab/mmseqs2?tab=readme-ov-file#installation
- `redis`: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/

#### Step 3: Preprocess the CCD dictionary


We have already done this for you; the relevant file can be downloaded here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.pkl
```

Unless you wish to do it again yourself, you can skip to the next step! If you do want to recreate the file, you can do so with the following commands:

```bash
wget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif
python ccd.py --components components.cif --outdir ./ccd
```

> Note: `ccd.py` runs in parallel by default with as many threads as there are CPU cores on your machine; this can be changed with the `--num_processes` flag.
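
For example, to cap the pool at 8 worker processes (using only the flags documented above):

```bash
python ccd.py --components components.cif --outdir ./ccd --num_processes 8
```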

#### Step 4: Create sequence clusters

First, you must create a fasta file containing all the polymer sequences present in your data. You can use any header format you want for the sequences; the headers are not used.

For the PDB, this file can be downloaded directly:
```bash
wget https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
gunzip pdb_seqres.txt.gz
```

> Note: for the OpenFold data, since the sequences were chosen for diversity, we do not apply any clustering.

When this is done, you can run the clustering script, which assigns proteins to 40% sequence-similarity clusters and gives each unique RNA/DNA sequence its own cluster. Each ligand CCD code is also assigned to its own cluster.

```bash
python cluster.py --ccd ccd.pkl --sequences pdb_seqres.txt --mmseqs PATH_TO_MMSEQS_EXECUTABLE --output ./clustering
```

> Note: you must install mmseqs (see: https://github.com/soedinglab/mmseqs2?tab=readme-ov-file#installation)
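
As a point of reference only (this is not necessarily the exact invocation `cluster.py` performs internally), clustering protein sequences at 40% identity is the kind of job `mmseqs` handles directly:

```bash
# Illustrative mmseqs call: cluster protein sequences at 40% identity.
# cluster.py wraps similar functionality together with the RNA/DNA and
# ligand cluster assignments described above.
mmseqs easy-cluster pdb_seqres.txt cluster_results tmp --min-seq-id 0.4
```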

#### Step 5: Create MSAs

We have already computed MSAs for all sequences in the PDB at the time of training, using the ColabFold `colabfold_search` tool. You can set up your own local ColabFold using the instructions provided here: https://github.com/YoshitakaMo/localcolabfold

The raw MSAs for the PDB can be found here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_raw_msa.tar
tar -xf rcsb_raw_msa.tar
rm rcsb_raw_msa.tar
```
> Note: this file is 130G, and will take another 130G to extract before you can delete the original tar archive, so make sure you have enough storage on your machine.

You can also download the raw OpenFold MSAs here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_raw_msa.tar
tar -xf openfold_raw_msa.tar
rm openfold_raw_msa.tar
```

> Note: this file is 88G, and will take another 88G to extract before you can delete the original tar archive, so make sure you have enough storage on your machine.

If you wish to use your own MSAs, ensure that each file is named with the hash of its query sequence, according to the following function:
```python
import hashlib
def hash_sequence(seq: str) -> str:
    """Hash a sequence."""
    return hashlib.sha256(seq.encode()).hexdigest()
```
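
The same hash can be computed in the shell to rename an MSA file for its query sequence. A minimal sketch, assuming GNU coreutils; `SEQ` and `query.a3m` are placeholders:

```bash
# Compute the SHA-256 of the raw sequence bytes; printf '%s' avoids a
# trailing newline, matching the Python function above.
SEQ="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder query sequence
mv query.a3m "$(printf '%s' "$SEQ" | sha256sum | cut -d' ' -f1).a3m"
```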

#### Step 6: Process MSAs

During MSA processing, among other things, we annotate sequences with their taxonomy ID, which is important for MSA pairing during training. This annotation is applied only to MSA sequences whose headers start with the following:

```
>UniRef100_UNIREFID
...
```

This is the header format produced by ColabFold. If you use a different MSA pipeline, make sure your UniRef MSAs follow the format above.
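
A quick way to sanity-check an MSA before processing is to count how many of its sequences will receive taxonomy annotation (`my_msa.a3m` is a placeholder file name):

```bash
# Count headers eligible for taxonomy annotation, i.e. UniRef100 entries.
grep -c '^>UniRef100_' my_msa.a3m
```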

Next, you should download our provided taxonomy database and place it in the current folder:

```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/taxonomy.rdb
```

You can now process the raw MSAs. First, launch a Redis server; we use Redis to share the large taxonomy dictionary across workers, so MSA processing can run in parallel without blowing up RAM usage.

```bash
redis-server --dbfilename taxonomy.rdb --port 7777
```

Please wait a few minutes for the DB to initialize. It will print `Ready to accept connections` when ready.
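
You can confirm the server is up from another shell before proceeding; `redis-cli` ships with Redis and should reply `PONG`:

```bash
redis-cli -p 7777 ping
```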

> Note: You must have redis installed (see: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)

In a separate shell, run the MSA processing script:
```bash
python msa.py --msadir YOUR_MSA_DIR --outdir YOUR_OUTPUT_DIR --redis-port 7777
```

> Important: the script looks for `.a3m` or `.a3m.gz` files in the directory; make sure your files match this extension and format.
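
To verify that the script will actually pick up your files, you can count the matching entries first (`YOUR_MSA_DIR` as above):

```bash
# Only files with these extensions are picked up by the processing script.
find YOUR_MSA_DIR -name '*.a3m' -o -name '*.a3m.gz' | wc -l
```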

#### Step 7: Process structures

Finally, we're ready to process the structural data. We provide two scripts: one for the PDB and one for the OpenFold data. In general, we recommend using the `rcsb.py` script for your own data, which is expected to be in `mmCIF` format.

You can download the full RCSB using the instructions here:
https://www.rcsb.org/docs/programmatic-access/file-download-services
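
For reference, RCSB's documented rsync service can mirror the full mmCIF archive; a typical command (per the page above; the destination directory is up to you) looks roughly like:

```bash
# Mirror the complete divided mmCIF archive from RCSB (very large download).
rsync -rlpt -v -z --delete --port=33444 \
    rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF
```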

As in Step 6, launch a Redis server, this time loading a pre-processed CCD database that we provide:

```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.rdb
redis-server --dbfilename ccd.rdb --port 7777
```
> Note: You must have redis installed (see: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)

In a separate shell, run the processing script; make sure to use the `clustering/clustering.json` file you created previously.
```bash
python rcsb.py --datadir PATH_TO_MMCIF_DIR --cluster clustering/clustering.json --outdir YOUR_OUTPUT_DIR --use-assembly --max-file-size 7000000 --redis-port 7777
```

> Important: the script looks for `.cif` or `.cif.gz` files in the directory; make sure your files match this extension and format.

> By default we skip a few very large files; you can change the threshold with the `--max-file-size` flag, or remove the flag entirely to process every file.
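
To preview which files fall above the default cutoff (and would therefore be skipped), a `find` expression along these lines works (`PATH_TO_MMCIF_DIR` as above):

```bash
# List mmCIF files larger than 7,000,000 bytes, the default cutoff used above.
find PATH_TO_MMCIF_DIR -type f \( -name '*.cif' -o -name '*.cif.gz' \) -size +7000000c
```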

#### Step 8: Ready!

You're ready to start training the model on your data. Make sure to modify the config to point to the paths you created in the previous two steps. If you have any questions, don't hesitate to open an issue or reach out on our community Slack channel.
1 change: 1 addition & 0 deletions scripts/process/README.md
@@ -0,0 +1 @@
Please see our [data processing instructions](docs/training.md).