Merge pull request #105 from jwohlwend/pipeline

Preprocessing pipeline

jwohlwend authored Dec 21, 2024
2 parents 2d9d45b + ff681f2 commit e75cbdc

Showing 10 changed files with 2,203 additions and 3 deletions.
190 changes: 188 additions & 2 deletions docs/training.md
@@ -1,8 +1,41 @@
# Training

## Download the pre-processed data

To run training, you will need to download a few pre-processed datasets. Note that you will need ~250G of storage for all the data; a combined download sketch follows the list below. If you would instead like to re-run the preprocessing pipeline or process your own raw data for training, please see the [instructions](#processing-raw-data) at the bottom of this page.

- The pre-processed RCSB (i.e., PDB) structures:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_processed_targets.tar
tar -xf rcsb_processed_targets.tar
rm rcsb_processed_targets.tar
```

- The pre-processed RCSB (i.e., PDB) MSAs:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_processed_msa.tar
tar -xf rcsb_processed_msa.tar
rm rcsb_processed_msa.tar
```

- The pre-processed OpenFold structures:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_processed_targets.tar
tar -xf openfold_processed_targets.tar
rm openfold_processed_targets.tar
```

- The pre-processed OpenFold MSAs:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_processed_msa.tar
tar -xf openfold_processed_msa.tar
rm openfold_processed_msa.tar
```

- The pre-computed symmetry files for ligands:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/symmetries.pkl
```
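
For convenience, here is a minimal sketch that scripts all of the downloads above in one pass (same URLs as the individual commands; assumes `wget` is available and ~250G of free space):

```bash
#!/usr/bin/env bash
set -euo pipefail

BASE=https://boltz1.s3.us-east-2.amazonaws.com

# Fetch and unpack each pre-processed archive, deleting the tar afterwards
# to reclaim space.
for name in rcsb_processed_targets rcsb_processed_msa \
            openfold_processed_targets openfold_processed_msa; do
    wget "${BASE}/${name}.tar"
    tar -xf "${name}.tar"
    rm "${name}.tar"
done

# The ligand symmetry file is a single pickle; no extraction needed.
wget "${BASE}/symmetries.pkl"
```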

## Modify the configuration file

@@ -45,3 +78,156 @@ Once that seems to run okay, you can kill it and launch the training run:
We also provide a different configuration file to train the confidence model:

python scripts/train/train.py scripts/train/configs/confidence.yaml


## Processing raw data

We have already pre-processed the training data for the PDB and the OpenFold self-distillation set. However, if you'd like to replicate the processing pipeline or process your own data for training, you can follow the instructions below.


#### Step 1: Go to the processing folder

```bash
cd scripts/process
```

#### Step 2: Install requirements

Install the few extra dependencies required for processing:

```bash
pip install -r requirements.txt
```

You must also install two external tools, `mmseqs` and `redis`. Installation instructions are linked below:

- `mmseqs`: https://github.com/soedinglab/mmseqs2?tab=readme-ov-file#installation
- `redis`: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/

#### Step 3: Preprocess the CCD dictionary


We have already done this for you; the relevant file can be downloaded here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.pkl
```

Unless you wish to do it again yourself, you can skip to the next step! If you do want to recreate the file, you can do so with the following commands:

```bash
wget https://files.wwpdb.org/pub/pdb/data/monomers/components.cif
python ccd.py --components components.cif --outdir ./ccd
```

> Note: `ccd.py` runs in parallel by default with as many threads as there are CPU cores on your machine; this can be changed with the `--num_processes` flag.
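
For example, to cap the pool at 8 worker processes (using only the flags documented above):

```bash
python ccd.py --components components.cif --outdir ./ccd --num_processes 8
```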

#### Step 4: Create sequence clusters

First, you must create a fasta file containing all the polymer sequences present in your data. You can use any header format you want for the sequences; the headers are not used.

For the PDB, this file can be downloaded directly:
```bash
wget https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
gunzip pdb_seqres.txt.gz
```

> Note: for the OpenFold data, since the sequences were chosen for diversity, we do not apply any clustering.

When this is done, you can run the clustering script, which assigns proteins to 40% sequence-similarity clusters and gives each unique RNA/DNA sequence its own cluster. Each ligand CCD code is also assigned to its own cluster.

```bash
python cluster.py --ccd ccd.pkl --sequences pdb_seqres.txt --mmseqs PATH_TO_MMSEQS_EXECUTABLE --output ./clustering
```

> Note: you must install mmseqs (see: https://github.com/soedinglab/mmseqs2?tab=readme-ov-file#installation)
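
As a point of reference only (this is not necessarily the exact invocation `cluster.py` performs internally), clustering protein sequences at 40% identity is the kind of job `mmseqs` handles directly:

```bash
# Illustrative mmseqs call: cluster protein sequences at 40% identity.
# cluster.py wraps similar functionality together with the RNA/DNA and
# ligand cluster assignments described above.
mmseqs easy-cluster pdb_seqres.txt cluster_results tmp --min-seq-id 0.4
```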

#### Step 5: Create MSAs

We have already computed MSAs for all sequences in the PDB at the time of training, using the ColabFold `colabfold_search` tool. You can set up your own local ColabFold using the instructions provided here: https://github.com/YoshitakaMo/localcolabfold

The raw MSAs for the PDB can be found here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/rcsb_raw_msa.tar
tar -xf rcsb_raw_msa.tar
rm rcsb_raw_msa.tar
```
> Note: this file is 130G, and will take another 130G to extract before you can delete the original tar archive, so make sure you have enough storage on your machine.

You can also download the raw OpenFold MSAs here:
```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/openfold_raw_msa.tar
tar -xf openfold_raw_msa.tar
rm openfold_raw_msa.tar
```

> Note: this file is 88G, and will take another 88G to extract before you can delete the original tar archive, so make sure you have enough storage on your machine.

If you wish to use your own MSAs, ensure that each file is named with the hash of its query sequence, according to the following function:
```python
import hashlib
def hash_sequence(seq: str) -> str:
    """Hash a sequence."""
    return hashlib.sha256(seq.encode()).hexdigest()
```
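
The same hash can be computed in the shell to rename an MSA file for its query sequence. A minimal sketch, assuming GNU coreutils; `SEQ` and `query.a3m` are placeholders:

```bash
# Compute the SHA-256 of the raw sequence bytes; printf '%s' avoids a
# trailing newline, matching the Python function above.
SEQ="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder query sequence
mv query.a3m "$(printf '%s' "$SEQ" | sha256sum | cut -d' ' -f1).a3m"
```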

#### Step 6: Process MSAs

During MSA processing, among other things, we annotate sequences with their taxonomy ID, which is important for MSA pairing during training. This annotation is applied only to MSA sequences whose headers start with the following:

```
>UniRef100_UNIREFID
...
```

This is the header format produced by ColabFold. If you use a different MSA pipeline, make sure your UniRef MSAs follow the format above.
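
A quick way to sanity-check an MSA before processing is to count how many of its sequences will receive taxonomy annotation (`my_msa.a3m` is a placeholder file name):

```bash
# Count headers eligible for taxonomy annotation, i.e. UniRef100 entries.
grep -c '^>UniRef100_' my_msa.a3m
```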

Next, you should download our provided taxonomy database and place it in the current folder:

```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/taxonomy.rdb
```

You can now process the raw MSAs. First, launch a Redis server; we use Redis to share the large taxonomy dictionary across workers, so MSA processing can run in parallel without blowing up RAM usage.

```bash
redis-server --dbfilename taxonomy.rdb --port 7777
```

Please wait a few minutes for the DB to initialize. It will print `Ready to accept connections` when ready.
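
You can confirm the server is up from another shell before proceeding; `redis-cli` ships with Redis and should reply `PONG`:

```bash
redis-cli -p 7777 ping
```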

> Note: You must have redis installed (see: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)

In a separate shell, run the MSA processing script:
```bash
python msa.py --msadir YOUR_MSA_DIR --outdir YOUR_OUTPUT_DIR --redis-port 7777
```

> Important: the script looks for `.a3m` or `.a3m.gz` files in the directory; make sure your files match this extension and format.
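
To verify that the script will actually pick up your files, you can count the matching entries first (`YOUR_MSA_DIR` as above):

```bash
# Only files with these extensions are picked up by the processing script.
find YOUR_MSA_DIR -name '*.a3m' -o -name '*.a3m.gz' | wc -l
```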

#### Step 7: Process structures

Finally, we're ready to process the structural data. We provide two scripts: one for the PDB and one for the OpenFold data. In general, we recommend using the `rcsb.py` script for your own data, which is expected to be in `mmCIF` format.

You can download the full RCSB using the instructions here:
https://www.rcsb.org/docs/programmatic-access/file-download-services
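
For reference, RCSB's documented rsync service can mirror the full mmCIF archive; a typical command (per the page above; the destination directory is up to you) looks roughly like:

```bash
# Mirror the complete divided mmCIF archive from RCSB (very large download).
rsync -rlpt -v -z --delete --port=33444 \
    rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF
```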

As in Step 6, launch a Redis server, this time loading a pre-processed CCD database that we provide:

```bash
wget https://boltz1.s3.us-east-2.amazonaws.com/ccd.rdb
redis-server --dbfilename ccd.rdb --port 7777
```
> Note: You must have redis installed (see: https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)

In a separate shell, run the processing script; make sure to use the `clustering/clustering.json` file you created previously.
```bash
python rcsb.py --datadir PATH_TO_MMCIF_DIR --cluster clustering/clustering.json --outdir YOUR_OUTPUT_DIR --use-assembly --max-file-size 7000000 --redis-port 7777
```

> Important: the script looks for `.cif` or `.cif.gz` files in the directory; make sure your files match this extension and format.

> By default we skip a few very large files; you can change the threshold with the `--max-file-size` flag, or remove the flag entirely to process every file.
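
To preview which files fall above the default cutoff (and would therefore be skipped), a `find` expression along these lines works (`PATH_TO_MMCIF_DIR` as above):

```bash
# List mmCIF files larger than 7,000,000 bytes, the default cutoff used above.
find PATH_TO_MMCIF_DIR -type f \( -name '*.cif' -o -name '*.cif.gz' \) -size +7000000c
```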

#### Step 8: Ready!

You're ready to start training the model on your data. Make sure to modify the config to point to the paths you created in the previous two steps. If you have any questions, don't hesitate to open an issue or reach out on our community Slack channel.
1 change: 1 addition & 0 deletions scripts/process/README.md
@@ -0,0 +1 @@
Please see our [data processing instructions](docs/training.md).