ProtFill is an inpainting model for protein sequence and structure co-design that works on antibodies as well as other proteins.
Our model uses custom GVPe message passing layers, which are a modification of GVP with edge updates.
cd protfill
conda create --name protfill python=3.10
conda activate protfill
python -m pip install .
python -m pip install torch_geometric torch_scatter
The datasets can be downloaded from proteinflow.
proteinflow download --tag 20230102_stable
proteinflow download --tag 20230626_sabdab --skip_splitting
rm -r data/proteinflow_20230626_sabdab/splits_dict/
cp -r data/splits_dict data/proteinflow_20230626_sabdab/
proteinflow split --tag 20230626_sabdab
There are four models in this repository, and each can be tested or retrained with its corresponding config file. The table below explains the differences between them. The noising scheme is either replacing the masked data with samples from a Gaussian distribution (standard) or corrupting it with noise (alternative).
Name | Dataset | Diffusion | Noising scheme |
---|---|---|---|
protfill_ab | antibody | no | standard |
proftilldiff | antibody | yes | standard |
protfill_ppi_standard_noising | diverse | no | standard |
protfill_ppi_alternative_noising | diverse | no | alternative |
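
To make the difference between the two noising schemes concrete, here is a minimal sketch in plain Python. It is not the ProtFill implementation: the function name `noise_masked`, the `sigma` parameter, and the reading of "corrupting with noise" as additive Gaussian noise are all assumptions made for illustration.

```python
import random


def noise_masked(values, mask, scheme="standard", sigma=1.0, seed=None):
    """Return a copy of `values` with the masked entries noised.

    Hypothetical helper, not part of ProtFill.
    standard:    masked entries are replaced by draws from N(0, sigma^2)
    alternative: masked entries keep their value and get N(0, sigma^2)
                 noise added (an assumed reading of "corrupting with noise")
    """
    if scheme not in ("standard", "alternative"):
        raise ValueError(f"unknown noising scheme: {scheme!r}")
    rng = random.Random(seed)
    out = []
    for value, is_masked in zip(values, mask):
        if not is_masked:
            # Positions outside the mask are left untouched.
            out.append(value)
        elif scheme == "standard":
            # The original value is discarded entirely.
            out.append(rng.gauss(0.0, sigma))
        else:
            # The original value is kept as the centre of the noise.
            out.append(value + rng.gauss(0.0, sigma))
    return out


if __name__ == "__main__":
    coords = [1.5, -0.3, 2.2, 0.7]
    mask = [False, True, True, False]
    print(noise_masked(coords, mask, scheme="standard", seed=0))
    print(noise_masked(coords, mask, scheme="alternative", seed=0))
```

In this sketch the standard scheme removes all information about the masked values, while the alternative scheme leaves a noisy trace of them.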
To retrain one of the models, run the following command, replacing NAME with one of the model names from the table.
protfill --config configs/train/NAME.yaml --dataset_path DATASET_PATH
For example, to retrain the antibody model on the SAbDab dataset:
protfill --config configs/train/protfill_ab.yaml --dataset_path data/proteinflow_20230626_sabdab
To test one of our pre-trained models on the 'easy' test subset, run the following command.
protfill --config configs/test/NAME.yaml --dataset_path DATASET_PATH --easy_test
To test on the 'hard' subset, replace `--easy_test` with `--hard_test`. To test on a specific CDR, add e.g. `--redesign_cdr H3`. Note that the 'hard' antibody subset does not contain light chains, and the diverse dataset does not have CDRs or an 'easy' test subset.
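
If you launch several training or test runs from a script, the commands above can be assembled programmatically. The sketch below is a hypothetical helper (`build_protfill_command` is not part of ProtFill) that only uses the flags and config paths shown in this README. It assumes that the subset and CDR flags apply to test runs only.

```python
def build_protfill_command(config_name, dataset_path, mode="test",
                           subset=None, cdr=None):
    """Build the argument list for a `protfill` train or test run.

    Hypothetical helper: it only strings together the options
    documented in this README and does not run anything itself.
    """
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    if mode == "train" and (subset is not None or cdr is not None):
        # Assumption: subset and CDR selection are test-time options.
        raise ValueError("subset and cdr are only used with mode='test'")
    if subset not in (None, "easy", "hard"):
        raise ValueError("subset must be 'easy', 'hard' or None")

    argv = [
        "protfill",
        "--config", f"configs/{mode}/{config_name}.yaml",
        "--dataset_path", dataset_path,
    ]
    if subset is not None:
        argv.append(f"--{subset}_test")  # --easy_test or --hard_test
    if cdr is not None:
        argv += ["--redesign_cdr", cdr]
    return argv


if __name__ == "__main__":
    cmd = build_protfill_command(
        "protfill_ab", "data/proteinflow_20230626_sabdab",
        subset="easy", cdr="H3",
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once ProtFill is installed in the active environment.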
To redesign part of a new file, run the command below. The file can have either a `.pdb` or a `.pickle` extension; pickle files are those generated by `proteinflow`.
protfill --config configs/test/NAME.yaml --redesign_file 7kgk.pdb
By default, this command redesigns a random part of the protein. To redesign specific positions, use the `--redesign_positions` option. Its argument has the format `chain:start1-end1,start2-end2`, e.g. `A:5-10,20-21,30-40`. The numbering is 0-indexed and each range is half-open: the start is included in the selected slice and the end is not. For PDB files, the chain name is the author chain name. For pickle files, the numbering should be based on the FASTA chain. If the file was generated with `proteinflow` with CDR information, the command can also be used with the `--redesign_cdr CDR` option to redesign a specific CDR, e.g. `--redesign_cdr H3`.
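
The 0-indexed, end-exclusive convention is easy to get wrong, so here is a small Python sketch that expands a `--redesign_positions` argument into the residue indices it selects. The function `parse_redesign_positions` is a hypothetical illustration of the format described above, not ProtFill's own parser.

```python
def parse_redesign_positions(spec):
    """Expand 'chain:start1-end1,start2-end2' into (chain, indices).

    Hypothetical helper. Indices are 0-based; each start is included
    and each end is excluded, like a Python slice.
    """
    chain, sep, ranges = spec.partition(":")
    if not sep or not chain or not ranges:
        raise ValueError(f"expected 'chain:start-end,...', got {spec!r}")

    indices = []
    for part in ranges.split(","):
        start_text, dash, end_text = part.partition("-")
        if not dash:
            raise ValueError(f"range {part!r} is missing a '-'")
        start, end = int(start_text), int(end_text)
        if start < 0 or end <= start:
            raise ValueError(f"range {part!r} selects no positions")
        # range() is half-open, which matches the documented convention.
        indices.extend(range(start, end))
    return chain, indices


if __name__ == "__main__":
    chain, indices = parse_redesign_positions("A:5-10,20-21,30-40")
    print(chain, indices)
```

For `A:5-10,20-21,30-40` this selects positions 5 to 9, position 20, and positions 30 to 39 on chain A, which is 16 residues in total.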