This is the source code for the paper "Persistent spectral theory-guided protein engineering" by Yuchi Qiu and Guo-Wei Wei, Nature Computational Science.
TopFit is a persistent spectral theory (PST)-based machine learning model that navigates the protein fitness landscape. It integrates PST-based features with sequence-based features and is equipped with an ensemble regression. The PST-based features capture mutation-induced local 3D structural changes, whereas sequence-based features handle more general cases without requiring 3D structures. The ensemble regression integrates various regression models, automatically optimized by Bayesian optimization, to generalize well across training sets of various sizes.
The major modules of TopFit are implemented in Python 3.6.6 within a virtualenv with the following packages:
- pytorch=1.10.0
- scikit-learn=0.23.1
- xgboost=1.5.0
- scipy=1.4.1
- biopython=1.79
- numpy=1.19.5
- h5py=2.10.0
- hyperopt=0.2.2
- pandas=1.1.2
- pickle=4.0
- tqdm=4.32.1
- gudhi=3.3.0
The above packages are required by the PST, TAPE, ESM, and ensemble regression modules. The installation takes approximately a few minutes.
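For instance, the environment can be set up along the following lines (a sketch: `torch` is the pip name for pytorch, `pickle` ships with Python itself, and GPU-enabled builds may require a different install source):
```
python3 -m venv topfit_env
source topfit_env/bin/activate
pip install torch==1.10.0 scikit-learn==0.23.1 xgboost==1.5.0 scipy==1.4.1 \
    biopython==1.79 numpy==1.19.5 h5py==2.10.0 hyperopt==0.2.2 \
    pandas==1.1.2 tqdm==4.32.1 gudhi==3.3.0
```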
In addition to the packages listed above, the modules below need extra software or packages:
The PST module requires VMD and Jackal for structure data processing. Please install them and add the paths of their executable files to the `software_path()` function in `src/filepath_dir.py`. The high-dimensional Laplacians are not used in our default setting, but they are available. Please install HERMES and update the path of its executable file in `software_path()` to enable calculation of high-dimensional Laplacians.
Follow the instructions to install TAPE and download the pretrained weights. tensorflow=1.13.0 also needs to be installed. Please set the path of its pretrained parameters in `TAPE_MODEL_LOCATIONS` in `src/filepath_dir.py`; its default value is `tape-neurips2019/pretrained_models/`.
Follow the instructions to install ESM, and download the pretrained weights for both the esm1b and esm1v models.
The two modules below require different environments than the above. Please create the corresponding conda environment when using them.
- DeepSequence VAE is implemented in Python 2.7. Build the conda environment for it: `conda env create -f deep_sequence.yml`. Also follow the instructions to install this repo, and write the repo path in the variable `WORKING_DIR` in both `src/vae_train.py` and `src/vae_inference.py`. This package is implemented in Theano; please set up the CUDA environment for it to accelerate VAE model training.
- eUniRep and UniRep: we implemented them using code from this work with tensorflow=2.2.0. Build the conda environment: `conda env create -f protein_fitness_prediction.yml`. Please set the path of its pretrained parameters in `UNIREP_PATH` in `src/filepath_dir.py`; its default value is `unirep_weights/`.
First, run `src/PST_embedding.py $dataset $a $b` to generate embeddings for the dataset given in `data/$dataset/data.csv`. The first parameter `$dataset` is the name of the dataset. The second parameter `$a` and third parameter `$b` give the range of mutational entries to run.
For example, run:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
a=0
b=319
python3 src/PST_embedding.py $dataset $a $b
```
This example runs PST on the entire dataset, with indices from 0 to 319. The range can be broken down to run one entry at a time for parallelization by taking `a=n` and `b=n+1` for each job (see the sketch below). Usually, one entry takes a few seconds to run.
After `PST/generatePST.py` has traversed all entries in the dataset, run
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
python src/merge_PST.py $dataset
```
to collect the feature matrix for the dataset in `Features/$dataset/unnorm/`; the standardized features, computed using `StandardScaler()` from scikit-learn, are stored in `Features/$dataset/norm/`. The regression model uses standardized features by default. `PST.npy` and `PH.npy` contain the persistent spectral theory and persistent homology features, respectively. The `--feature_list` argument provides more options for L1, L2 features. The default option calculates the necessary features for the PST and PH features in the main text.
The embedding feature matrix is stored in `Features/$dataset/unnorm/`, and its standardized matrix is stored in `Features/$dataset/norm/`. The evolutionary scores are stored in `inference/$dataset/`; each entry in the dataset has a score. Many functions for this module were rewritten from this work.
Please install ESM and download the pretrained model weights. We implemented both the esm1b and esm1v models. esm1v provides five models trained with different random seeds, and we picked the first model. The embedding generation can be run, for example:
```
python src/esm_embedding.py $dataset $model
```
`$model` can be either `esm1b` or `esm1v`. The output file is stored in `Features/$dataset/norm/$model.npy`.
The evolutionary score can be computed with:
```
python src/esm_inference.py $dataset --model $model
```
The output file is stored in `inference/$dataset/$model_pll.csv`.
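For instance, for the demo dataset with the esm1b model:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
model=esm1b
python src/esm_embedding.py $dataset $model          # -> Features/$dataset/norm/esm1b.npy
python src/esm_inference.py $dataset --model $model  # -> inference/$dataset/esm1b_pll.csv
```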
tensorflow=1.13.0 needs to be installed. Please install TAPE and download the pretrained model weights.
The embedding can be run:
```
model=lstm
python src/tape_embedding.py $dataset --model $model
```
Five models are implemented in TAPE for `$model`: `resnet`, `bepler`, `unirep`, `transformer`, `lstm`. The output file name is `tape_$model.npy`.
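For instance, all five TAPE embeddings can be generated in one loop:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
for model in resnet bepler unirep transformer lstm; do
    python src/tape_embedding.py $dataset --model $model  # -> tape_$model.npy
done
```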
UniRep is implemented both in TAPE and in the UniRep package below; we generate embeddings from both, but we only test the original UniRep embedding in this work.
The embedding can be run:
```
python src/eunirep_embedding.py $dataset --model $model
```
For UniRep, take `model=gunirep`; for eUniRep, take `model=eunirep`.
The evolutionary score can be run, for example:
```
python src/eunirep_inference.py $dataset --model $model
```
For UniRep, take `model=gunirep`; for eUniRep, take `model=eunirep`.
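For instance, for the original UniRep model on the demo dataset:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
model=gunirep   # original UniRep
python src/eunirep_embedding.py $dataset --model $model
python src/eunirep_inference.py $dataset --model $model
```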
The fine-tuned eUniRep model is trained on the MSA of the target protein. The pretrained weights are stored in `unirep_weights/$uniprot/` after training. To train it, run:
```
uniprot=YAP1_HUMAN
seqs_fasta_path=alignments/$uniprot.a2m
save_weights_dir=unirep_weights/$uniprot
python src/unirep_evotune.py $seqs_fasta_path $save_weights_dir
```
`$seqs_fasta_path` is the MSA file in `.a2m` format. `$save_weights_dir` is the directory to save the pretrained weights. NOTICE: the fine-tuning process is very slow and requires a GPU and a large amount of memory. We ran it on an NVIDIA V100 GPU node with 100 GB of memory allocated and a 2-day time limit for the job.
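As an illustration, a SLURM submission script matching those resource limits might look as follows (a sketch; the `--gres` GPU specification and other directives are cluster-specific assumptions):
```
#!/bin/bash
#SBATCH --gres=gpu:v100:1   # one NVIDIA V100 GPU
#SBATCH --mem=100G          # 100 GB memory
#SBATCH --time=2-00:00:00   # 2-day limit
uniprot=YAP1_HUMAN
python src/unirep_evotune.py alignments/$uniprot.a2m unirep_weights/$uniprot
```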
The VAE is trained on the MSA of the target protein by running:
```
uniprot=YAP1_HUMAN
THEANO_FLAGS='floatX=float32,device=cuda' python src/vae_train.py alignments/"$uniprot".a2m "$uniprot"_seed$seed $seed
```
The pretrained weights are stored in `params/`, located in the directory of the DeepSequence repo.
To obtain the evolutionary scores, run:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
uniprot=YAP1_HUMAN
THEANO_FLAGS='floatX=float32,device=cuda' python src/vae_inference.py $uniprot data/"$dataset"/seqs.fasta data/"$dataset"/wt.fasta inference/"$dataset"/ --seed $seed
```
`$seed` is the random seed. We ran 5 random seeds with values 1, 2, 3, 4, 5. The resulting evolutionary scores are stored in `inference/$dataset/` with the name `elbo_seed$seed.npy`. After obtaining scores from all random seeds, run `python src/merge_elbo.py` to average the predictions from all seeds; the average is saved in `elbo.npy`.
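For instance, the five seeds can be run in a single loop before merging (a sketch combining the commands above):
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
uniprot=YAP1_HUMAN
for seed in 1 2 3 4 5; do
    THEANO_FLAGS='floatX=float32,device=cuda' python src/vae_train.py \
        alignments/"$uniprot".a2m "$uniprot"_seed$seed $seed
    THEANO_FLAGS='floatX=float32,device=cuda' python src/vae_inference.py \
        $uniprot data/"$dataset"/seqs.fasta data/"$dataset"/wt.fasta \
        inference/"$dataset"/ --seed $seed
done
python src/merge_elbo.py   # averages elbo_seed$seed.npy across seeds into elbo.npy
```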
In addition, evolutionary scores for most of the datasets can be obtained directly from the DeepSequence work. They are stored in `inference/$dataset/vae_elbo.csv`.
The ensemble regression integrates multiple regressors with optimized hyperparameters. The top-performing models are selected, and their predictions are averaged to form the ensemble. To run the regression, first prepare the list of models. An example with the full list of available models can be found in `Inputs/RegressorPara.csv`. Model type `ANN_deep` has more hidden units at each layer than model type `ANN`. A demo using ridge regression is `Inputs/RegressorPara_Ridge.csv`. The implementation of the ensemble regression was rewritten from the code in the MLDE package.
For example, run a demo (finishing within 5 minutes) as follows:
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
encoder=vae+PST+esm1b
n_train=240
reg_para=Inputs/RegressorPara_Ridge.csv
seed=0
python src/evaluate_singleprocess.py $dataset $encoder --seed $seed --n_train $n_train --reg_para $reg_para
```
`$encoder` is the feature name. Different types of features are separated by `+`; for example, `vae+PST+esm1b` combines the `vae`, `PST`, and `esm1b` features. `$seed` is the random seed for the train/test split. `$n_train` is the number of training samples; `-1` indicates an 80/20 split. `--reg_para` is the file with the model list.
To run 5-fold cross validation, use `src/evaluate_cv5.py` instead.
There are two types of output.
- The results are saved in the directory `results/$dataset/` for various evaluation metrics, including Spearman correlation and NDCG. Results from different seeds are saved in distinct `.csv` files with unique IDs, which avoids conflicts when running multiple seeds in parallel. After all seeds have finished, run `src/merge_results.py $dataset` to merge all result files into a single `.csv` file.
- The other output is point-wise predictions for each single entry, saved in `.npy` format. Use the `--save_pred` option to get the detailed predictions for each entry in the test set. The ensemble regression on NMR structures can average the predictions from multiple NMR models. For detailed descriptions, please run
```
python src/evaluate_singleprocess.py --help
```
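For instance, several train/test splits can be evaluated in a loop and then merged (a sketch using the demo settings above):
```
dataset=YAP1_HUMAN_Fields2012-singles-linear
encoder=vae+PST+esm1b
reg_para=Inputs/RegressorPara_Ridge.csv
for seed in 0 1 2 3 4; do
    python src/evaluate_singleprocess.py $dataset $encoder \
        --seed $seed --n_train 240 --reg_para $reg_para
done
python src/merge_results.py $dataset   # merges the per-seed .csv result files
```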
Users can create their own data to run. Please follow the steps below to create the necessary inputs.
- Find the UniProt ID (`$uniprot`) and a high-quality structure (`$PDB`) for the target dataset (`$dataset`). Determine the chain name for the corresponding sequence (`$chain`). Find the resolution of the PDB data (`$resolution`) and the experimental method (`$method`). If it is an NMR structure, find the number of structure models (`$n_structure`) given in the NMR data. Then add a new entry to several dictionaries in `src/filepath_dir.py`:
  - `Chain_dir`: key=`$uniprot`, value=`$chain`.
  - `Uniprot_to_PDB`: key=`$uniprot`, value=`$PDB`.
  - `dataset_to_uniprot`: key=`$dataset`, value=`$uniprot`.
  - `STUCTURE_TYPE`: key=`$dataset`, value=`$method`.
  - `STUCTURE_RESOLUTION`: key=`$dataset`, value=`$resolution`.
  - (if `$method` is NMR) `dataset_to_n_structure`: key=`$dataset`, value=`$n_structure`.
- Create sequence data in `data/$dataset/`. Refer to the format in the given sample; a hypothetical example is sketched after this item.
  - `data.csv`: DMS dataset (the wildtype sequence is excluded) with columns:
    - `seq`: sequence for each mutation.
    - `log_fitness`: fitness value.
    - `n_mut`: number of mutational sites.
    - `mutant`: mutation information separated by commas (`,`); for example, `T25R,E56L` means there are two mutational sites: the first mutation is on residue 25, with wildtype residue T and mutant residue R, and the second is on residue 56, with wildtype residue E and mutant residue L.
  - `seqs.fasta`: the list of sequences, in the order of the `seq` column in `data.csv`, in FASTA format. Each sequence takes two lines: the first line uses the format `>id_$id` (`$id` is the index of the sequence) and the second line is the sequence.
  - `wt.fasta`: wildtype sequence in FASTA format.
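  As an illustration, a `data.csv` with one single and one double mutant might look like the following (hypothetical sequences truncated with `...` and made-up fitness values; note that a multi-site `mutant` field presumably needs CSV quoting, since it contains commas):
```
seq,log_fitness,n_mut,mutant
MKTAYIAKQR...,0.52,1,T25R
MKTAYIAKQR...,-1.13,2,"T25R,E56L"
```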
- Create optimized structure data in `structure_dataset/$dataset/processed_PDB/`.
  - First, download the PDB file (the `$PDB.pdb` file) from the PDB database or AlphaFold to `structure_dataset/$dataset/original_PDB/`, and add `wt.fasta` for the wildtype sequence and the `$PDB.fasta` sequence file of the downloaded PDB entry to `original_PDB/`.
  - Use Jackal and VMD to optimize and revise the structure. In particular, when `wt.fasta` and `$PDB.fasta` are not identical, the structure data needs to be mutated to match the wildtype sequence. The precise processing functions are provided for each dataset in `structure_dataset/$dataset/get_PDB.py`. Please refer to our paper for the structure optimization details.
- Create the alignment file stored in `alignments/$dataset/`. We use the EVcouplings server to generate the `.a2m` MSA file. We search over the `UniRef100` database, and the `Bitscore` is adjusted to ensure that at least 70% of residues are covered. See the manuscript or the DeepSequence paper for details of MSA creation.
- The pretrained weights for fine-tuned eUniRep models are stored in `unirep_weights/$uniprot/`. PST, PH, and sequence-based embeddings are all stored in `Features/$dataset/`. The evolutionary scores are stored in `inference/$dataset/`.
Run `./download_data_demo.sh` to download input and output data for the `YAP1_HUMAN` dataset.
All source data are available at this link. `Data.tar.gz` contains sequence data, structure data, evolutionary scores, and alignment files for all datasets. `Features.tar.gz` contains normalized embedding features for all datasets. `unirep_weights.tar.gz` contains pretrained and fine-tuned weights for the eUniRep model.
[1] Persistent spectral theory-guided protein engineering, by Yuchi Qiu and Guo-Wei Wei, Nature Computational Science.
[2] Learning protein fitness models from evolutionary and assay-labelled data, Nature Biotechnology, 2022 (a model integrating multiple sequence-based features).
[3] Informed training set design enables efficient machine learning-assisted directed protein evolution, Cell Systems, 2021 (a work using ensemble regression).
[4] Deep generative models of genetic variation capture the effects of mutations, Nature Methods, 2018 (DeepSequence VAE).
[5] Low-N protein engineering with data-efficient deep learning, Nature Methods, 2021 (eUniRep model).
[6] Evaluating Protein Transfer Learning with TAPE, arXiv, 2019 (TAPE model).
[7] Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences, PNAS, 2021 (ESM1b transformer).
[8] Language models enable zero-shot prediction of the effects of mutations on protein function, bioRxiv, 2021 (ESM1v transformer).