It is an upgraded version of cluster learning-assisted directed evolution (CLADE). In CLADE 2.0, evolutionary scores are used to initiate sampling; later sampling iterations use the available labeled data to update the sampling probabilities and the clustering architecture. The last step of CLADE 2.0 uses MLDE to exploit fitness.
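The loop described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function and helper names are hypothetical, a softmax weighting is assumed for turning scores into sampling probabilities, and the clustering hierarchy (sampling within clusters, adding hierarchy levels) is omitted for brevity.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into sampling probabilities (assumed weighting)."""
    z = scores - scores.max()
    p = np.exp(z)
    return p / p.sum()

def clade2_sketch(evo_scores, measure_fitness, n_rounds=4, batch=96, rng=None):
    """Hypothetical sketch of the CLADE 2.0 sampling loop.

    evo_scores      : per-variant evolutionary scores (first-round prior)
    measure_fitness : callable returning fitness labels for chosen variants
    The real method samples within an updated cluster hierarchy; this
    sketch only shows the score-driven -> fitness-driven probability update.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(evo_scores)
    probs = softmax(np.asarray(evo_scores, dtype=float))  # round 1: evolutionary prior
    labeled_idx, labeled_fit = [], []
    for _ in range(n_rounds):
        picks = rng.choice(n, size=batch, replace=False, p=probs)
        labeled_idx.extend(picks.tolist())
        labeled_fit.extend(measure_fitness(picks))
        # later rounds: re-weight sampling by the fitness observed so far
        fit = np.zeros(n)
        fit[labeled_idx] = labeled_fit
        probs = softmax(fit)
    return labeled_idx, labeled_fit  # training data handed to MLDE
```

The labeled variants collected this way are what the final MLDE step trains on.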
First, download this repo. Then install MLDE for the supervised learning model:

```
cd CLADE/
git clone --recurse-submodules https://github.com/fhalab/MLDE.git
```
Other required packages:
- Python 3.6 or later
- scikit-learn
- numpy
- pandas
- pickle (Python standard library)
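A quick way to confirm the environment is set up is to import the packages above (a sanity-check snippet, not part of the repo):

```python
# Sanity check that the required packages are importable.
# pickle ships with the Python standard library, so no extra install is needed.
import sys
assert sys.version_info >= (3, 6), "Python 3.6 or later is required"

import pickle
import numpy as np
import pandas as pd
import sklearn

print("numpy", np.__version__)
print("pandas", pd.__version__)
print("scikit-learn", sklearn.__version__)
```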
Input data for dataset `$dataset` must be stored in `Input/$dataset/`. Run `./download_data_demo.sh` to obtain all required input data, which include:

- `$dataset.xlsx`: data for sequences and their experimental fitness.
- `$dataset_$encoding.npy`: feature matrix encoding the sequences. Current `encoding` options: `AA` and `Georgiev` for physicochemical encoding, and `zero` for the ensemble evolutionary score.
- `ComboToIndex_$dataset.pkl`: dictionary mapping each sequence to its ID, in the order given in the `$dataset.xlsx` file.
- `MldeParameters.csv`: parameters for MLDE, stored in `Input/`.
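The three per-dataset files can be loaded with pandas/numpy/pickle. This is an illustrative helper, not code from the repo; `GB1`/`AA` are example choices and the directory layout follows the naming pattern above.

```python
import pickle
from pathlib import Path

import numpy as np
import pandas as pd

def load_inputs(dataset, encoding, root="Input"):
    """Load the input files described above for one dataset.

    Returns (fitness table, feature matrix, sequence-to-index dict).
    """
    base = Path(root) / dataset
    data = pd.read_excel(base / f"{dataset}.xlsx")          # sequences + fitness
    features = np.load(base / f"{dataset}_{encoding}.npy")  # encoded sequences
    with open(base / f"ComboToIndex_{dataset}.pkl", "rb") as fh:
        combo_to_index = pickle.load(fh)                    # sequence -> row id
    return data, features, combo_to_index

# e.g. data, X, c2i = load_inputs("GB1", "AA")
```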
Scripts for generating the evolutionary (zero-shot) scores:

- DeepSequence VAE is implemented in Python 2.7 using THEANO. Use the conda environment file `scripts/deep_sequence.yml` to set up the environment. The script for training the VAE model is given in `scripts/script_vae_train.sub`; the script for calculating ELBO scores is given in `scripts/script_vae.sub`.
- EVmutation and MSA Transformer scores are obtained from the implementation in MLDE.
- HMM scores can be generated using `scripts/script_hmm.sub`.
- ESM-1v scores can be generated using `scripts/script_esm.sub`.
`clustering_sampling.py`

Use hierarchical clustering to generate training data.

```
$ python3 clustering_sampling.py --help
```
```
K_increments          Increments of clusters at each hierarchy; input a list;
                      for example: --K_increments 10 0 10 10
--dataset DATASET     Name of the dataset. Options: 1. GB1; 2. PhoQ.
--encoding_ev ENCODING_EV
                      Encoding method used for initial sampling. Default: zero
--encoding ENCODING   Encoding method used for late-stage sampling and the
                      supervised model. Options: 1. AA; 2. Georgiev. Default: AA
--num_first_round NUM_FIRST_ROUND
                      Number of variants in the first-round sampling. Default: 96
--batch_size BATCH_SIZE
                      Batch size: number of variants that can be screened in
                      parallel. Default: 96
--hierarchy_batch HIERARCHY_BATCH
                      Excluding the first-round sampling, a new hierarchy is
                      generated after every hierarchy_batch variants are
                      collected, until the maximum hierarchy is reached.
                      Default: 96
--num_batch NUM_BATch
                      Number of batches. Default: 4
--input_path INPUT_PATH
                      Input files directory. Default: Input/
--save_dir SAVE_DIR   Output files directory. Default: current time
```
Output files:

- `InputValidationData.csv`: selected labeled variants, i.e., the training data for downstream supervised learning. By default it contains 384 labeled variants, collected with batch size 96.
- `clustering.npz`: indices of the variants in each cluster.
In our work on CLADE 2.0, we always set the second K_increment to 0. In that case, the first and second rounds of sampling are performed on the same clusters, which enhances accuracy: the sampling probabilities in the first round are driven by evolutionary scores, while those in the second round are given by the fitness of the labeled data:

```
python3 clustering_sampling.py 10 0 10 10
```
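The two-round probability update on a fixed set of clusters can be illustrated as follows. This is a plausible reading of the text, not the exact CLADE 2.0 formula: the softmax weighting over cluster-mean scores and the handling of unlabeled clusters are assumptions.

```python
import numpy as np

def cluster_probs_from_scores(cluster_ids, scores):
    """Round 1: cluster sampling probabilities from mean evolutionary score
    (softmax over cluster means is an assumed weighting)."""
    clusters = np.unique(cluster_ids)
    means = np.array([scores[cluster_ids == c].mean() for c in clusters])
    w = np.exp(means - means.max())
    return clusters, w / w.sum()

def cluster_probs_from_labels(cluster_ids, labeled_idx, labeled_fitness):
    """Round 2: re-weight the same clusters by the measured fitness."""
    clusters = np.unique(cluster_ids)
    fit = np.full(len(cluster_ids), np.nan)
    fit[labeled_idx] = labeled_fitness
    means = np.array([np.nanmean(fit[cluster_ids == c]) for c in clusters])
    # clusters with no labels yet get the lowest observed mean (assumption)
    means = np.nan_to_num(means, nan=np.nanmin(means))
    w = np.exp(means - means.max())
    return clusters, w / w.sum()
```

With K_increments such as `10 0 10 10`, both functions operate on the same 10 first-level clusters before further hierarchy levels are added.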
`CLADE2.py`

Runs the full CLADE 2.0 process: `clustering_sampling.py` followed by downstream supervised learning (MLDE). It takes the same positional and optional arguments as `clustering_sampling.py`, plus one additional optional argument:

```
--mldepara MLDEPARA   List of MLDE parameters. Default: MldeParameters.csv
```

In addition to the three output files from `clustering_sampling.py`, the MLDE package outputs six files. The most important is `PredictedFitness.csv`, which gives the predicted fitness of every variant in the combinatorial library; variants with higher predicted fitness have higher priority to be screened.
```
python3 CLADE2.py 10 0 10 10 --batch_size 96 --num_first_round 96 --hierarchy_batch 96 --num_batch 4
python3 CLADE2.py 10 0 10 10 --batch_size 96 --num_first_round 96 --hierarchy_batch 96 --num_batch 4 --mldepara Demo_MldeParameters.csv
```
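After a run, the next screening batch can be read off `PredictedFitness.csv` by sorting on predicted fitness. A small helper for that is sketched below; the column names `AACombo` and `PredictedFitness` are assumptions about the CSV layout.

```python
import pandas as pd

def top_candidates(csv_path, n=96, already_screened=()):
    """Return the n unscreened variants with the highest predicted fitness.

    Assumes the CSV has columns "AACombo" (variant sequence) and
    "PredictedFitness" (MLDE's prediction).
    """
    pred = pd.read_csv(csv_path)
    pred = pred[~pred["AACombo"].isin(already_screened)]
    return pred.sort_values("PredictedFitness", ascending=False).head(n)

# e.g. top_candidates("PredictedFitness.csv", n=96)
```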
The GB1 dataset (`GB1.xlsx`) can be obtained from: Wu, Nicholas C., et al. "Adaptation in protein fitness landscapes is facilitated by indirect paths." eLife 5 (2016): e16965.

The PhoQ dataset (`PhoQ.xlsx`) is owned by Michael T. Laub's lab. Please cite: Podgornaia, Anna I., and Michael T. Laub. "Pervasive degeneracy and epistasis in a protein-protein interface." Science 347.6222 (2015): 673-677.

The supervised learning package MLDE and the zero-shot predictions can be found in: Wittmann, Bruce J., Yisong Yue, and Frances H. Arnold. "Informed training set design enables efficient machine learning-assisted directed protein evolution." Cell Systems (2021).

The original CLADE can be found, with its paper, in: Qiu, Yuchi, Jian Hu, and Guo-Wei Wei. "Cluster learning-assisted directed evolution." Nature Computational Science (2021).
This work is under review.