This README describes mBLM (memory B cell Language Model) from the paper: paper link
- Env setup
- Dataset download and process
- Train memory B cell Language Model
- Epitope prediction
- Antibody binding sites identification
## Env setup

```shell
conda env create -f environment.yml
```
## Dataset download and process

- OAS human paired memory B cell: https://opig.stats.ox.ac.uk/webapps/oas/oas_paired/
- paired antibodies from GenBank: Genbank
- Flu Antibody dataset in this paper: Flu
- SARS-CoV-2 and HIV Antibody dataset in this paper: SARS-CoV-2
- Final dataset used for epitope prediction: epi_data
We downloaded and processed all paired memory B cell sequences from OAS. Heavy chains were then clustered at 95% sequence identity using CD-HIT, and a representative sequence was chosen from each cluster. For training and testing, the dataset was split at several sequence-identity thresholds (50%, 60%, 70%, 80%, 90%).
```shell
./data_clean_for_LM.sh
```
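As a rough illustration of what the CD-HIT step does, here is a minimal greedy identity-clustering sketch in plain Python; the `identity` function and toy sequences are illustrative, and the pipeline itself uses CD-HIT at 95% identity rather than this code:

```python
def identity(a, b):
    """Fraction of matching positions between two (aligned) sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.95):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches at >= threshold, otherwise it founds a
    new cluster and becomes its representative."""
    reps = []      # one representative sequence per cluster
    clusters = []  # member lists, parallel to reps
    for s in sorted(seqs, key=len, reverse=True):  # longest first, like CD-HIT
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return reps, clusters

reps, clusters = greedy_cluster(["AAAA", "AAAT", "GGGG"], threshold=0.75)
# "AAAA" and "AAAT" share 3/4 = 0.75 identity, so they cluster together;
# "GGGG" becomes its own cluster.
```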
Note: for epitope prediction, only the heavy chain sequence was used. First, we removed "unknown" sequences that are similar to annotated sequences (sequence identity >= 0.9) and relabeled the remaining "unknown" entries as "Others". Then we split into train/test/val at a maximum sequence identity of 80% (>= 26 AA differences).
```shell
./data_clean_for_epitope.sh
```
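The "unknown"-relabeling step can be sketched as follows; `relabel_unknown`, the record format, and the simple matched-position identity measure are hypothetical simplifications of what `data_clean_for_epitope.sh` actually does:

```python
def identity(a, b):
    """Fraction of matching positions between two (aligned) sequences."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def relabel_unknown(records, threshold=0.9):
    """records: list of (sequence, epitope_label) pairs.
    Drop 'unknown' sequences that are near-duplicates (identity >= threshold)
    of any annotated sequence; relabel the surviving 'unknown' as 'Others'."""
    annotated = [s for s, lab in records if lab != "unknown"]
    cleaned = []
    for seq, lab in records:
        if lab == "unknown":
            if any(identity(seq, a) >= threshold for a in annotated):
                continue  # too similar to an annotated antibody: drop it
            lab = "Others"
        cleaned.append((seq, lab))
    return cleaned

data = [("ACDEFGHIKL", "stem"), ("ACDEFGHIKV", "unknown"), ("WWWWWWWWWW", "unknown")]
print(relabel_unknown(data))
# -> [('ACDEFGHIKL', 'stem'), ('WWWWWWWWWW', 'Others')]
```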
## Train memory B cell Language Model

mBLM was adapted from the RoBERTa model (RoBERTa: A Robustly Optimized BERT Pretraining Approach). For model and training details, see the paper: paper link.
```shell
python train_LM.py
```
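mBLM follows RoBERTa's masked-language-modeling objective. A minimal sketch of the standard 80/10/10 dynamic masking scheme is below; the token ids and vocabulary are illustrative, not the repo's actual tokenizer:

```python
import random

MASK_ID = 1                  # hypothetical [MASK] token id
VOCAB = list(range(5, 30))   # hypothetical amino-acid token ids

def mask_tokens(token_ids, mask_prob=0.15, rng=random.Random(0)):
    """Return (inputs, labels). 15% of positions are selected; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged. Labels hold
    the original id at selected positions and -100 (ignored) elsewhere."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID          # replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)  # replace with a random token
            # else: keep the original token
        else:
            labels.append(-100)
    return inputs, labels
```

The model is then trained to recover the original token at each selected position from `inputs`.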
## Epitope prediction

We then fine-tuned mBLM for multi-epitope prediction. First, extract mBLM embeddings for the antibody sequences:
```shell
python extract_mBLM_feature.py --model_location mBLM --fasta_file result/epitope_info_clstr_v2.fasta --output_dir result/mBLM_embedding
```
```shell
# train classifiers on one-hot, ESM2, and mBLM representations respectively
python ./Epitope_Clsfr/train.py -n onehot_baseline -lm onehot -hd 26
python ./Epitope_Clsfr/train.py -n esm2_attention -lm esm2_t33_650M_UR50D -hd 1280
python ./Epitope_Clsfr/train.py -n mBLM_attention -lm mBLM -hd 768
```
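For context, the `-hd 26` of the one-hot baseline matches a 26-letter alphabet. A minimal sketch of such an encoder follows; the exact vocabulary used by `Epitope_Clsfr` is an assumption here:

```python
# Assumed 26-letter alphabet (A-Z), giving 26-dim one-hot vectors.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq):
    """Encode a protein sequence as a list of 26-dim one-hot vectors,
    one vector per residue."""
    vectors = []
    for aa in seq:
        v = [0.0] * len(ALPHABET)
        v[AA_INDEX[aa]] = 1.0
        vectors.append(v)
    return vectors

enc = one_hot("ACD")
# len(enc) == 3 residues, each a 26-dim vector with exactly one 1.0
```

The ESM2 and mBLM runs instead consume the per-residue embeddings extracted above, hence the larger hidden dimensions (1280 and 768).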
```shell
python ./Epitope_Clsfr/test.py -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt
```
```shell
python ./Epitope_Clsfr/predict.py -dp result/Flu_unknown.csv -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt
python ./Epitope_Clsfr/predict.py -dp result/Sarah_stem_antibodies.xlsx -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt
```
## Antibody binding sites identification

To investigate how the model makes accurate predictions, the Grad-CAM technique (https://arxiv.org/abs/1610.02391) was used to show that the model captures binding sites at the residue level.
```shell
python ./Epitope_Clsfr/explain.py -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt
python ./Epitope_Clsfr/explain.py -dfp result/Flu_unknown.csv -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt -o result/Flu_unknown_explain/ --provide_dataset
```
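The Grad-CAM computation itself can be sketched in a few lines for a 1-D sequence model; the activation/gradient shapes below are illustrative, not taken from the repo:

```python
# Grad-CAM for a sequence model: given per-residue activation maps A[k][i]
# and gradients dY/dA[k][i] of the predicted class score, the saliency at
# residue i is ReLU(sum_k alpha_k * A[k][i]), where alpha_k is the global
# average of the gradients over positions.
def grad_cam(activations, gradients):
    """activations, gradients: [channels][positions] lists of floats."""
    n_pos = len(activations[0])
    # channel weights alpha_k: global average of the gradients
    alphas = [sum(g) / n_pos for g in gradients]
    saliency = []
    for i in range(n_pos):
        s = sum(a * act[i] for a, act in zip(alphas, activations))
        saliency.append(max(s, 0.0))  # ReLU keeps only positive evidence
    return saliency

acts = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
grads = [[0.5, 0.5, 0.5], [-1.0, -1.0, -1.0]]
print(grad_cam(acts, grads))  # -> [0.5, 0.0, 1.0]
```

Residues with high saliency are the candidates for antibody binding sites, which the scripts below project onto structures for visualization.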
```shell
python script/write_pdb_visualizer.py
python script/pymol_viz.py
python script/cluster_saliency_map.py
```