Ab_epitope

Memory-B-cells Language Model (mBLM) for antibody epitope prediction guidelines

This README describes the mBLM in the paper: paper link

Env setup

To set up the environment with conda, run:

conda env create -f environment.yml

Dataset download and pre-process

Dataset

dataset for memory-B-cells Language Model

We downloaded and processed all paired memory B cell sequences from the Observed Antibody Space (OAS) database. Heavy chains were then clustered at 95% sequence identity using CD-HIT, and one representative sequence was chosen from each cluster. For training and testing, the dataset was split at different sequence identity thresholds (50%, 60%, 70%, 80%, 90%).

./data_clean_for_LM.sh
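
As an illustration of the "one representative per cluster" step, the sketch below reads the `.clstr` text that CD-HIT writes, where each cluster begins with a `>Cluster N` line and the representative member ends with `*`. It is a minimal sketch, not the logic in data_clean_for_LM.sh: the function name `parse_clstr` and the sequence IDs in the example are hypothetical.

```python
from typing import Dict, List


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each cluster's representative ID to the IDs of all its members."""
    clusters: Dict[str, List[str]] = {}
    members: List[str] = []
    representative = None
    # A sentinel header at the end flushes the final cluster.
    for line in text.splitlines() + [">Cluster END"]:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            if representative is not None:
                clusters[representative] = members
            members, representative = [], None
            continue
        # Member line, e.g. "0  121aa, >heavy_001... *"
        seq_id = line.split(">", 1)[1].split("...", 1)[0]
        members.append(seq_id)
        if line.endswith("*"):  # CD-HIT marks the representative with "*"
            representative = seq_id
    return clusters


# Hypothetical example in CD-HIT .clstr layout.
EXAMPLE = (
    ">Cluster 0\n"
    "0\t121aa, >heavy_001... *\n"
    "1\t121aa, >heavy_002... at 96.69%\n"
    ">Cluster 1\n"
    "0\t118aa, >heavy_003... *\n"
)

representatives = sorted(parse_clstr(EXAMPLE))
print(representatives)
```

The keys of the returned dictionary are the representative IDs; the values list every member so that cluster sizes can be inspected if needed.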

dataset for antibody epitope prediction

Note: only heavy chain sequences were used for epitope prediction. First, we removed "unknown" sequences that are similar to annotated sequences (sequence identity 0.9) and relabeled the remaining "unknown" sequences as "Others". We then split the data into train/test/val sets at a maximum sequence identity of 80% (>= 26 AA differences).

./data_clean_for_epitope.sh
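
The link between an 80% identity cap and 26 amino-acid differences can be checked with a little arithmetic. The sketch below assumes a heavy chain length of 130 residues, which is not stated in this README, and computes identity only for pre-aligned sequences of equal length. Both function names are hypothetical, and this is not the logic in data_clean_for_epitope.sh.

```python
def min_differences(length: int, max_identity_pct: int) -> int:
    """Smallest number of mismatches that keeps identity at or below the cap."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-length * (100 - max_identity_pct) // 100)


def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Assumed length of 130 residues: 20% of 130 is 26 differences.
print(min_differences(130, 80))
print(percent_identity("ACDE", "ACDF"))
```

For other sequence lengths the mismatch count changes accordingly, so the 26-residue figure should be read as tied to the assumed length.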

Train memory B cell Language Model

mBLM was adapted from the RoBERTa model (RoBERTa: A Robustly Optimized BERT Pretraining Approach). For model and training details, see the paper: paper link.

python train_LM.py
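
RoBERTa-style models are pretrained with masked language modeling, where a fresh mask is drawn each time a sequence is seen. The sketch below shows that idea on an amino-acid sequence using the standard BERT/RoBERTa recipe (select 15% of tokens; of those, 80% become a mask token, 10% a random token, 10% stay unchanged). Whether train_LM.py uses these exact rates is an assumption, and `mask_sequence`, `MASK`, and the 20-letter alphabet are hypothetical names chosen for illustration.

```python
import random
from typing import List, Optional, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed vocabulary for this sketch
MASK = "<mask>"                        # hypothetical mask token


def mask_sequence(
    seq: str, rng: random.Random, mask_prob: float = 0.15
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels are None where no loss applies."""
    tokens = list(seq)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, residue in enumerate(seq):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = residue  # the model must recover the original residue
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK
        elif roll < 0.9:
            tokens[i] = rng.choice(AMINO_ACIDS)
        # otherwise the original residue is kept as input
    return tokens, labels


tokens, labels = mask_sequence("EVQLVESGGGLVQPGGSLRLSCAAS", random.Random(0))
print(tokens)
print(labels)
```

Calling `mask_sequence` again with a different random state yields a different mask for the same sequence, which is what dynamic masking means in practice.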

Epitope prediction

We then fine-tuned mBLM for multi-epitope prediction.

extract mBLM embedding features

python extract_mBLM_feature.py --model_location mBLM --fasta_file result/epitope_info_clstr_v2.fasta --output_dir result/mBLM_embedding

benchmarking

python ./Epitope_Clsfr/train.py -n onehot_baseline -lm onehot -hd 26
python ./Epitope_Clsfr/train.py -n esm2_attention -lm esm2_t33_650M_UR50D -hd 1280
python ./Epitope_Clsfr/train.py -n mBLM_attention -lm mBLM -hd 768
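
In the commands above, `-hd` appears to give the per-residue feature width of each representation (26 for one-hot, 1280 for ESM-2 650M, 768 for mBLM); this reading is an inference from the values, not something the README states. To make the one-hot baseline concrete, here is a minimal sketch that encodes a sequence over the 20 standard amino acids. The repository's 26-symbol vocabulary is not documented here, so the alphabet and the function name `one_hot` are assumptions.

```python
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed: 20 standard amino acids only


def one_hot(seq: str, alphabet: str = ALPHABET) -> List[List[int]]:
    """Encode each residue as a 0/1 vector with one column per alphabet symbol."""
    index = {aa: i for i, aa in enumerate(alphabet)}
    rows = []
    for aa in seq:
        row = [0] * len(alphabet)
        if aa in index:  # symbols outside the alphabet stay all-zero
            row[index[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACX")
print(len(encoded), len(encoded[0]))
```

Unlike the language-model embeddings, this representation carries no context: the same residue always maps to the same vector regardless of its neighbors.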

model test and predict

python ./Epitope_Clsfr/test.py -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt

python ./Epitope_Clsfr/predict.py -dp result/Flu_unknown.csv -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt

python ./Epitope_Clsfr/predict.py -dp result/Sarah_stem_antibodies.xlsx -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt

Antibody binding sites identification

To investigate how the model arrives at its predictions, we used Grad-CAM (https://arxiv.org/abs/1610.02391) to highlight the binding sites the model captures at the residue level.

Grad-CAM

python ./Epitope_Clsfr/explain.py -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt

python ./Epitope_Clsfr/explain.py -dfp result/Flu_unknown.csv -n mBLM_attention -lm mBLM -hd 768 -ckn mBLM.ckpt -o result/Flu_unknown_explain/ --provide_dataset
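
For intuition, Grad-CAM on a sequence can be written in a few lines: each feature channel is weighted by its gradient averaged over positions, the weighted channels are summed per residue, and negative values are clipped by a ReLU. The sketch below is a plain-Python illustration with made-up numbers, not the code in explain.py. The function name `grad_cam_1d` is hypothetical, and the final rescaling to [0, 1] is a common visualization convenience rather than part of the Grad-CAM definition.

```python
from typing import List


def grad_cam_1d(
    activations: List[List[float]], gradients: List[List[float]]
) -> List[float]:
    """Per-residue saliency from [position][channel] activations and gradients."""
    n_pos = len(activations)
    n_ch = len(activations[0])
    # Channel weights: gradient of the class score averaged over positions.
    weights = [
        sum(gradients[p][k] for p in range(n_pos)) / n_pos for k in range(n_ch)
    ]
    # Weighted sum over channels at each position, followed by ReLU.
    cam = [
        max(0.0, sum(weights[k] * activations[p][k] for k in range(n_ch)))
        for p in range(n_pos)
    ]
    peak = max(cam)
    # Optional rescaling so the most salient residue has score 1.
    return [c / peak for c in cam] if peak > 0 else cam


# Toy example: 3 residues, 2 channels (values are illustrative only).
acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
grads = [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]
print(grad_cam_1d(acts, grads))
```

Residues with high scores are those whose features push the predicted class score up, which is why the resulting map can be compared against known binding sites.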

Visualization

python script/write_pdb_visualizer.py

python script/pymol_viz.py

python script/cluster_saliency_map.py