This README describes the analyses in:
A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2
- Local igblast setup
- Analysis of VDJ gene usage and V gene pairing
- Baseline VDJ setup
- CDR H3 clustering analysis
- Identification of recurring somatic hypermutation (SHM)
- Analysis of recurring SHM in IGHV1-58/IGKV3-20 antibodies
- Clonotype assignment
- Deep learning model for antigen identification
- Plotting
- ./data/SARS-CoV-2-Abs.xlsx: List of antibodies collected from publications and patents
- ./data/HV_Repertoire_freq.xlsx: Baseline IGHV usage (see Baseline VDJ setup)
- ./data/LV_Repertoire_freq.xlsx: Baseline IGK(L)V usage (see Baseline VDJ setup)
- ./data/D_Repertoire_freq.xlsx: Baseline IGHD usage (see Baseline VDJ setup)
Install dependencies by conda:
conda create -n Abs -c bioconda -c anaconda -c conda-forge \
python=3.9 \
biopython \
pandas \
openpyxl \
distance \
logomaker \
igblast \
anarci \
mafft \
fasttree
Before analysis, do:
conda activate Abs
pip3 install crowelab_pyir
Database set up in pyir library directory
pyir setup
-
Sequence download from http://www.imgt.org/vquest/refseqh.html#VQUEST
-
Copy and paste, save as fasta (save all V gene in one file; all D gene in one file; all J gene in one file)
-
Clean data (raw edit_imgt_file.pl can be found on igblast-1.17.1xxx/bin)
edit_imgt_file.pl imgt_database/human_prot/imgt_raw/IGV.fasta > imgt_database/human_prot/IGV.fasta
- Create database (use "-dbtype prot" for protein sequence, use "-dbtype nucl" for DNA sequence). For example:
makeblastdb -parse_seqids -dbtype prot -in imgt_database/human_prot/IGV.fasta
makeblastdb -parse_seqids -dbtype nucl -in imgt_database/human_nuc/IGV.fasta
- Run PyIR for igBlast
- see PyIR.py
- Run local igblast on kabat numbering system
igblastn -query result/test.fasta \
-germline_db_V imgt_database/human_nuc/IGV.fasta \
-germline_db_J imgt_database/human_nuc/IGJ.fasta \
-germline_db_D imgt_database/human_nuc/IGD.fasta \
-organism human -domain_system kabat \
-auxiliary_data imgt_database/optional_file/human_gl.aux \
-out result/igblast_output
- Parse igblast output
- Using CDR_Parser.py for igblast_output
-
Download antibody repertoire data for healthy donors from cAb-Rep
-
Cal_repertoire_freq.py is used to establish the baseline germline usage frequency
- Extract VDJ gene usage
python3 code/VDJgene_freq_analysis.py
- Input file:
- Output files:
-
Extract information from the antibody dataset for downstream analyses
python3 code/parse_Ab_table.py
- Input file:
- Output files:
-
Clustering CDR H3 sequences
python3 code/CDRH3_clustering_optimal.py
- Input file:
- Output file:
-
Analyzing CDR H3 clustering results
python3 code/analyze_CDRH3_cluster.py
-
Numbering SARS2 antibody sequences according to Kabat numbering
ANARCI --scheme kabat --csv -i Fasta/SARS-CoV-2-Ab.pep -o result/SARS-CoV-2-Ab
- Input file:
- Output files:
-
Numbering germline sequences according to Kabat numbering
ANARCI --scheme kabat --csv -i imgt_database/human_prot/IGV.fasta -o result/Human_IGV_gene_kabat_num
- Input files:
- Output files:
-
Calling SHMs
python3 code/SHM_analysis.py
-
Extracting light chain sequences from IGHV1-58/IGKV3-20 antibodies
python3 code/extract_IGHV1-58_VL.py
- Input file:
- Output file:
-
Multiple sequence alignment
mafft Fasta/Cluster3_H158K320_LC.pep > Fasta/Cluster3_H158K320_LC.aln
-
Identifying amino acid variants at VL residues 29 and 92
python3 code/extract_IGHV1-58_aa.py
- Input file:
- Output file:
-
Constructing phylogenetic tree
FastTree Fasta/Cluster3_H158K320_LC.aln > result/Cluster3_H158K320_LC.tree
- Antibodies with the same IGHV, IGK(L)V, IGHJ, IGK(L)J, and belong to the same CDR H3 cluster will be assigned to the same clonotype
- Input files:
- Output file:
Deep learning model is under CoV_Encoder
- Classifier for S vs HA: ./code/CoV_Encoder/S_HA_classifier.ipynb
- Classifier for RBD vs HA: ./code/CoV_Encoder/RBD_HA_classifier.ipynb
- Classifier for RBD vs NTD vs S2: ./code/CoV_Encoder/RBD_S2_NTD_classifier_VH.ipynb
- Input files:
- Output files:
-
Plot summary statistics
Rscript code/Plot_summary_Piechart.R
- Input file:
- Output file:
-
Plot VDJ gene usage
Rscript code/Plot_VDJgene_Freq.R
-
plot IGHV/IGK(L)V pairing frequency
Rscript code/Plot_point_heatmap.R
-
Plot CDR H3 cluster size
Rscript code/plot_CDRH3_cluster_summary.R
- Input file:
- Output file:
-
Generate sequence logo for different CDR H3 clusters
python3 code/CDRH3_seqlogo.py
- Input file:
- Output files:
- ./CDRH3_seqlogo/*.png
-
Plot IGHV gene usage for CDR H3 cluster 7
Rscript code/plot_cluster7_Vgenes.R
- Input file:
- Output file:
-
Plot analysis results for IGHD1-26-encoded S2 antibodies
Rscript code/plot_IGHD1-26_analysis.R
-
Plot SHM
Rscript code/plot_SHM.R
- Input file:
- Output files:
-
Plot the number of neutralizing vs non-neutralizing antibodies in each IGHV/IGK(L)V pair
Rscript code/Plot_basic_stat.R
- Input file:
- Output file:
-
Plot phylogenetic tree for the light chains of IGHV1-58/IGKV3-20 antibodies
Rscript code/plot_tree_KV320.R
- Input files:
- Output file: