This README describes the analyses in:
Wang et al. Antigenic evolution of human influenza H3N2 neuraminidase is constrained by charge balancing. eLife 10:e72516 (2021)
This study aims to understand how epistasis influences NA antigenic evolution and to characterize the underlying biophysical constraints. This repository contains the analysis of the deep mutational scanning experiment, which focuses on NA residues 328, 329, 344, 367, 368, 369, and 370 in six genetic backgrounds: A/Hong Kong/1/1968 (HK68), A/Bangkok/1/1979 (Bk79), A/Beijing/353/1989 (Bei89), A/Moscow/10/1999 (Mos99), A/Victoria/361/2011 (Vic11), and A/Hong Kong/2671/2019 (HK19).
- All raw sequencing reads, which can be downloaded from the NIH SRA database (PRJNA742436), should be placed in the fastq/ folder. The filename for read 1 should match the one listed in ./data/SampleInfo.tsv. The filename for read 2 should be the same as for read 1, except that "R1" is replaced by "R2".
- ./data/SampleInfo.tsv: Describes the sample identity for each fastq file
- ./Fasta/RefSeq.fa: Reference (wild type) nucleotide sequences for the sequencing data
- ./Fasta/N2.fa: Reference (wild type) amino acid sequences for the sequencing data
- ./data/WTseq.tsv: Amino acids for the wild type sequences at residues 328, 329, 344, 367, 368, 369, 370
- ./Fasta/Human_H3N2_HA_2020.aln.gz: Full-length HA protein sequences from human H3N2 downloaded from GISAID
- ./Fasta/Human_H3N2_NA_2020.aln.gz: Full-length NA protein sequences from human H3N2 downloaded from GISAID
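The read 1 / read 2 naming rule for the fastq/ folder can be sketched as follows. This is a minimal illustration, not code from this repository: the helper name `read2_path` and the example filename are hypothetical, and the sketch assumes that "R1" occurs exactly once in the read 1 filename.

```python
from pathlib import Path


def read2_path(read1: Path) -> Path:
    """Derive the read 2 path from a read 1 path by swapping "R1" for "R2".

    Hypothetical helper. It assumes "R1" appears exactly once in the
    filename and raises otherwise, so an ambiguous name is never silently
    rewritten.
    """
    name = read1.name
    if name.count("R1") != 1:
        raise ValueError(f'expected exactly one "R1" in {name!r}')
    return read1.with_name(name.replace("R1", "R2"))


# Hypothetical filename, for illustration only.
r1 = Path("fastq/sample_L001_R1_001.fastq.gz")
r2 = read2_path(r1)
```

Checking that both files of each pair exist before running the pipeline is a cheap way to catch misnamed downloads early.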
- ./script/fastq_to_fitness.py: Converts raw reads to variant counts and fitness measures.
- Input files:
- Raw sequencing reads in fastq/ folder
- ./data/SampleInfo.tsv
- ./Fasta/RefSeq.fa
- Output files:
- result/NA_Epi_*.tsv
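This README does not state how fitness is computed from the variant counts. One common convention in deep mutational scanning, shown here purely as an assumption, is the log10 enrichment of a variant relative to the wild type. The function name, argument names, and pseudocount below are hypothetical and are not taken from ./script/fastq_to_fitness.py.

```python
import math


def relative_fitness(var_in, var_out, wt_in, wt_out, pseudocount=1.0):
    """Log10 enrichment of a variant relative to wild type (WT).

    Assumed convention, not necessarily the one used by the script:
        fitness = log10( (var_out / var_in) / (wt_out / wt_in) )
    A pseudocount (an assumption) avoids division by zero and log(0).
    """
    enrich_var = (var_out + pseudocount) / (var_in + pseudocount)
    enrich_wt = (wt_out + pseudocount) / (wt_in + pseudocount)
    return math.log10(enrich_var / enrich_wt)
```

Under this convention the wild type has a fitness of 0, and negative values indicate variants that are depleted relative to it.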
- ./script/complie_fit_result.py: Compiles variant information (amino acid; charge; fitness) across the six genetic backgrounds
- Input files:
- ./data/lib_variants.tsv
- ./Fasta/N2.fa
- ./data/WTseq.tsv
- result/NA_Epi_*.tsv
- Output files:
- ./script/NAEpi_PrefEvol.py: Extracts the amino acid sequences of the NA antigenic region of interest from naturally occurring strains
- ./script/GE_regression.ipynb: Model training and robustness validation
- Input files:
- Output files:
- result/*_epi.csv
- result/*_add.csv
- ./script/GE_regression_v2.ipynb: Cross-validation and regularization
- Input files:
- Output files:
- ./script/Distance_CA.py: Calculate Cα-Cα distances within the NA antigenic region
- Output files:
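The Cα-Cα distance calculation can be illustrated with a short sketch. This is not the code in ./script/Distance_CA.py: the function names are hypothetical, the input is assumed to be standard fixed-column PDB `ATOM` records, and the residue numbers are the seven positions listed at the top of this README.

```python
import math
from itertools import combinations

# Residues of interest, as listed at the top of this README.
RESIDUES = (328, 329, 344, 367, 368, 369, 370)


def ca_coordinates(pdb_lines, residues=RESIDUES, chain=None):
    """Collect Cα coordinates from PDB ATOM records (fixed-column format).

    Only the first Cα seen per residue number is kept. If `chain` is
    given, records from other chains are skipped.
    """
    coords = {}
    for line in pdb_lines:
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        if chain is not None and line[21] != chain:
            continue
        resnum = int(line[22:26])
        if resnum in residues and resnum not in coords:
            coords[resnum] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords


def pairwise_ca_distances(coords):
    """Euclidean distance for every residue pair, keyed by (low, high)."""
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }
```

With all seven residues present in the structure, `pairwise_ca_distances` returns 21 residue pairs.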
- ./script/ExtractSweep.py: Calculate the frequency of every variant at each residue of H3N2 NA by year
- Input files:
- Output files:
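As a rough sketch of what a per-year variant frequency calculation involves, the snippet below tallies amino acids at a single alignment column. It is an illustration under stated assumptions rather than the logic of ./script/ExtractSweep.py: the function name is hypothetical, the input is assumed to be (year, aligned sequence) pairs, and `position` is a 0-based alignment column, so mapping NA residue numbers to columns is left to the caller.

```python
from collections import Counter, defaultdict


def variant_freq_by_year(records, position):
    """Frequency of each amino acid at one alignment column, per year.

    records: iterable of (year, aligned_sequence) pairs (assumed format).
    position: 0-based column index in the alignment (assumption).
    """
    counts = defaultdict(Counter)
    for year, seq in records:
        counts[year][seq[position]] += 1
    return {
        year: {aa: n / sum(tally.values()) for aa, n in tally.items()}
        for year, tally in counts.items()
    }
```

Within each year the frequencies sum to 1, so the values are directly comparable across years with different numbers of sequences.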
- ./script/analyze_charge_natural_strain.py: Analyze the local charge of the antigenic region in naturally circulating strains
- Input files:
- Output files:
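The local net charge of the antigenic region can be illustrated with a small sketch. The charge assignment below (K and R as +1, D and E as -1, all other residues including H as neutral) is an assumption, since this README does not define it. The function name and the example motif are hypothetical.

```python
# Assumed per-residue charges: K/R = +1, D/E = -1, everything else = 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(motif: str) -> int:
    """Sum the assumed charges over a motif such as the seven
    residues 328, 329, 344, 367, 368, 369, 370."""
    return sum(CHARGE.get(aa, 0) for aa in motif)


# Hypothetical seven-residue motif, for illustration only.
example = net_charge("KDEKSGR")
```

Applying `net_charge` to the motif extracted from each natural strain gives one value per sequence, which can then be summarized by year.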
- ./script/mut_freq_ByYear.py: Analyze the accumulation of mutations in HA and NA since 1968
- Input files:
- Output files:
- ./script/Coevolution_analysis_NA.ipynb: Analyze charge-state coevolution in the NA antigenic region
- Input files:
- Output files:
- ./script/Evolution_model.py: Analyze the evolutionary trajectory based on the fitness data
- Input files:
- Output files:
- ./result/trajectory_prediction_*.tsv
- ./script/Plot_Mutation_year.R: plot the accumulation of mutations in HA and NA since 1968 (Supplementary Fig. 1)
- ./script/NA_epi.pml: plot the NA head domain (Fig. 1a)
- ./script/NA_epi_zoom.pml: plot the antigenic region of interest (Fig. 1b)
- ./script/TrackAAFreq.R: plot natural occurrence frequencies of the amino acid variants and charge states (Fig. 1c and Supplementary Fig. 13)
- Input files:
- Output files:
- ./script/Plot_CompareLib.R: plot fitness distributions and correlations across the different genetic backgrounds (Fig. 2a and b)
- Input files:
- Output files:
- ./script/Plot_Nat_motif_Freq.R: plot the frequencies of naturally occurring variants by year (Supplementary Fig. 2)
- ./script/Plot_CompareRep.R: plot the correlation between biological replicates (Supplementary Fig. 3)
- ./script/Plot_TrackPref.R: plot the fitness of naturally occurring variants by year (Fig. 2c)
- Input files:
- Output files:
- ./script/Plot_NA_titer.R: plot the virus rescue experiment of WT strains (Supplementary Fig. 4)
- ./script/Plot_hyperpar_R2.R: plot the evaluation of model hyperparameters using repeated k-fold cross-validation (Supplementary Fig. 5; ./graph/hyperpar_r.png)
- ./script/Plot_add_heatmap.R: plot parameters for additive fitness in different genetic backgrounds (Fig. 3a)
- Input files:
- result/*_add.csv
- Output files:
- ./script/Plot_epi_heatmap.R: plot pairwise epistasis heatmap and epistasis classified by charge states (Fig. 3b and Supplementary Fig. 8; Fig. 4a and Supplementary Fig. 9)
- Input files:
- result/*_epi.csv
- Output files:
- ./script/Plot_CorEPI.R: plot correlation matrices of additive fitness and pairwise epistasis among six genetic backgrounds (Fig. 3c and Supplementary Fig. 6-7)
- Input files:
- result/*_add.csv
- result/*_epi.csv
- Output files:
- ./script/NA_epi_bind.pml: plot the NA antigenic region interaction (Fig. 4b and Supplementary Fig. 10)
- ./script/Plot_charge_vs_fit.R: plot variant fitness against local net charge (Fig. 4c and Supplementary Fig. 12)
- ./script/Plot_Distance_vs_EPI: plot Cα-Cα distances and epistasis
- Input files:
- ./result/CA_distance.tsv
- result/*_epi.csv
- Output files:
- ./script/plot_charge_natural_strain.R: plot the evolution of local net charge at the NA antigenic region (Fig. 5a)
- Input files:
- Output files:
- ./script/Plot_epi_heatmap_bycharge.R: plot pairwise epistasis of charge states (Fig. 5b and Supplementary Fig. 14)
- Input files:
- result/*_epi.csv
- Output files:
- ./script/Plot_epi_vs_Coevol.R: plot the relationship between coevolution score and pairwise epistasis (Fig. 5c and Supplementary Fig. 16)
- Input files:
- ./result/Coevols.csv
- result/*_epi.csv
- Output files: