Welcome to the GitHub repository for: Deep indel mutagenesis reveals the impact of insertions and deletions on protein stability and function. https://www.biorxiv.org/content/10.1101/2023.10.06.561180v1
- 1. Required Software
- 2. Required Data
- 3. Installation Instructions
- 4. Usage
To run the deep_indel_mutagenesis pipeline you will need the following software and associated packages:
- R (dplyr, modeest, stringr, strex, data.table, bio3d, seqinr, ggplot2, ggridges, GGally, gridExtra, viridis, writexl, matrixStats, vecsets, grpreg, glmnet, glinternet)
- Functions: to run the pipeline you will need to download the "Functions" folder, which contains the custom functions written to process the data for the deep_indel_mutagenesis pipeline
- The following should be downloaded as "additional_dfs.rds" from here: i) predictions from CADD, DDMut, ESM1v, PROVEAN, and ESM1b; ii) STRIDE and rSASA information; iii) the INDELi model predictions for the Tsuboyama data; and iv) additional data frames necessary to run the pipeline.
- DiMSum files necessary for running the deep_indel_mutagenesis pipeline should be downloaded from here. If you want to re-run DiMSum, you will also find the necessary scripts ("VariantIdentity", "ExperimentalDesign", etc.) in the same folder.
- Tsuboyama et al. 2023 raw data ("Tsuboyama2023_Dataset2_Dataset3_20230416.csv") and the PDB files ("AlphaFold_model_PDBs") should be downloaded here
- Pre-processed data for reproducing the figures can also be downloaded from here
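Before running the pipeline, it can help to confirm that the downloads are where you expect them. The following is a minimal sketch, assuming everything was saved into a single folder (here called `data`, a placeholder). The file and folder names are the ones listed above; the helper `missing_inputs` is illustrative and not part of this repository.

```r
# Hypothetical layout: all downloads were saved or unzipped into one folder.
# Adjust `data_dir` to wherever you stored the files.
data_dir <- "data"

# File and folder names as given in this README.
expected <- c(
  "additional_dfs.rds",
  "Tsuboyama2023_Dataset2_Dataset3_20230416.csv",
  "AlphaFold_model_PDBs"
)

# Return the expected files or folders that are absent from `dir`.
# `missing_inputs` is an illustrative helper, not a function from the repository.
missing_inputs <- function(dir, names) {
  names[!file.exists(file.path(dir, names))]
}

absent <- missing_inputs(data_dir, expected)
if (length(absent) > 0) {
  message("Not found in '", data_dir, "': ", paste(absent, collapse = ", "))
} else {
  # readRDS() is base R; no assumption is made about the object's structure.
  additional_dfs <- readRDS(file.path(data_dir, "additional_dfs.rds"))
  str(additional_dfs, max.level = 1)
}
```

The DiMSum files are not checked here because their individual file names are not listed in this README.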
Make sure you have git and conda installed and then run (expected install time <5min):
# Install dependencies (preferably in a fresh conda environment)
conda install -c conda-forge r-base r-dplyr r-modeest r-stringr r-strex r-data.table r-bio3d r-seqinr r-ggplot2 r-ggridges r-GGally r-gridExtra r-viridis r-writexl r-matrixStats r-vecsets r-grpreg r-glmnet r-glinternet
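After installation, you may want to confirm from within R that every package in the Required Software list can be found. This is a small sketch using base R only; `missing_packages` is a hypothetical helper name, not something provided by this repository.

```r
# Package names copied from the Required Software list in this README.
required_pkgs <- c(
  "dplyr", "modeest", "stringr", "strex", "data.table", "bio3d", "seqinr",
  "ggplot2", "ggridges", "GGally", "gridExtra", "viridis", "writexl",
  "matrixStats", "vecsets", "grpreg", "glmnet", "glinternet"
)

# Return the packages that cannot be found in the current R library paths.
# requireNamespace() loads a namespace without attaching it and returns
# FALSE when the package is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(required_pkgs)
if (length(missing) > 0) {
  message("Missing R packages: ", paste(missing, collapse = ", "))
} else {
  message("All required R packages are installed.")
}
```

Run this in the same conda environment you installed into; otherwise R may be looking at a different library path.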
1. To reproduce the figures:
- Download and unzip additional_files.zip, DiMSum.zip, pre_processed_data.zip and indel_prediction_models.zip from here. Also download the Functions folder necessary to execute the scripts.
- 000_load_functions: In stage 00 of the pipeline, we set the folder locations for the required data (downloaded above) and load the required functions from the Functions folder.
- 01_split_data: In stage 01 of the pipeline, we process the raw DiMSum files and call the indel and substitution variants. We also process the Tsuboyama et al. 2023 data set for further analysis. In this script you can either process the data yourself (PART1, PART2 and PART3) or directly load the processed data frames for further analysis (skip to PART4 and download the pre-processed data).
- 002_figure1_main: Reproduce Fig. 1
- 003_figure2_main: Reproduce Fig. 2
- 004_figure1_extended: Reproduce Extended Fig. 1
- 005_figure3_main: Reproduce Fig. 3
- 006_figure2_extended: Reproduce Extended Fig. 2
- 007_figure4_main: Reproduce Fig. 4
- 008_figure5_main: Reproduce Fig. 5
- 009_figure3_extended: Reproduce Extended Fig. 3
- 010_figure6_main: Reproduce Fig. 6
- 011_figure7_main: Reproduce Fig. 7
- 012_figure4_extended: Reproduce Extended Fig. 4
- 013_figure8_main: Reproduce Fig. 8
- 014_figure5_extended: Reproduce Extended Fig. 5
000_load_functions and 01_split_data should be run first.
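To illustrate what loading the custom functions involves, here is a rough sketch of sourcing every script in the downloaded Functions folder. It is not the repository's 000_load_functions script. It assumes the folder sits in the working directory and contains plain .R files, and `source_all` is a hypothetical helper name.

```r
# Assumption: the downloaded "Functions" folder is in the working directory
# and holds plain .R scripts. Adjust `functions_dir` otherwise.
functions_dir <- "Functions"

# Source every .R file found directly in `dir` and return the paths that
# were loaded. source() evaluates in the global environment by default.
source_all <- function(dir) {
  files <- list.files(dir, pattern = "\\.[Rr]$", full.names = TRUE)
  for (f in files) {
    source(f)
  }
  invisible(files)
}

if (dir.exists(functions_dir)) {
  loaded <- source_all(functions_dir)
  message("Sourced ", length(loaded), " file(s) from '", functions_dir, "'.")
} else {
  message("Folder '", functions_dir, "' not found; download it first.")
}
```

Once the functions are available in the session, the figure scripts can be run in any order after 01_split_data.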
2. To run the genome-wide predictions for the human proteome:
We provide the README instructions and scripts used to run the genome-wide prediction using INDELi-E in the folder genome_wide_prediction_INDELi_E.
3. Use INDELi-E model for single proteins:
Alternatively, we provide code to run the pre-trained INDELi-E model to predict the stability effects of 1 aa deletions and insertions in your protein of interest. It is available in the folder single_protein_prediction.