This repository contains results of the semester project "Detecting novel molecular events in proteomics data for genetic diagnostics" in Bioinformatics Institute.
The aim of this project was to improve diagnostics of Mendelian disorders via proteomics.
- To review existing literature on the subject.
- To acquire expertise in different methods of detection of expression outliers detection
- To modify these methods for detection of aberrant expression of protein complexes
- To acquire expertise in gene set enrichment analysis (GSEA)
- To perform benchmarking of the different combinations of methods and sorting parameters used in GSEA
- To find optimal combination of the aforementioned methods and parameters
We analyzed data of 147 patients, for 61 of which one of the genes containing disease-causing mutation was determined earlier. We used CORUM database of protein complexes avaliable at http://mips.helmholtz-muenchen.de/corum/#download. For detection of aberrant expression, two methods were used. First of them, LIMMA, implies normalization on the control sample. Second, PROTRIDER, normalizes the data using autoencoder. To determine the enriched protein complexes and their function, we used gene set enrichment analysis (GSEA) implemented in R package clusterProfiler
. It is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes) - see more detailed information at https://www.gsea-msigdb.org/gsea/index.jsp
The commands and examples mentioned in this README have been tested on x86_64 Ubuntu 18.04 LTS with Intel(R) Core(TM) i7-3630QM CPU, 8 Gb system memory using R version 3.6.3 with the following packages:
data.table
v1.13.0tidyr
v1.1.2dplyr
v1.0.2clusterProfiler
v3.14.3ggplot2
v3.3.2viridis
v0.5.1
The aforementioned packages are also required to launch the code on your computer.
- CORUM complexes database. Can be downloaded from here.
- Aberrant protein expression data to analyze. This should be the data obtained with either LIMMA or PROTRIDER method. Columns in such files should include the following columns:
- proteome_ID - IDs of patients For each dysregulated protein, the following columns should be specified:
- geneID - gene IDs
- UNIPROT_ID - UNIPROT IDs
- PROTEIN_ZSCORE - Z-scores
- PROTEIN_FC - Fold change
- PROTEIN_LOG2FC - log2(fold change)
- PROTEIN_PADJ - adjusted p-value
- PROTEIN_outlier - TRUE if the protein is considered as outlier; FALSE if not
- validated - TRUE if the protein has the validation in the previously obtained data.
The main script SMR1
is divided into several sections. The following overview will cover them consequently.
This section installs the packages necessary for the analysis. The viridis
package was chosen because of aesthetic preferences.
Three main functions were used in the analysis. data_prep
imports the CORUM database and aberrant expression results, filters them and returns the data frame prepared for GSEA, filtered CORUM database and name of the sorting variable chosen for ordering the results in GSEA. gene_set_enrichment
performs GSEA on the data frame returned by data_prep
and returns the results. Enrichment
is the wrapper function for the two functions described above.
This function takes four arguments as input:
path
: absolute path to folder with files including last slash. Do not specify the files themselves. Example:'/home/chorzow/BI/SMR1/'
expression_results
: filename of aberrant expression results. Should be provided in quotes. Example:'expression_results.tsv'
cor
: file containing the database of protein complexes. Should be provided in quotes. Default:'CORUMcoreComplexes.txt'
sorting_variable
: parameter by which sorting in downstream analysis (GSEA) will be performed. Should be one of'fc'
,'log2fc'
,'zscore'
,'padj'
.
The function returns a list containing several data frames, stored at the following indexes: 1 - data frame prepared for the GSEA; 2 - filtered CORUM database; 3 - chosen sorting variable that will be used for GSEA. You can extract the data frames by assigning the elements of this list to various variables (See Example).
Example:
dp <- data_prep(path = '/home/chorzow/BI/SMR1/,
expression_results = 'expression_results_LIMMA.tsv', sorting_variable = 'zscore')
expression_data <- dp[[1]]
corum <- dp[[2]]
sort_var <- dp[[3]]
This function performs the Gene Set Enrichment Analysis (GSEA). It takes four arguments as input:
- gsea_data: data file with prepared results (obtained from the data_prep function)
- corum: file containing the filtered database of protein complexes (obtained from the data_prep function)
- sorting_variable: parameter by which sorting in GSEA will be performed (obtained from the data_prep function)
- detection_method: method of aberrant expression detection. Should be specified in quotes and chosen according to the method used for obtaining aberrant expression results data. For now,
'PROTRIDER'
and'LIMMA'
are supported.
The function returns a data frame containing the final results of GSEA, ordered by patient ID.
Example:
res <- gene_set_enrichment(gsea_data = dp[[1]],
corum = dp[[2]],
sorting_variable = dp[[3]],
detection_method = 'LIMMA')
Since it could be inconvenient to call data_prep
and gene_set_enrichment
every time, we implemented a wrapper function for the previous two called Enrichment
. This function takes four arguments as input:
path
: absolute path to folder with files including last slash. Do not specify the files themselves. Example:'/home/chorzow/BI/SMR1/'
expression_results
: filename of aberrant expression results. Should be provided in quotes. Example:'expression_results_LIMMA.tsv'
sorting_variable
: parameter by which sorting in GSEA will be performed. Should be one of'fc'
,'log2fc'
,'zscore'
,'padj'
. Default:'zscore'
.detection_method
: method of aberrant expression detection. Should be specified in quotes and chosen according to the method used for obtaining aberrant expression results data. For now,'PROTRIDER'
and'LIMMA'
are supported.
Example:
protrider_fc <- Enrichment(path = '/home/chorzow/BI/SMR1/',
expression_results = 'Protrider_results_autoencoder_normalised.tsv',
sorting_variable = 'fc', detection_method = 'PROTRIDER')
In this section, we performed benchmarking of different aberrant detection methods with different sorting variables in GSEA. After that, we visualized number of protein complexes detected in each patient using different combinations of sorting variables and aberrant expression detection methods:
We noticed that the number of detected complexes was too big if we used adjusted p-value as a sorting variable. That is why we did not include this variable in the further analysis. We then visualized the sensitivity of each combination by plotting -log10 of adjusted p-value over different sorting variables:
Finally, we created some fancy cnetplots to visualize dysregulated protein complexes in some patients and extent of the dysregulation. For ethical reasons, we will not specify patient IDs neither in code nor on plots.
Zhou, B., Yan, Y., Wang, Y. et al. Quantitative proteomic analysis of prostate tissue specimens identifies deregulated protein complexes in primary prostate cancer. Clin Proteom 16, 15 (2019). https://doi.org/10.1186/s12014-019-9236-2 Jiang, L., Wang M., Lin S. et. al. A Quantitative Proteome Map of the Human Body. bioRxiv 797373; doi: https://doi.org/10.1101/797373 Kremer, L., Bader, D., Mertes, C. et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat Commun 8, 15824 (2017). https://doi.org/10.1038/ncomms15824 Vicente A. Yépez, Christian Mertes, Michaela F. Mueller et al. Detection of aberrant events in RNA sequencing data, 03 January 2020, PROTOCOL (Version 1) available at Protocol Exchange [+https://doi.org/10.21203/rs.2.19080/v1+] Kikugawa, S., Nishikata, K., Murakami, K. et al. PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from H-Invitational protein-protein interactions integrative dataset. BMC Syst Biol 6, S7 (2012). https://doi.org/10.1186/1752-0509-6-S2-S7 Giurgiu M, Reinhard J, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Ruepp A. CORUM: the comprehensive resource of mammalian protein complexes-2019. Nucleic Acids Res. 2018 Oct 24. doi: 10.1093/nar/gky973. [Epub ahead of print] 30357367