Study the similarities and differences in gene expression between men and women across human diseases

jonsv89/SHDC

Sex-specific transcriptome similarity networks elucidate comorbidity relationships

The goal of this project is to study the similarities and differences in gene expression between men and women across human diseases, in order to better understand sex differences in comorbidity.

Gene expression analysis

Raw expression data were downloaded from the Gene Expression Omnibus (GEO) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress). Studies conducted on the HG U133Plus 2 Affymetrix microarray platform were chosen for their cost-effectiveness and reproducibility, given their potential to translate findings into clinical practice. This selection also provided the large sample sizes needed to analyze disease-disease associations while minimizing the biases that could arise from combining different platforms or technologies. After removing low-quality samples (GNUSE median values higher than 1.25), 128 diseases with at least three cases and three controls were kept.

Because sex information is provided for only 52.23% (6,685/12,797) of the samples, the massiR package was used to infer the sex of the samples. This method classifies samples as male or female by analyzing the expression of probes for Y chromosome genes. To homogenize terms and to make the transcriptomic similarity networks generated in the following steps comparable to previously published epidemiological networks, disease names were transformed into 3-digit codes of the International Classification of Diseases, version 10 (ICD10), of the World Health Organization, grouping specific diseases into a single 3-digit ICD code.

The MAS 5.0 algorithm was used to identify and remove lowly expressed genes (detection p-value<0.05). Background correction, normalization, and summarization were done using the frozen Robust Multiarray Analysis (fRMA) preprocessing algorithm (default parameters). Differential expression analyses comparing samples with disease (cases) vs. disease-free samples (controls) were conducted using the linear regression model provided by the LIMMA package. These comparisons were performed separately for women and men, and jointly for both sexes, adjusting for confounding variables (study of origin and sex). Genes with a False Discovery Rate (FDR) <= 0.05 and a log Fold Change (logFC) higher (lower) than zero were considered significantly up- (down-) regulated.
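The up/down classification rule described above can be sketched as follows. This is only an illustration in Python, not the project's code (the analysis itself relies on the LIMMA R package); the function name `classify_gene` and the gene identifiers are hypothetical.

```python
def classify_gene(log_fc, fdr, fdr_cutoff=0.05):
    """Label one gene as 'up', 'down', or 'ns' (not significant).

    A gene is significant when FDR <= fdr_cutoff; the sign of logFC
    then decides the direction of the change.
    """
    if fdr > fdr_cutoff or log_fc == 0:
        return "ns"
    return "up" if log_fc > 0 else "down"


# Hypothetical differential expression results: gene -> (logFC, FDR)
results = {
    "GENE_A": (1.8, 0.001),
    "GENE_B": (-0.9, 0.04),
    "GENE_C": (2.3, 0.20),
}
labels = {gene: classify_gene(lfc, fdr) for gene, (lfc, fdr) in results.items()}
```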

Gene set enrichment analysis

We performed functional enrichment analyses using the Gene Set Enrichment Analysis (GSEA) method, which was applied to the entire list of logFC-ranked genes obtained from the differential expression analysis (see previous section). The resources used in GSEA were Reactome, Gene Ontology, and KEGG. We performed disease clustering using Reactome pathways specifically, whose hierarchical structure allows us to identify 29 lowest-level pathway diagram annotations or categories. Pairwise Euclidean distances were calculated between Reactome pathway enrichment profiles (Normalized Enrichment Score values provided by GSEA). The Ward2 method was used to generate the clusters, and significant disease clusters were identified through bootstrapping using the pvclust R package. The resulting dendrograms for women and men were compared using tanglegrams from the dendextend R package, which draw two dendrograms (with the same set of labels) facing each other, with matching labels connected by lines.
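As a rough illustration of the distance step only, pairwise Euclidean distances between enrichment profiles could be computed as below. This sketch uses plain Python rather than the R packages named above; the disease labels and NES values are made up, and it assumes every profile lists the same pathways in the same order. The clustering and bootstrapping steps are not reproduced here.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equally long numeric profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(profiles):
    """Return {(disease_a, disease_b): distance} for every unordered pair.

    profiles: dict mapping a disease label to its list of NES values.
    """
    return {
        (a, b): euclidean(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical NES profiles over three pathways
nes = {
    "D1": [1.0, -2.0, 0.5],
    "D2": [1.0, -2.0, 0.5],
    "D3": [-1.0, 2.0, 0.5],
}
dist = pairwise_distances(nes)
```

The resulting distances are the kind of input a hierarchical clustering routine would take.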

Network construction

Transcriptional similarities were calculated from differential expression values (logFC) using three distinct gene sets: (i) the complete list of annotated genes, (ii) the union of the annotated genes with significant differential expression (sDEGs), and (iii) the intersection of the sDEGs. Six similarity metrics were calculated: Pearson’s and Spearman’s coefficients, cosine similarity, and Euclidean, Canberra, and Manhattan distances. Empirical p-values were calculated through 10,000 permutations for the cosine similarity and the Euclidean, Canberra, and Manhattan distances, correcting for multiple testing by the Bonferroni approach and considering those similarities with an FDR<=0.05 as significant. In the case of Euclidean, Canberra, and Manhattan distances, the mean of the random distances was compared with the actual distances, obtaining positive (negative) values that indicate a greater (lesser) similarity than expected by chance. The similarity values were then binarized, converting values greater than 0 to +1 and values less than 0 to -1. For Euclidean, Canberra, and Manhattan distances, these values come from the comparison between real and random distances; for Pearson and Spearman correlations and cosine similarity, they are the coefficients themselves.
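A minimal sketch of a permutation-based empirical p-value for the cosine similarity, followed by the binarization step, might look like this. Several choices here are assumptions, since the text does not specify them: the null distribution is built by shuffling the gene order of one profile, the test is two-sided, and a +1 correction keeps the p-value above zero. Function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two logFC profiles of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def permutation_pvalue(u, v, n_perm=10_000, seed=0):
    """Return (observed similarity, empirical two-sided p-value).

    Assumption: the null is generated by shuffling the gene order of v.
    """
    rng = random.Random(seed)
    observed = cosine(u, v)
    shuffled = list(v)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(cosine(u, shuffled)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def binarize(value):
    """Map a similarity value to +1, -1, or 0 according to its sign."""
    return (value > 0) - (value < 0)
```

For the distance-based metrics, the value passed to `binarize` would be the difference between the mean random distance and the observed distance, as described above.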

Overlap with epidemiology

We used a previously published epidemiological network to identify the comorbidity relationships recovered by the disease transcriptomic similarity networks (DTSNs), which were generated from the similarities between differential expression profiles, and to assess how well the DTSNs explain comorbidity relationships. The overlap between networks was computed on the shared set of diseases (those present in both the DTSNs and the epidemiological networks). The overlap of positive and of negative transcriptomic similarities with the epidemiological networks was analyzed separately. Overlaps were measured by sex (women vs. women, men vs. men). The significance of the overlap was assessed with Fisher’s tests and with a set of specifically defined randomisations (generating 10,000 random networks by shuffling the edges of the DTSNs while maintaining the degree distribution).
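The Fisher's test part of the overlap analysis can be sketched as below. This is an illustration under stated assumptions rather than the project's implementation: edges are treated as undirected, the universe is every possible pair among the shared diseases, and the test is one-sided (enrichment). The function names are hypothetical, and the degree-preserving randomisation is not covered.

```python
from itertools import combinations
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def overlap_test(dtsn_edges, epi_edges, diseases):
    """Return (number of shared edges, p-value) on the shared disease set."""
    universe = {frozenset(pair) for pair in combinations(sorted(diseases), 2)}
    dtsn = {frozenset(e) for e in dtsn_edges} & universe
    epi = {frozenset(e) for e in epi_edges} & universe
    a = len(dtsn & epi)            # edges in both networks
    b = len(dtsn - epi)            # edges only in the DTSN
    c = len(epi - dtsn)            # edges only in the epidemiological network
    d = len(universe) - a - b - c  # pairs connected in neither network
    return a, fisher_greater(a, b, c, d)


# Hypothetical toy networks over four shared diseases
shared, p_value = overlap_test(
    [("A", "B"), ("A", "C"), ("A", "D")],
    [("A", "B"), ("A", "C"), ("B", "C")],
    {"A", "B", "C", "D"},
)
```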

Disease-drug associations

To study the potential sex-specific role of drugs in comorbidities, we retrieved drug targets from DrugBank. Since the number of targets per drug is relatively small for enrichment analyses, we used the protein-protein interaction network extracted from IID (selecting only those human protein-protein interactions that have been experimentally verified) to expand the number of targets associated with a given drug, by mapping the targets onto the network and selecting the first neighbours of the targets for each drug. We then conducted a GSEA to associate drugs targeting the products of up- or down-regulated genes with the corresponding disease, separately by sex. Disease-drug associations were extracted from the SIDER database. Disease names were transformed into ICD10 codes using the Unified Medical Language System, and DrugBank IDs were mapped to drug names.
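The first-neighbour expansion of drug targets can be illustrated with a small sketch. It assumes the interaction network is an undirected list of protein pairs and that the expanded set keeps the original targets alongside their neighbours; the protein identifiers and function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    adjacency = defaultdict(set)
    for p, q in interactions:
        adjacency[p].add(q)
        adjacency[q].add(p)
    return adjacency


def expand_targets(targets, adjacency):
    """Return the targets plus their first neighbours in the network."""
    expanded = set(targets)
    for target in targets:
        # Targets absent from the network simply contribute no neighbours.
        expanded |= adjacency.get(target, set())
    return expanded


# Hypothetical interaction list and drug target sets
ppi = build_adjacency([("P1", "P2"), ("P2", "P3"), ("P4", "P5")])
drug_x = expand_targets({"P1"}, ppi)
drug_y = expand_targets({"P2", "P9"}, ppi)
```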
