This folder contains all code used to generate all models and raw results. The scripts for CROSS-VALIDATION and FINAL MODELS below should be run in the order listed:
- Training 5x5 models:
- Train_CV_Models-MDAD.py
- Train_CV_Models-Baselines.py (linear and MLP models)
- Each of these scripts also saves CV performance metrics, which are displayed in Figure 2a.
- Evaluating prediction performance:
- select_final_models_from_CV.py (based on CV performance, selects hyperparameters for the final models)
- ROSMAP_data_validation_experiments.py (Evaluates performance on the ROSMAP test set only, when training with different subsets of the training data sets; Figure 2b)
- Compute last shared layer embeddings:
- save_CV_model_embeddings.py (Computes the last shared layer embedding of each split’s final model for the train and test sets. Models: MD-AD, MLPs, Unsupervised baselines)
- These embeddings are evaluated in Paper_Analyses/Evaluate_Embedding_Correlations.ipynb and Paper_Analyses/t-SNE_plots_internal.ipynb
- Train_Final_Model_MLP_baseline.py
- Train_Final_Model_MDAD.py (trains the model over multiple runs, which are later combined into the consensus model)
- save_MDAD_final_predictions.py (saves model predictions for the full dataset over all runs; then saves a “consensus prediction” by averaging predictions across runs)
- save_final_model_embeddings_runs.py (saves model last shared layer embeddings for the full dataset over all runs. Also saves the analogous layer embedding of the MLPs)
- calculate_consensus_MD-AD_embedding.py (Generates a final consensus MD-AD embedding: (1) K-means clustering over all last shared layer node embeddings across runs, (2) for each cluster, identifies the nearest node to the center, (3) these centroids form the new embedding.)
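
The three consensus-embedding steps listed for calculate_consensus_MD-AD_embedding.py can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function names, the array layout (one `(n_samples, n_nodes)` matrix per run), the small NumPy-only k-means standing in for a library implementation, and the synthetic data are all assumptions. It also assumes that "these centroids" in step (3) refers to the nodes selected in step (2).

```python
import numpy as np


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (hypothetical stand-in for a library K-means)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centers = points[start].astype(float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers, assign(points, centers)


def consensus_embedding(run_embeddings, k, seed=0):
    """run_embeddings: list of (n_samples, n_nodes) arrays, one per run."""
    # Treat every node from every run as one point described by its
    # activations across samples: (n_runs * n_nodes, n_samples).
    nodes = np.concatenate([emb.T for emb in run_embeddings], axis=0)
    # Step 1: cluster the pooled nodes.
    centers, labels = kmeans(nodes, k, seed=seed)
    # Step 2: in each cluster, pick the node nearest to the cluster center.
    chosen = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        d = ((nodes[idx] - centers[j]) ** 2).sum(axis=1)
        chosen.append(idx[d.argmin()])
    chosen = np.array(chosen)
    # Step 3: the selected nodes form the consensus embedding.
    return nodes[chosen].T, chosen


# Synthetic stand-in for three runs with 20 samples and 8 embedding nodes each.
rng = np.random.default_rng(0)
runs = [rng.normal(size=(20, 8)) for _ in range(3)]
embedding, chosen = consensus_embedding(runs, k=4)
print(embedding.shape)
```

Selecting an actual node per cluster, rather than the cluster mean, keeps every consensus dimension traceable to one node of one trained run.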
Both the External_Validation and Model_Interpretation scripts below rely on completion of the above scripts, but they can be run independently of each other.
For new unseen datasets, we evaluate MD-AD with two possible approaches: (1) If many of the genes are common between the external and original datasets, we impute missing genes (using linear regression trained on available genes) and then directly apply the final MD-AD model on the new samples. (2) If many genes are missing, we instead train a new MD-AD model on intersecting genes (and relevant baselines) and then evaluate the new custom model on the external dataset.
- Option 1: Direct application of MD-AD model (human brain samples):
- Save_transfer_predictions_runs.ipynb
- Option 2: Training new models with intersecting genes (mouse brain samples, human blood samples):
- Train_intersecting_MTL_models.py
- Train_intersecting_MLP_models.py
- Save_intersecting_model_predictions_runs.ipynb
- For both approaches above:
- Generate linear model predictions: linear_model_ext_val.ipynb
- Generate embeddings for the MD-AD model: save_MDAD_embeddings_ext_val.ipynb
- Generate consensus predictions for the MD-AD model: compute_final_model_predictions.ipynb
- Clean results files to make plotting easier: save_cleaned_results_for_plots.ipynb
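
For approach (1) above, the imputation step might look roughly like the sketch below. It assumes ordinary least squares with an intercept, fitted on the original data from the genes shared with the external dataset; the function names, gene indices, and synthetic data are hypothetical, and the actual notebooks may fit or regularize the regression differently.

```python
import numpy as np


def fit_gene_imputer(train_expr, available_idx, missing_idx):
    """Fit missing genes as a linear function of available genes.

    train_expr: (n_samples, n_genes) expression matrix from the original data.
    Returns coefficients of shape (n_available + 1, n_missing); the last row
    is the intercept.
    """
    X = train_expr[:, available_idx]
    Y = train_expr[:, missing_idx]
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef


def impute_missing_genes(ext_available, coef, available_idx, missing_idx, n_genes):
    """Rebuild a full (n_samples, n_genes) matrix for the external samples."""
    X1 = np.hstack([ext_available, np.ones((ext_available.shape[0], 1))])
    full = np.empty((ext_available.shape[0], n_genes))
    full[:, available_idx] = ext_available  # measured genes are copied as-is
    full[:, missing_idx] = X1 @ coef        # missing genes are predicted
    return full


# Synthetic example: genes 3 and 5 are exact linear functions of the others.
rng = np.random.default_rng(0)
available_idx, missing_idx = [0, 1, 2, 4], [3, 5]
weights = rng.normal(size=(4, 2))


def simulate(n_samples):
    expr = np.empty((n_samples, 6))
    expr[:, available_idx] = rng.normal(size=(n_samples, 4))
    expr[:, missing_idx] = expr[:, available_idx] @ weights + 1.0
    return expr


train_expr = simulate(50)
ext_true = simulate(10)
coef = fit_gene_imputer(train_expr, available_idx, missing_idx)
ext_full = impute_missing_genes(
    ext_true[:, available_idx], coef, available_idx, missing_idx, n_genes=6
)
```

The completed matrix has the same gene order as the training data, so it can be passed to a model trained on the full gene set.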
This folder contains code to extract Integrated Gradients values for each gene on each prediction. These scores are then used to rank genes according to their relevance to MD-AD’s pathology predictions. We also use GSEA to examine the enriched gene sets in the final model rankings.
- Get_MDAD_IG_weights.ipynb - Computes IG weights for each MD-AD run. For each run, generates a (#samples x #genes) matrix of IG values for each of the 6 phenotype outputs
- IG_weights_averaging.ipynb - Performs two kinds of averaging over the gene weights above:
- Averaging over samples: For each run, we obtain average gene scores for each phenotype (#runs x #samples x #genes x #phenotypes) --> (#runs x #genes x #phenotypes)
- Averaging over runs and phenotypes, so for each sample, we obtain a gene importance score: (#runs x #samples x #genes x #phenotypes) --> (#samples x #genes)
- Get_ranked_MDAD_genes.ipynb - Generates consensus gene scores across runs to rank the relative impact of each gene on predicted neuropathology severity
- Get_ranked_correlation_based_genes.ipynb - Baseline comparison approach for the MD-AD ranking. We obtain correlation-based gene rankings by averaging over the correlation coefficients between genes and each phenotype
- Run_GSEA.ipynb - Based on both MD-AD and correlation-based rankings, we check for the enrichment of gene sets in the rankings.
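
The two kinds of averaging described for IG_weights_averaging.ipynb correspond to means over different axes of a single 4-D array. The sketch below uses random numbers in place of real IG values, and the variable names and array sizes are hypothetical. It takes a plain signed mean; whether the notebook averages signed or absolute values is not stated here, so treat that as an assumption.

```python
import numpy as np

# Hypothetical sizes; the 6 phenotype outputs match the description above.
n_runs, n_samples, n_genes, n_phenotypes = 3, 5, 4, 6

# Random stand-in for IG values: (#runs x #samples x #genes x #phenotypes).
rng = np.random.default_rng(0)
ig = rng.normal(size=(n_runs, n_samples, n_genes, n_phenotypes))

# Averaging over samples: (#runs x #genes x #phenotypes).
per_run_gene_scores = ig.mean(axis=1)

# Averaging over runs and phenotypes: (#samples x #genes).
per_sample_gene_scores = ig.mean(axis=(0, 3))

print(per_run_gene_scores.shape, per_sample_gene_scores.shape)
```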
These notebooks process results from the MD-AD pipeline and generate figures presented in the paper:
- Figure 2a - Paper_Analyses/CV_Prediction_Performance.ipynb
- Figure 2b - Paper_Analyses/Subsets_CV_ROSMAP_Plots.ipynb
- Figures 2c-d, 7a-b - Paper_Analyses/External_validation_predictions.ipynb
- Figures 3a-c - Paper_Analyses/Evaluate_Embedding_Correlations.ipynb and Paper_Analyses/t-SNE_plots_internal.ipynb
- Figures 3d-e, 7c - Paper_Analyses/t-SNE_embeddings_external.ipynb
- Figures 4a-b, 6c - Paper_Analyses/Final_genes_color_by_genesets.ipynb
- Figure 4c - Paper_Analyses/Final_genes_rank_comparisons.ipynb
- Figures 5 and 6a-b - Paper_Analyses/Final_genes_interaction_analyses.ipynb
This software was originally designed and run on a system running Ubuntu 16.04.3 with Python 3.3.6. For neural network model training and interpretation, we used a single Nvidia GeForce GTX 980 Ti GPU, though we anticipate that other GPUs will also work. Standard Python packages used: TensorFlow (1.3.0), Keras (2.0.4), numpy (1.17.3), pandas (0.24.1), scipy (1.3.1), scikit-learn (0.21.3), matplotlib (3.1.2), seaborn (0.9.0), h5py (2.9.0). We additionally used the Python packages IntegratedGradients and GSEApy.
Gene expression and phenotype data: The results published here are based on data obtained from the AD Knowledge Portal. Postmortem brain gene expression samples and phenotype labels are available from the AD Knowledge Portal for the following data sets (with listed synapse.org Synapse IDs): ACT (syn5759376), ROSMAP (syn3219045), MSBB (RNA Sequencing: syn3159438, Microarray: syn3157699), Mayo Clinic Brain Bank (syn5550404). Requirements for use of these data sets are listed on the Synapse pages for each data set. HBTRC data was downloaded from the Gene Expression Omnibus (GEO) under accession code GSE44772. Human blood gene expression and phenotype data from the AddNeuroMed cohort are available from GEO under accession codes GSE63060 and GSE63061. Mouse brain gene expression samples and associated phenotypes are available from GEO under accession code GSE64398.
Pathways and gene sets: In our analyses, we evaluated our results with respect to publicly available gene sets. These include REACTOME and KEGG pathways available from MSigDB (c2 pathways v7.0). We also obtained gene signatures from Olah et al. (2020) and Mathys et al. (2019) (each available as supplementary data from the respective papers).