Example implementation of a solution to subchallenge 1 of the BeatAML CTD^2 DREAM challenge. This example uses gene expression to train a RidgeRegression model for each inhibitor to predict AUC.
- Run Jupyter with
docker run -p 8888:8888 -v "$PWD:/home/jovyan" jupyter/scipy-notebook
- Stdout will include a URL to open the notebook
- Go through the steps in
index.ipynb
- The model will be stored in
model/
in two files:pkl_1.csv
andpkl_2.csv
- Read more about the model below
- The model will be stored in
This model can be run on the same data it was trained on, to test whether the Dockerfile works:
SYNAPSE_PROJECT_ID=<...>
docker build -t docker.synapse.org/$SYNAPSE_PROJECT_ID/sc1_model .
docker run -v "$PWD/training/:/input/" -v "$PWD/output:/output/" docker.synapse.org/$SYNAPSE_PROJECT_ID/sc1_model
SYNAPSE_PROJECT_ID=<...>
docker login docker.synapse.org
docker build -t docker.synapse.org/$SYNAPSE_PROJECT_ID/sc1_model .
docker push docker.synapse.org/$SYNAPSE_PROJECT_ID/sc1_model
One Ridge Regression model is trained for each inhibitor to predict AUC. The only input is gene expression (rnaseq.csv).
Specifics:
- The 1000 most variable genes are used for training
- The log2(cpm) values are normalized per-specimen
- The z-score is computed for each gene
- Ridge Regression is trained using hold-one-out cross-validation to predict AUC
The trained model is stored in two "pickles": pkl_1 and pkl_2:
- pkl_1: has one row per gene included in the model and N+3 columns (N is the number of inhibitors):
- gene: Include this gene's expression in the linear fit.
- gene_mean: The mean expression in the training data (to compute z-score).
- gene_std: The standard deviation of expression in the training data (to compute z-score).
- : The Ridge Regression weight coefficient for this gene for
inhibitor
.
- pkl_2: one row per inhibitor and two columns:
- inhibitor: The inhibitor name.
- intercept: The Ridge Regression intercept.