Polygenic risk scores (PRSs) from `snpnet`

To train polygenic risk models for biomarkers and blood measurements, we fit a multivariate Lasso model for each trait using the snpnet package[1,2,3]. The snpnet package and the BASIL algorithm are described in our preprint[1], and an enhanced implementation is described in a second preprint[2]. The R package is available in our GitHub repository[3].

List of 61 traits used in our analysis

pheno.info.tsv has the list of 66 traits used in this analysis. This consists of 35 blood and urine biomarker traits[4] and 31 blood measurements[5]. The phenotype definitions of the blood measurements are described in this paper[6].

Our PRS models

Our PRSs are in the form of multivariate regression. Specifically, we fit the following models for each dataset.

blood measurements: y ~ age + sex + Array + N_CNV + LEN_CNV + PC1 + ··· + PC10 + Σ_j G_j
- y is the blood measurement on its original scale
- G_j: genetic variants, which include SNVs and small indels from the genotyping array and the imputed dataset, HLA allelotypes, and copy number variations (CNVs)[7]. We used a total of 5,182,706 variants in this analysis.
- age is computed from the birth year of the individual
- sex is an indicator variable (1 indicates male; 0 indicates female)
- Array is an indicator variable for the type of genotyping array used in the UK Biobank study (1 indicates UKBB array; 0 indicates UKBL array)
- N_CNV and LEN_CNV: the number and length of CNVs.
- PC1-PC10: the first 10 genotype PCs.
35 biomarker traits: y ~ N_CNV + LEN_CNV + Σ_j G_j
- y is the covariate-adjusted biomarker measurement. Please see our manuscript[4] for more information about our covariate adjustment procedure. Note that we have updated the covariate adjustment procedure since the initial draft of the manuscript and are still preparing the updated manuscript.
- G_j, N_CNV, and LEN_CNV: genetic variant and the number and length of CNVs as described above.

Performance of polygenic risk scores (PRSs)

We evaluated the performance of the PRSs using R² and summarized the results in two tables, metric.summary.blood.tsv and metric.summary.biomarkers.tsv, for the blood measurements[5] and the biomarker traits[4], respectively. These two files have the following 4 columns:

phenotype: the phenotype name
geno_and_covars: the R² in the test set using the risk score computed from both genetic and covariate information.
geno: the R² in the test set using the risk score computed from genetic variation alone.
covars: the R² in the test set using the risk score computed from covariates alone.

Note: for the 35 biomarker phenotypes, the model inputs are covariate-adjusted biomarker measurements, so "covariates" in this evaluation refers specifically to the two additional covariates, N_CNV and LEN_CNV. For the blood measurements, we used the original values as the input.
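
As a rough illustration of how the three columns relate, the sketch below computes R² for three risk scores against the same observed values. It assumes R² is the squared Pearson correlation between the score and the phenotype, which may differ from what 5_eval.R actually computes. The function name and the toy numbers are made up for the example.

```python
def r2_from_score(observed, score):
    """Squared Pearson correlation between observed values and a risk score.

    Assumption: this is one common definition of R^2 for a single score;
    it is not confirmed to be the definition used in 5_eval.R.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(score) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, score))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in score)
    return cov * cov / (var_o * var_s)


# Toy values for four hypothetical test-set individuals.
observed = [1.0, 2.0, 3.0, 4.0]
covar_score = [0.5, 0.5, 1.5, 1.5]   # score from covariates alone
geno_score = [0.4, 1.6, 1.4, 2.6]    # score from genetic variants alone
combined_score = [c + g for c, g in zip(covar_score, geno_score)]

metrics = {
    "geno_and_covars": r2_from_score(observed, combined_score),
    "geno": r2_from_score(observed, geno_score),
    "covars": r2_from_score(observed, covar_score),
}
for name, value in metrics.items():
    print(f"{name}\t{value:.4f}")
```

The three keys mirror the column names of the summary tables, so one row of either table corresponds to one such dictionary per phenotype.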

We also provide the performance metrics computed on the training and validation sets in metric.R2.tsv; the validation set is used to choose the optimal L₁ regularization parameter for Lasso regression. The file has the following 13 columns:

1. GBE_ID: the phenotype ID in our dataset release.
2. pheno: the full phenotype name.
3. pheno_plot: a simplified phenotype name, which we often use for visualization.
4. dataset: indicates whether the trait belongs to the blood measurements or the 35 biomarkers.
5-13. The R² metric computed for the training, validation, and test sets. We report the metric using the three sets of features (genetic information + covariates, genetic information alone, and covariates alone) as described above. The column names are in the form of <data set split>_<feature set>.
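
Because the feature-set label geno_and_covars itself contains underscores, splitting a column name back into its two parts needs a little care. A minimal sketch, assuming the split labels are train, val, and test (hypothetical; check the header of metric.R2.tsv for the actual labels) and that no split label contains an underscore:

```python
from itertools import product

# Hypothetical split labels; the real ones are in the metric.R2.tsv header.
SPLITS = ["train", "val", "test"]
# Feature-set labels as used in the summary tables described above.
FEATURE_SETS = ["geno_and_covars", "geno", "covars"]


def metric_columns():
    """Enumerate the nine <data set split>_<feature set> column names."""
    return [f"{split}_{features}" for split, features in product(SPLITS, FEATURE_SETS)]


def parse_metric_column(name):
    """Split a column name at the first underscore only.

    Splitting on every underscore would break 'geno_and_covars' apart.
    """
    split, features = name.split("_", 1)
    return split, features


columns = metric_columns()
print(len(columns), "metric columns")
print(parse_metric_column("test_geno_and_covars"))
```

Together with the four descriptive columns, the nine metric columns account for the 13 columns of the file.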

We also generated performance plots for the PRSs of the blood measurements (available in the figs directory).

Figure. Polygenic risk scores and their relationship to predicted lymphocyte percentage. (left) Relationship between PRS for lymphocyte percentage (%) and lymphocyte percentage (%) in a held out test set. (right) Lymphocyte percentage (%) and its corresponding standard error at each PRS quantile of lymphocyte percentage PRS.

Weights of polygenic risk scores

The weights of our PRS models are available from our Google Drive shared folder: https://bit.ly/rivas-lab_covid19_PRS_weights.

In the shared directory, you should find a tar.gz file named rivas-lab_covid19_PRS_weights.YYYYMMDD-hhmmss.tar.gz, where YYYYMMDD-hhmmss represents the date and time of the data release.

Once you have extracted the archive ($ tar -xzvf rivas-lab_covid19_PRS_weights.YYYYMMDD-hhmmss.tar.gz), you should be able to see two files for each trait in our analysis.

<trait ID>.tsv: the PRS weights for genetic variants. This file has the following columns:
- CHROM: the chromosome
- POS: the position (we use GRCh37 assembly)
- ID: the variant ID
- REF: the reference allele
- ALT: the alternate allele
- BETA: the effect size estimate in our PRS model, computed for the ALT allele.
<trait ID>.covars.tsv: the PRS weights for covariates. This file has the following columns:
- ID: the covariates
- BETA: the effect size estimate in our PRS model
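
To make the file layout concrete, here is a minimal sketch of how a score could be computed from the two weight files: the sum of BETA times the ALT allele dosage over variants, and the sum of BETA times the covariate value over covariates. The variant IDs, BETA values, and helper names below are invented for illustration, and skipping entries that are missing from the input data is an assumption of this sketch rather than a documented behavior of the scripts in this directory.

```python
import csv
import io


def read_betas(tsv_text):
    """Parse a weights table (tab-separated, with ID and BETA columns)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["ID"]: float(row["BETA"]) for row in reader}


def linear_score(betas, values):
    """Sum of BETA * value over the IDs present in both inputs.

    Assumption: IDs missing from `values` are silently skipped.
    """
    return sum(beta * values[key] for key, beta in betas.items() if key in values)


# Toy stand-ins for <trait ID>.tsv and <trait ID>.covars.tsv (invented values).
variant_tsv = (
    "CHROM\tPOS\tID\tREF\tALT\tBETA\n"
    "1\t1000\tvar_a\tA\tG\t0.5\n"
    "2\t2000\tvar_b\tC\tT\t-0.25\n"
)
covar_tsv = "ID\tBETA\nage\t0.01\nsex\t0.2\n"

# One hypothetical individual: ALT allele dosages (0, 1, or 2) and covariates.
alt_dosages = {"var_a": 2, "var_b": 1}
covariates = {"age": 50, "sex": 1}

geno_score = linear_score(read_betas(variant_tsv), alt_dosages)
covar_score = linear_score(read_betas(covar_tsv), covariates)
print(geno_score, covar_score, geno_score + covar_score)
```

In practice the weights would be read from the extracted files and the dosages from a genotype file on the GRCh37 assembly, with REF and ALT checked against the weights so that BETA is applied to the correct allele.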

For the blood measurements, we additionally provide plots for the performance evaluation.

<trait ID>.sscore.png: a two-panel plot summarizing the comparison of the predicted trait values (based on non-covariate features alone) and the observed trait values. These files are the same as those in the figs directory.
<trait ID>.sscore.summary.tsv: the source files used to generate the panel on the right. It has the following columns:
- bin_str: the interval of the PRS bin. Machine-readable values are in l_bin (lower bound) and u_bin (upper bound).
- mean_str: the trait value in the specified bin in our test set. Machine-readable values are in mean (mean value), std_err (standard error), and l_err and u_err (error bars).
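
The summary table can be thought of as a per-bin mean and standard error of the trait. The following sketch shows one way to compute such a table. It assumes equal-sized bins ordered by PRS and error bars of one standard error around the mean, neither of which is confirmed to match 10_PRS_score_summary.R, and the data are toy values.

```python
import math
import statistics


def summarize_by_prs_bin(prs, trait, n_bins):
    """Mean and standard error of the trait within equal-sized PRS bins.

    Assumptions: bins are equal-count slices of individuals sorted by PRS,
    l_bin/u_bin are quantile fractions, and l_err/u_err are mean -/+ one
    standard error. Each bin needs at least two individuals.
    """
    order = sorted(range(len(prs)), key=lambda i: prs[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        values = [trait[i] for i in members]
        mean = statistics.mean(values)
        std_err = statistics.stdev(values) / math.sqrt(len(values))
        rows.append({
            "l_bin": b / n_bins,
            "u_bin": (b + 1) / n_bins,
            "mean": mean,
            "std_err": std_err,
            "l_err": mean - std_err,
            "u_err": mean + std_err,
        })
    return rows


# Toy test set of eight hypothetical individuals.
prs = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8, 0.3, 0.7]
trait = [10.0, 18.0, 13.0, 15.0, 11.0, 17.0, 12.0, 16.0]

rows = summarize_by_prs_bin(prs, trait, n_bins=2)
for row in rows:
    print(row)
```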

Our trait IDs are written as GBE_ID in pheno.info.tsv. The full performance metric table (metric.R2.tsv) also has the GBE_ID column.

Normalization of PRS scores

We provide summary statistics of the PRSs (mean, standard deviation, and the 25th, 50th, and 75th percentile values in our validation set) so that one can compute the percentile of the PRS for new individuals. Please check PRS.dist.statistic.tsv. The distribution of the PRSs across 8 blood measurements is plotted in PRS.dist.png.
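
For example, given the reference mean and standard deviation for a trait, a new individual's PRS can be standardized and mapped to an approximate percentile. This sketch assumes the PRS is roughly normally distributed in the reference set, which should be checked against PRS.dist.png. The function name and the numbers are hypothetical; in practice the mean and standard deviation would be read from PRS.dist.statistic.tsv.

```python
from statistics import NormalDist


def prs_percentile(raw_score, ref_mean, ref_sd):
    """Return (z-score, approximate percentile) for a raw PRS.

    Assumption: the reference PRS distribution is approximately normal,
    so the percentile is taken from the standard normal CDF.
    """
    z = (raw_score - ref_mean) / ref_sd
    return z, 100.0 * NormalDist().cdf(z)


# Hypothetical reference statistics and a hypothetical new individual.
z, percentile = prs_percentile(raw_score=1.5, ref_mean=1.0, ref_sd=0.5)
print(f"z = {z:.2f}, percentile = {percentile:.1f}")
```

The reported 25th, 50th, and 75th percentile values offer a quick check on the normality assumption: under it, the median should be close to the mean, and the quartiles should sit about 0.67 standard deviations on either side.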

List of scripts/notebooks

1_phe_prep.ipynb: phenotype file prep. We merge the GWAS covariates file, master phe file (for blood measurements phenotypes), and biomarker phenotype file (which is not a part of master phe file).
2a_snpnet.biomarkers.sbatch.v3.sh: snpnet sbatch script for the biomarker phenotypes.
2b_snpnet.blood.sbatch.v3.sh: snpnet sbatch script for the blood measurement phenotypes.
3_export_intermediate.sh: a script to export the intermediate results
4_covar_score.R: a script to compute the risk scores from the covariates
5_eval.R: a script to compute R² metric.
6_update_evals.sh: a wrapper script to update the evaluation metrics.
7_aggregate_eval_metrics.sh and 7_aggregate_eval_metrics.R: a pair of scripts to update the metric tables.
8_copy_PRS_weights.sh: a script to copy the PRS weights and upload to the Google Drive shared folder.
9_PRS_plots.R and 9_PRS_plots.sh: a pair of scripts to generate the PRS evaluation plots.
10_PRS_score_summary.R: this script summarizes the mean and the standard error of the phenotypes stratified by PRS.
11_copy_PRS_eval_plots.sh: this script copies the plots generated by 9_PRS_plots.R to the figs directory.
12_PRS_dist_summary.ipynb: this notebook was used to compute summary statistics for PRS normalization (PRS.dist.statistic.tsv).

Job submission commands

The job submission commands are summarized in this document: snpnet_job_submission.md.

Acknowledgement

We thank the Stanford Research Computing Center for providing a prioritized queue for COVID-19 research[8].
