To train polygenic risk models for biomarkers and blood measurements, we fit a multivariate Lasso model for each trait using the snpnet
package[1,2,3]. The snpnet
package and BASIL
algorithm are described in our preprint[1] and its enhanced implementation is described in another preprint[2]. The R
package is available in our GitHub repository[3].
pheno.info.tsv
has the list of 66 traits used in this analysis. This consists of 35 blood and urine biomaker traits[4] and 31 blood measurements[5]. The phenotype definition of the blood measurements are described in this paper[6].
Our PRSs are in the form of multivariate regression. Specifically, we fit the following models for each dataset.
- blood measurements: y ~ age + sex + Array + N_CNV + LEN_CNV + PC1 + ··· + PC10 + Σj Gj
y
is the blood measurements in the original scale- Gj: genetic variants, which includes SNVs and small indels from the genotyping array and the imputed dataset, HLA allelotypes, and copy number variations (CNVs)[7]. We used in total of 5182706 variants in this analysis.
- age is computed from the birth year of the individual
- sex is an indicator variable (1 indicates male; 0 indicates female)
- Array is an indicator variable for the types of genotyping array used in UK Biobank study (1 indicates UKBB array; 0 indicates UKBL array)
- N_CNV and LEN_CNV: the number and length of CNVs.
- PC1-PC10: the first 10 genotype PCs.
- 35 biomarker traits: y ~ N_CNV + LEN_CNV
y
is the covariate-adjusted biomarker measurements. Please see our manuscript[4] for more information about our covariate adjstment procedure. Note that we have updated the covariate adjustment procedure since the initial draft of the manuscript and are still preparing the updated manuscript.- Gj, N_CNV, and LEN_CNV: genetic variant and the number and length of CNVs as described above.
We evaluated the performance of PRSs with R2 and summarize them in two tables, metric.summary.blood.tsv
and metric.summary.biomarkers.tsv
, for the blood measurements[5] and the biomarker traits[4], respectively. Those two files have the following 4 columns:
phenotype
: the phenotype namegeno_and_covars
: the R2 in the test set using the risk score computed from both genetic and covariate information.geno
: the R2 in the test set using the risk score computed from genetic variation alone.covars
: the R2 in the test set using the risk score computed from covariates alone.
Note: for the 35 biomarker phenotypes, we used covariate-adjusted biomarker measurements for the inputs for the model and the "covariates" in this evaluation specifically refers to two additional covariates, N_CNV, and LEN_CNV. For the blood measurements, we used the original values as the input.
We also provide the performance metrics computed on the training and the validation set, which is used to choose the optimial L1 regularization parameter for Lasso regression, in metric.R2.tsv
. It has the following 13 colmuns:
-
- GBE_ID: the phenotype ID in our dataset release.
-
- pheno: the full phenotype name.
-
- pheno_plot: a simplified phenotype names, which we often use for visualization.
-
- dataset: indicates whether the trait belongs to the blood measurements or the 35 biomarkers.
- 5-13. The R2 metric computed for the training, validation, and test sets. We report the metric using the three sets of features (genetic information + covariates, genetic information alone, and covariates alone) as described above. The column names are in the form of
<data set split>_<feature set>
.
Additionally, we also generated performance plots for PRSs of the blood measurements (available in figs
directory).
Figure. Polygenic risk scores and their relationship to predicted lymphocyte percentage. (left) Relationship between PRS for lymphocyte percentage (%) and lymphocyte percentage (%) in a held out test set. (right) Lymphocyte percentage (%) and its corresponding standard error at each PRS quantile of lymphocyte percentage PRS.
The weights of our PRS models are available from our Google Drive shared folder: https://bit.ly/rivas-lab_covid19_PRS_weights.
In the shared directory, you should be able to find a tar.gz
file, named as rivas-lab_covid19_PRS_weights.YYYYMMDD-hhmmss.tar.gz
. The YYYYMMDD-hhmmss
represents the date and time of the data release.
Once you extracted the tar file ($ tar -xzvf rivas-lab_covid19_PRS_weights.YYYYMMDD-hhmmss.tar.gz
), you should able to see two files for each trait in our analysis.
<trait ID>.tsv
: the PRS weights for genetic variants. This file has the following columns:- CHROM: the chromosome
- POS: the position (we use
GRCh37
assembly) - ID: the variant ID
- REF: the reference allele
- ALT: the alternate allele
- BETA: the effect size estimate in our PRS model, computed for the ALT allele.
<trait ID>.covars.tsv
the PRS weights for covariates. This file has the following columns:- ID: the covariates
- BETA: the effect size estimate in our PRS model
For blood related measurements, we additionall provide plots for the performance evaluation.
<trait ID>.sscore.png
: two panel plots summarizing the comparison of the predicted trait values (based on non-covariate features alone) and the observed trait value. Those files are the same ones as infigs
directory.<trait ID>.sscore.summary.tsv
: the source files used to generate the panel on the right. It has the following columns:bin_str
: this column contains the intervals for the PRS bins. Computer friendly formats are inl_bin
(lower bound) andu_bin
(upper bound).- mean_str: the trait value in the specified bin in our test set. Computer friendly formats are in
mean
(mean value),std_err
(standard error), andl_err
andu_err
(error bars).
Our trait IDs are written as GBE_ID
in pheno.info.tsv
. The full performance metric table (metric.R2.tsv
) also has the GBE_ID
column.
We provide summary statistics (mean, sd, 25,50,75%-tile values in our validation set) of the PRSs so that one can compute the percentile of the PRS for new individuals. Please check PRS.dist.statistic.tsv
. The distribution of the PRSs across 8 blood measurements are plotted in PRS.dist.png
.
1_phe_prep.ipynb
: phenotype file prep. We merge the GWAS covariates file, master phe file (for blood measurements phenotypes), and biomarker phenotype file (which is not a part of master phe file).2a_snpnet.biomarkers.sbatch.v3.sh
: snpnet sbatch script for the biomarker phenotypes.2b_snpnet.blood.sbatch.v3.sh
: snpnet sbatch script for the blood measurement phenotypes.3_export_intermediate.sh
: a script to export the intermediate results4_covar_score.R
: a script to compute the risk scores from the covariates5_eval.R
: a script to compute R2 metric.6_update_evals.sh
: a wrapper script to update the evaluation metrics.7_aggregate_eval_metrics.sh
and7_aggregate_eval_metrics.R
: a pair of scripts to update the metric tables.8_copy_PRS_weights.sh
: a script to copy the PRS weights and upload to the Google Drive shared folder.9_PRS_plots.R
and9_PRS_plots.sh
: this script is used to generate the PRS evaluation plot.10_PRS_score_summary.R
: this script summarizes the mean and the standard error of the phenotypes stratified by PRS.11_copy_PRS_eval_plots.sh
: this script copies the plots generated from9_PRS_plots.R
tofigs
directory.12_PRS_dist_summary.ipynb
: this notebook was used to compute summary statistics for PRS normalization (PRS.dist.statistic.tsv
).
The job submission commands are summarized in this document: snpnet_job_submission.md
.
We thank Stanford Research Computing Center for providing prioritized queue for COVID-19 research[8].
- Qian, J. et al. A Fast and Flexible Algorithm for Solving the Lasso in Large-scale and Ultrahigh-dimensional Problems. bioRxiv 630079 (2019).
- Li, R. et al. Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank. bioRxiv 2020.01.20.913194 (2020).
- GitHub:rivas-lab/snpnet - Efficient Lasso Solver for Large-scale genetic variant data. (Rivas Lab, 2019).
- Sinnott-Armstrong, N. et al. Genetics of 38 blood and urine biomarkers in the UK Biobank. bioRxiv 660506 (2019).
- UK Biobank : Category 100081. Blood count - Blood assays - Assay results - Biological samples.
- Tanigawa, Y. et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nat Commun 10, 1–14 (2019).
- Aguirre, M., Rivas, M. A. & Priest, J. Phenome-wide Burden of Copy-Number Variation in the UK Biobank. The American Journal of Human Genetics 105, 373–383 (2019).
- Sherlock joins the fight against COVID-19. Stanford Research Computing Center.