More on PGS: https://choishingwan.github.io/PRS-Tutorial/
README.md
: this filepgs/pgs.py
: Python class definitions setup to create bash commands for different PGS tools (plink
,PRSice2
,LDpred2
, etc.)run_pgs_synthetic.py
: Python run script setup to run PGS tools on synthetic datas provided in this repository, mainly for testing purposes.run_pgs_w_QC.py
: Python run script setup to run PGS tools on datas provided in this repository, mainly for testing purposes. Performs basic QC steps.run_pgs_MoBa_*.py
: Python run scripts to run PGS tools on MoBa datas.pgs_exec.py
:gwas.py
like command line tool that can be used to run, create bash and slurm job scripts for different PGS tools based on user input.vis_pgs_synthetic.ipynb
: Jupyter notebook plotting/comparing the output PGS scores generated by therun_pgs_synthetic.py
script.vis_pgs_w_QC.ipynb
: Jupyter notebook plotting/comparing the output PGS scores generated by therun_pgs_w_QC.py
script.start_jupyter_server.py
: start a Jupyter server using thepython3.sif
container (allows jupyter notebooks within VSCode with the Jupyter extension)config.yaml
: YAML file defining some parameters for Slurm jobscripts and PGS methodspgs_exec_example_*.sh
: Example bash scripts forpgs_exec.py
requirements.txt
: Python package requirements. Install withpip install -r requirements.txt
Rscripts/*.R
: misc.R
scripts defining or being used by the PGS tools or optional QC steps.tests/
: directory for unit tests, executable withpy.test -v tests
in this directory
The basic requirements for running these codes (sans project specific genomics data) are the same as for the rest of this project, i.e., a basic Python environment and a working Singularity/Apptainer installation.
Running these codes requires Python 3.8+ (tested/developed mainly using Python 3.9+) and a working Singularity/Apptainer installation.
To work with these codes, some input files are required. These are:
GWAS summary files formatted according to the sumstats specification
Phenotypic data formatted according to the phenotype specification
Genotypic data formatted according to the genotype specification
Covariate data formatted according to the covariate specification
The output will be written to a user-specified output/
directory, which is created if it does not exist. The output directory will contain subdirectories for each PGS method, e.g., output/PGS_synthetic_plink/
as specified in the runtime scripts.
The individ level output file shared by all PGS methods is named test.score
. The text file contains the PGS scores for each individual in the phenotype file. The first two columns are FID
and IID
. The third column score
is the PGS score.
NOTE: For PLINK and PRSice-2, the
score
column contains the "best" PGS score, i.e., the one with the highest R2 for the tested range of p-value thresholds. The scores for each p-value threshold are also written to the output directory as separate files.
The file test_summary.txt
contains the R (generalized) linear model (LM/GLM) summary statistics for the PGS model in plain-text format,
while test_summary.csv
contains the LM/GLM summary statistics in tabular (.csv) format.
Both the full model and the null model are reported, typically assumed to be on the form
full model
null model
nocov model
For binary traits, the GLM should use the binomial family and the logit link function to fit the model.
For binary traits, we also summarize Odds Ratios (OR) and 95% confidence intervals (CI) for the PGS models.
These outputs are written to plaintext and tabular files named test_summary.or.txt
and test_summary.<null/full/nocov>.or.csv
, respectively.
Run PGS using PLINK, PRSice2 and LDpred2 on synthetic data provided in this repository, namely the files:
- summary statistics:
/REF/examples/ldpred2/trait1.sumstats.gz
- phenotype data:
/REF/examples/ldpred2/simu.pheno
- genotype data:
/REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1.bed/bim/fam
- covariate data:
/REF/examples/prsice2/EUR.cov
NOTE: Files from a full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data is required and should be located in a directory
ldpred2_ref
as defined in theconfig.yaml
file.
Running the script:
$ python3 run_pgs_synthetic.py
environment variables in use:
ROOT_DIR: /nrec/space/espenh
CONTAINERS: /nrec/space/espenh/containers
...
# A tibble: 1 × 12
r.squared adj.r.squ…¹ sigma stati…² p.value df logLik AIC BIC devia…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.765 0.762 0.494 201. 3.47e-150 8 -354. 728. 770. 120.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
# variable names ¹adj.r.squared, ²statistic, ³deviance
[1] "R2 (null model): 0.0250978564710413"
[1] "R2 (full model): 0.765387984851408"
NOTE: The script may require other packages than Python builtins, such as
pandas
andpyyaml
. Install these by issuingpip install <package>
in the current Python environment. There is alsopip install -r requirements.txt
The output will be added to the output/PGS_synthetic_<method>/
directories.
Run PGS using PLINK, PRSice2 and LDpred2 on tutorial data provided in this repository, including some basic QC steps on the data based on suggestions from this tutorial.
The main input files are:
- summary statistics:
/REF/examples/prsice2/Height.gwas.txt.gz
- phenotype data:
/REF/examples/prsice2/EUR.height
- genotype data:
/REF/examples/prsice2/EUR.bed/bim/fam
- covariates:
/REF/examples/prsice2/EUR.cov
- eigenvectors:
/REF/examples/prsice2/EUR.eigenvec
NOTE: Files from a full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data is required and should be located in a directory
ldpred2_ref
as defined in theconfig.yaml
file.
Running the script:
python3 run_pgs_w_QC.py
Run basic PGS for height using PLINK, PRSice2 and LDpred2, respectively on MoBa child sample data (only on TSD p697).
Run, create bash and slurm job scripts for different PGS tools
Get a list of options:
$ python3 pgs_exec.py --help
usage: PGS [-h] [--method {plink,prsice2,ldpred2-inf,ldpred2-auto}] [--config CONFIG] [--sumstats-file SUMSTATS_FILE] [--pheno-file PHENO_FILE] [--phenotype PHENOTYPE] [--phenotype-class {CONTINUOUS,BINARY}] [--geno-file-prefix GENO_FILE_PREFIX] [--output-dir OUTPUT_DIR] [--runtype {sh,slurm,subprocess}]
{plink,prsice2,ldpred2-inf,ldpred2-auto} ...
A pipeline for PGS analysis
positional arguments:
{plink,prsice2,ldpred2-inf,ldpred2-auto}
optional arguments:
-h, --help show this help message and exit
--method {plink,prsice2,ldpred2-inf,ldpred2-auto}
Method for PGS
--config CONFIG config YAML file
--sumstats-file SUMSTATS_FILE
summary statistics file
--pheno-file PHENO_FILE
phenotype file
--phenotype PHENOTYPE
phenotype name (must be a column header in ``pheno_file``)
--phenotype-class {CONTINUOUS,BINARY}
phenotype class
--geno-file-prefix GENO_FILE_PREFIX
file path to .bed, .bim, .fam, etc. files
--output-dir OUTPUT_DIR
Output file directory
--runtype {sh,slurm,subprocess}
operation mode
Example w. PRSice2 as subprocess on synthetic dataset (pgs_exec_example_1.sh
):
python3 pgs_exec.py \
--sumstats-file /REF/examples/ldpred2/trait1.sumstats.gz \
--pheno-file /REF/examples/ldpred2/simu.pheno \
--phenotype trait1 \
--phenotype-class CONTINUOUS \
--geno-file-prefix /REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1 \
--output-dir output/PGS_synthetic_prsice2 \
--runtype subprocess \
prsice2 \
--covariate-file /REF/examples/prsice2/EUR.cov \
--eigenvec-file output/PGS_synthetic_prsice2/g1000_eur_chr21to22_hm3rnd1.eigenvec
NOTE: The last two lines will override settings for
method: prsice2
inconfig.yaml
file, being parsed to thePRSice2.r
script
Example w. LDpred2-inf via shell (sh
) script on synthetic dataset (pgs_exec_example_2.sh
):
python3 pgs_exec.py \
--sumstats-file /REF/examples/ldpred2/trait1.sumstats.gz \
--pheno-file /REF/examples/ldpred2/simu.pheno \
--phenotype trait1 \
--phenotype-class CONTINUOUS \
--geno-file-prefix /REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1 \
--output-dir output/PGS_synthetic_LDpred2_inf \
--runtype sh \
ldpred2-inf \
--covariate-file /REF/examples/prsice2/EUR.cov \
--eigenvec-file output/PGS_synthetic_LDpred2_inf/g1000_eur_chr21to22_hm3rnd1.eigenvec \
--file-geno-rds output/PGS_synthetic_LDpred2_inf/g1000_eur_chr21to22_hm3rnd1.rds \
file-keep-snps /REF/hapmap3/w_hm3.justrs \
chr2use 21,22
Which generates a shell script that can be run as
bash bash_scripts/ldpred2-inf-230918-12:26:35.sh # YYMMDD-HH:MM:SS is appended to file name
NOTE Replacing
--runtype 'sh'
with--runtype 'slurm'
and'ldpred2-inf'
by'ldpred2-auto'
generates a slurm jobscript using LDpred2-auto which can be submitted by issuingbash slurm_job_scripts/ldpred2-auto.job
(cf.pgs_exec_example_3.sh
)
The config.yaml
file defines some parameters for Slurm jobscripts, job environment, and PGS methods in a YAML file. The parameters defined throughout the file are:
Entries that will be put in the SLURM job scripts. These are:
slurm:
job_name: pgs #
account: p697_norment #
time: "00:30:00" # expected runtime
cpus_per_task: 4 # number of CPU cores per task
mem_per_cpu: 4000MB
partition: normal
# list of modules to load in SLURM jobs
module_load:
- singularity/3.7.3 # cf. the output of: "module spider singularity"
Control where the PGS tools are located, where the reference data is located, etc. These are:
environ:
# mandatory root directory containing all inferred directories (edit as necessary).
ROOT_DIR: "/nrec/space/espenh"
# dependent environment variables (edit as necessary)
# NB: "SIF" is mandatory
environ_inferred:
# folder containing full clone of https://github.com/comorment/containers
CONTAINERS: '$ROOT_DIR/containers'
# reference data within containers repo
REFERENCE: "$CONTAINERS/reference"
# directory with singularity containers (.sif files)
SIF: "$CONTAINERS/singularity"
# folder containing full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data
LDPRED2_REF: "$ROOT_DIR/ldpred2_ref"
# folder containing LDpred2 R scripts
LDPRED2_SCRIPTS: "$CONTAINERS/scripts/pgs/LDpred2"
# for SINGULARITY_BIND variable to set in job scripts
# NB! will be set as "export SINGULARITY_BIND=value0:/key0,value1:/key1,..."
# NB! Also mandatory
SINGULARITY_BIND:
REF: '$REFERENCE'
ldpred2_ref: '$LDPRED2_REF'
ldpred2_scripts: '$LDPRED2_SCRIPTS'
Refer class documentation of each tool for details
used by class PGS_PLINK
plink:
clump_p1: 1
clump_r2: 0.1
clump_kb: 250
range_list: [0.001, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
strat_indep_pairwise: [250, 50, 0.25]
nPCs: 6
score_columns: [SNP, A1, BETA]
threads: 4
used by class PGS_PRSice2
prsice2:
MAF: 0.01
INFO: 0.8
nPCs: 6
thread: 4
used by class PGS_LDpred2
ldpred2:
nPCs: 6
cores: 4