Skip to content

Latest commit

 

History

History
316 lines (231 loc) · 12.5 KB

README.md

File metadata and controls

316 lines (231 loc) · 12.5 KB

PGS toolset

More on PGS: https://choishingwan.github.io/PRS-Tutorial/

Codes

  • README.md: this file
  • pgs/pgs.py: Python class definitions setup to create bash commands for different PGS tools (plink, PRSice2, LDpred2, etc.)
  • run_pgs_synthetic.py: Python run script setup to run PGS tools on synthetic datas provided in this repository, mainly for testing purposes.
  • run_pgs_w_QC.py: Python run script setup to run PGS tools on datas provided in this repository, mainly for testing purposes. Performs basic QC steps.
  • run_pgs_MoBa_*.py: Python run scripts to run PGS tools on MoBa datas.
  • pgs_exec.py: gwas.py like command line tool that can be used to run, create bash and slurm job scripts for different PGS tools based on user input.
  • vis_pgs_synthetic.ipynb: Jupyter notebook plotting/comparing the output PGS scores generated by the run_pgs_synthetic.py script.
  • vis_pgs_w_QC.ipynb: Jupyter notebook plotting/comparing the output PGS scores generated by the run_pgs_w_QC.py script.
  • start_jupyter_server.py: start a Jupyter server using the python3.sif container (allows jupyter notebooks within VSCode with the Jupyter extension)
  • config.yaml: YAML file defining some parameters for Slurm jobscripts and PGS methods
  • pgs_exec_example_*.sh: Example bash scripts for pgs_exec.py
  • requirements.txt: Python package requirements. Install with pip install -r requirements.txt
  • Rscripts/*.R: misc. R scripts defining or being used by the PGS tools or optional QC steps.
  • tests/: directory for unit tests, executable with py.test -v tests in this directory

Requirements

The basic requirements for running these codes (sans project specific genomics data) are the same as for the rest of this project, i.e., a basic Python environment and a working Singularity/Apptainer installation.

Running the codes

Running these codes requires Python 3.8+ (tested/developed mainly using Python 3.9+) and a working Singularity/Apptainer installation.

Input files

To work with these codes, some input files are required. These are:

Summary statistics

GWAS summary files formatted according to the sumstats specification

Phenotypic data

Phenotypic data formatted according to the phenotype specification

Genotypic data

Genotypic data formatted according to the genotype specification

Covariate data

Covariate data formatted according to the covariate specification

Output files

The output will be written to a user-specified output/ directory, which is created if it does not exist. The output directory will contain subdirectories for each PGS method, e.g., output/PGS_synthetic_plink/ as specified in the runtime scripts.

PGS scores

The individ level output file shared by all PGS methods is named test.score. The text file contains the PGS scores for each individual in the phenotype file. The first two columns are FID and IID. The third column score is the PGS score.

NOTE: For PLINK and PRSice-2, the score column contains the "best" PGS score, i.e., the one with the highest R2 for the tested range of p-value thresholds. The scores for each p-value threshold are also written to the output directory as separate files.

Summary statistics

The file test_summary.txt contains the R (generalized) linear model (LM/GLM) summary statistics for the PGS model in plain-text format, while test_summary.csv contains the LM/GLM summary statistics in tabular (.csv) format. Both the full model and the null model are reported, typically assumed to be on the form

full model $$y \sim PGS_\mathrm{score} + PC_1 + PC_2 + ... + PC_n + SEX$$

null model $$y_\mathrm{null} \sim PC_1 + PC_2 + ... + PC_n + SEX$$

nocov model $$y_\mathrm{nocov} \sim PGS_\mathrm{score}$$

For binary traits, the GLM should use the binomial family and the logit link function to fit the model.

For binary traits, we also summarize Odds Ratios (OR) and 95% confidence intervals (CI) for the PGS models. These outputs are written to plaintext and tabular files named test_summary.or.txt and test_summary.<null/full/nocov>.or.csv, respectively.

Python runtime scripts

run_pgs_synthetic.py

Run PGS using PLINK, PRSice2 and LDpred2 on synthetic data provided in this repository, namely the files:

  • summary statistics: /REF/examples/ldpred2/trait1.sumstats.gz
  • phenotype data: /REF/examples/ldpred2/simu.pheno
  • genotype data: /REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1.bed/bim/fam
  • covariate data: /REF/examples/prsice2/EUR.cov

NOTE: Files from a full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data is required and should be located in a directory ldpred2_ref as defined in the config.yaml file.

Running the script:

$ python3 run_pgs_synthetic.py
environment variables in use:
        ROOT_DIR: /nrec/space/espenh
        CONTAINERS: /nrec/space/espenh/containers
...
# A tibble: 1 × 12
  r.squared adj.r.squ…¹ sigma stati…²   p.value    df logLik   AIC   BIC devia…³
      <dbl>       <dbl> <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.765       0.762 0.494    201. 3.47e-150     8  -354.  728.  770.    120.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
[1] "R2 (null model): 0.0250978564710413"
[1] "R2 (full model): 0.765387984851408"

NOTE: The script may require other packages than Python builtins, such as pandas and pyyaml. Install these by issuing pip install <package> in the current Python environment. There is also pip install -r requirements.txt

The output will be added to the output/PGS_synthetic_<method>/ directories.

run_pgs_w_QC.py

Run PGS using PLINK, PRSice2 and LDpred2 on tutorial data provided in this repository, including some basic QC steps on the data based on suggestions from this tutorial.

The main input files are:

  • summary statistics: /REF/examples/prsice2/Height.gwas.txt.gz
  • phenotype data: /REF/examples/prsice2/EUR.height
  • genotype data: /REF/examples/prsice2/EUR.bed/bim/fam
  • covariates: /REF/examples/prsice2/EUR.cov
  • eigenvectors: /REF/examples/prsice2/EUR.eigenvec

NOTE: Files from a full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data is required and should be located in a directory ldpred2_ref as defined in the config.yaml file.

Running the script:

python3 run_pgs_w_QC.py

run_pgs_MoBa_*.py

Run basic PGS for height using PLINK, PRSice2 and LDpred2, respectively on MoBa child sample data (only on TSD p697).

pgs_exec.py command line tool

Run, create bash and slurm job scripts for different PGS tools

Get a list of options:

$ python3 pgs_exec.py --help
usage: PGS [-h] [--method {plink,prsice2,ldpred2-inf,ldpred2-auto}] [--config CONFIG] [--sumstats-file SUMSTATS_FILE] [--pheno-file PHENO_FILE] [--phenotype PHENOTYPE] [--phenotype-class {CONTINUOUS,BINARY}] [--geno-file-prefix GENO_FILE_PREFIX] [--output-dir OUTPUT_DIR] [--runtype {sh,slurm,subprocess}]
           {plink,prsice2,ldpred2-inf,ldpred2-auto} ...

A pipeline for PGS analysis

positional arguments:
  {plink,prsice2,ldpred2-inf,ldpred2-auto}

optional arguments:
  -h, --help            show this help message and exit
  --method {plink,prsice2,ldpred2-inf,ldpred2-auto}
                        Method for PGS
  --config CONFIG       config YAML file
  --sumstats-file SUMSTATS_FILE
                        summary statistics file
  --pheno-file PHENO_FILE
                        phenotype file
  --phenotype PHENOTYPE
                        phenotype name (must be a column header in ``pheno_file``)
  --phenotype-class {CONTINUOUS,BINARY}
                        phenotype class
  --geno-file-prefix GENO_FILE_PREFIX
                        file path to .bed, .bim, .fam, etc. files
  --output-dir OUTPUT_DIR
                        Output file directory
  --runtype {sh,slurm,subprocess}
                        operation mode

Example w. PRSice2 as subprocess on synthetic dataset (pgs_exec_example_1.sh):

python3 pgs_exec.py \
    --sumstats-file /REF/examples/ldpred2/trait1.sumstats.gz \
    --pheno-file /REF/examples/ldpred2/simu.pheno \
    --phenotype trait1 \
    --phenotype-class CONTINUOUS \
    --geno-file-prefix /REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1 \
    --output-dir output/PGS_synthetic_prsice2 \
    --runtype subprocess \
    prsice2 \
    --covariate-file /REF/examples/prsice2/EUR.cov \
    --eigenvec-file output/PGS_synthetic_prsice2/g1000_eur_chr21to22_hm3rnd1.eigenvec

NOTE: The last two lines will override settings for method: prsice2 in config.yaml file, being parsed to the PRSice2.r script

Example w. LDpred2-inf via shell (sh) script on synthetic dataset (pgs_exec_example_2.sh):

python3 pgs_exec.py \
    --sumstats-file /REF/examples/ldpred2/trait1.sumstats.gz \
    --pheno-file /REF/examples/ldpred2/simu.pheno \
    --phenotype trait1 \
    --phenotype-class CONTINUOUS \
    --geno-file-prefix /REF/examples/ldpred2/g1000_eur_chr21to22_hm3rnd1 \
    --output-dir output/PGS_synthetic_LDpred2_inf \
    --runtype sh \
    ldpred2-inf \
    --covariate-file /REF/examples/prsice2/EUR.cov \
    --eigenvec-file output/PGS_synthetic_LDpred2_inf/g1000_eur_chr21to22_hm3rnd1.eigenvec \
    --file-geno-rds output/PGS_synthetic_LDpred2_inf/g1000_eur_chr21to22_hm3rnd1.rds \
    file-keep-snps /REF/hapmap3/w_hm3.justrs \
    chr2use 21,22

Which generates a shell script that can be run as

bash bash_scripts/ldpred2-inf-230918-12:26:35.sh  # YYMMDD-HH:MM:SS is appended to file name 

NOTE Replacing --runtype 'sh' with --runtype 'slurm' and 'ldpred2-inf' by 'ldpred2-auto' generates a slurm jobscript using LDpred2-auto which can be submitted by issuing bash slurm_job_scripts/ldpred2-auto.job (cf. pgs_exec_example_3.sh)

Config file

The config.yaml file defines some parameters for Slurm jobscripts, job environment, and PGS methods in a YAML file. The parameters defined throughout the file are:

parameters to pass for SLURM jobs

Entries that will be put in the SLURM job scripts. These are:

slurm:
  job_name: pgs  # 
  account: p697_norment  # 
  time: "00:30:00"  # expected runtime
  cpus_per_task: 4  # number of CPU cores per task
  mem_per_cpu: 4000MB
  partition: normal

  # list of modules to load in SLURM jobs
  module_load:
    - singularity/3.7.3  # cf. the output of: "module spider singularity"

environment variables (edit as necessary)

Control where the PGS tools are located, where the reference data is located, etc. These are:

environ:
  # mandatory root directory containing all inferred directories (edit as necessary).
  ROOT_DIR: "/nrec/space/espenh"

# dependent environment variables (edit as necessary)
# NB: "SIF" is mandatory
environ_inferred:
  # folder containing full clone of https://github.com/comorment/containers
  CONTAINERS: '$ROOT_DIR/containers'
  # reference data within containers repo
  REFERENCE: "$CONTAINERS/reference"
  # directory with singularity containers (.sif files)
  SIF: "$CONTAINERS/singularity"
  # folder containing  full clone of https://github.com/comorment/ldpred2_ref with LDpred2 reference data
  LDPRED2_REF: "$ROOT_DIR/ldpred2_ref"
  # folder containing LDpred2 R scripts
  LDPRED2_SCRIPTS: "$CONTAINERS/scripts/pgs/LDpred2"

# for SINGULARITY_BIND variable to set in job scripts
# NB! will be set as "export SINGULARITY_BIND=value0:/key0,value1:/key1,..."
# NB! Also mandatory
SINGULARITY_BIND: 
  REF: '$REFERENCE'
  ldpred2_ref: '$LDPRED2_REF'
  ldpred2_scripts: '$LDPRED2_SCRIPTS'

Parameters specific to each PGS calculating tool

Refer class documentation of each tool for details

Plink

used by class PGS_PLINK

plink:
  clump_p1: 1
  clump_r2: 0.1
  clump_kb: 250
  range_list: [0.001, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1]
  strat_indep_pairwise: [250, 50, 0.25]
  nPCs: 6
  score_columns: [SNP, A1, BETA]
  threads: 4

PRSice-2

used by class PGS_PRSice2

prsice2:
  MAF: 0.01
  INFO: 0.8
  nPCs: 6
  thread: 4

LDpred2

used by class PGS_LDpred2

ldpred2:
  nPCs: 6
  cores: 4