birneylab/flexlmm is a bioinformatics pipeline that runs linear mixed models for Genome-Wide Association Studies. It is not particularly fast or different from other tools, but it is very flexible in the definition of the statistical model to be used and it performs permutations to correct for multiple testing and non-normal phenotypes. p-values are evaluated with a likelyhood ratio test between a 'null model' and a 'real model'. Both models are specified by the user with the R formula interface.
Disclaimer: this pipeline uses the nf-core template but it is not part of nf-core itself.
- Convert vcf genotypes to
pgen
format (plink2
) - Compute the relatedness matrix for the whole genome and each LOCO subset (
plink2
) - Verify that the statistical model specified is nested in the null model (
R language
) - Estimate variance components using the null model fixed effects and the relatedness matrix (
gaston
) - Compute the Cholesky decomposition of the phenotype variance-covariance matrix (
R language
) - Remove the covariance structure from the phenotypes and fixed effect covariates (
R language
) - Fit the null and complete models for each SNP, and compute a p-value using a likelyhood ratio test (
R language
) - Fit the model and compute p-values for each permutation of the genotypes (
R language
) - Compute the significance threshold using the Westfall–Young minP approach (
R language
) - Make the final plots (
ggplot2
):- Manhattan plot of the associations
- Quantile-quantile plots
- Heatmap of the relatedness matrices (
ComplexHeatmap
)
- Permutations are reproducible since the permutation index is used as a random seed
Plink2
used wherever possible- Model fitting done by loading in memory only one SNP at a time directly from the
pgen
file and using low-level internal R routines to increase performance - Can test for dominance and interactions of the genotype with fixed covariates (for example, to test for GxE and GxG)
- Can standardize or quantile-normalize the phenotypes
- Can include quantitative and/or categorical covariates
- Permutations can be done within subgroups specified by a categorical covariate
- This is useful when distinct sub-populations are present, such as in the case of multiple F2 crosses
- Quantitative phenotypes (case-control studies)
The central concept of this pipeline is that of testing weather a certain statistical model (the 'real model') is significantly better than another statistical model (the 'null model').
The ratio of likelyhoods of the 2 models follows a
In this pipeline both models are supplied by the user in the form of R formulas using the parameters null_model_formula
and model_formula
.
In the formulas is possible to refer to the genotype of a given SNP with the term x
, and to the phenotype with the term y
.
An intercept is automatically included but can be removed by adding the term 0
or -1
to the models.
Derived quantities such as a dominance term (i.e. x==1
for genotypes encoded as 0,1,2) can be used but must be enclosed in parenthesis.
Names of the columns in the covariate and quantitative covariate file can also be used.
Consult the parameter documentation to see how these files should be formatted.
A valid real model and null model formula pair is for example:
model_formula: y ~ x + (x==1) + x:cov1 + cov1
null_model_formula: y ~ cov1
Here, cov1
is the name of one of the columns of the file specified via --covar
or --qcovar
.
The p-value that you will obtain in this case will indicate if any of the terms x
, (x==1)
, and x:cov1
have an effect significantly different from 0, when an intercept and cov1
are accounted for.
The following formulas instead are NOT valid:
model_formula: y ~ x + (x==1) + x:cov1
null_model_formula: y ~ cov1
This is because the null model contains the term cov1
which is not present in the model, and thus the formulas are not nested.
This pipeline fits a Leave-One-Chromosome-Out (LOCO) mixed model using the 'null model' terms as fixed effects and the realised relatedness matrix as correlation matrix to estimate the variance components. The variance components are then used to estimate the variance-covariance matrix of the phenotypes. The square root of the variance-covariance matrix is determined via Cholesky decomposition, and this decomposition is used to rotate the response vector and the design matrix for each 'real model' before running Ordinary Least Squares (OLS). Except for the fact that the fixed effect SNPs are not included in the variance components estimation (for performance reasons), fitting OLS to the rotated response and design matrix is mathematically equivalent to fitting Generalised Least Squares (i.e. fitting a mixed model) to the original response and design matrix. This is a fairly standard approach used for example here.
Permutations are run on the genotype vectors jointly, so that each genotype is permuted in the same way and linkage disequilibrium is maintained. Since phenotypes and covariates are not permuted, also their relationship is not altered. Genotype permutations are performed AFTER the Cholesky rotation, so that the relatedness structure is regressed out from the correct samples and only the residuals from this operation are permuted. This corresponds to a null hypothesis where the exchangeable quantities are the genotype identities when relatedness is accounted already for. See here for an example of this approach being used in practice.
Significance thresholds for a given nominal significance level are reported using Bonferroni correction and Westfall–Young permutations (see here).
If
Integration with birneylab/stitchimpute
In order to use a vcf file obtained from the birneylab/stitchimpute pipeline, activate the stitch
profile with the flag -profile stitch
.
This correctly loads the dosage information and fills missing genotypes.
For ease of use, the ideal settings for stitch for medaka samples have been specified in a profile called medaka
.
This can be activated with the flag -profile medaka
.
Always use this profile when working with medaka samples.
If the genotypes that you are using have been obtained with the birneylab/stitchimpute pipeline, you should also use the stitch
profile.
In this case the flag to be used is -profile medaka,stitch
Note If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with
-profile test
before running the workflow on actual data.
Since just a few files are required to run this pipeline, differently from other pipelines a samplesheet is not used. Instead, the required files are all specified with dedicated parameters.
You can run the pipeline using:
nextflow run birneylab/stitchimpute \
-profile <docker/singularity/.../institute> \
--vcf input.vcf.gz \
--pheno input.pheno \
--model_formula 'y ~ x' \
--null_model_formula 'y ~ 1' \
--outdir <OUTDIR>
Warning: Please provide pipeline parameters via the CLI or Nextflow
-params-file
option. Custom config files including those provided by the-c
Nextflow option can be used to provide any configuration except for parameters; see docs.
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Warning: It is highly recommended to use the docker or singularity profile. Some processes do not have a working conda configuration.
For more details about the output files and reports, please refer to the output documentation.
birneylab/flexlmm was originally written by Saul Pierotti.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md
file.
You can cite the nf-core
publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.