Enumerate cohort-specific QC #12

Open
hammer opened this issue Feb 19, 2020 · 0 comments
hammer commented Feb 19, 2020

It has become clear from working through the GWAS tutorial, the UKBB paper, and discussions with @marivascruz that QC can be separated into standard tasks that can be applied to most cohorts and cohort-specific tasks. Importantly, some cohort-specific QC must be discovered through exploratory analysis of each cohort.

It seems that PLINK is well suited to the standard QC steps, while a tool like Hail is most useful for discovering and implementing cohort-specific QC.

To better motivate the work we're doing in this repo, it would be useful to enumerate each category of QC task. It would be particularly helpful to have concrete examples of cohort-specific QC discovery, such as the ScatterShot-derived metric @marivascruz discussed on our call today.

In particular, @cseed has pointed us to the gnomAD team's QC efforts, and @marivascruz mentioned the BBJ did some cohort-specific QC.

For the BBJ, I've found "The Biobank Japan project genotype data" and the methods section of "Genome-wide association study identifies 112 new loci for body mass index in the Japanese population" (2017). I couldn't find much useful information in either document; here's their description of their GWAS QC:

For the quality control of GWAS, we excluded samples with a call rate ≤0.98. Closely related samples, which were estimated using identity by state (IBS), were excluded by visual inspection. We performed principal component analysis (PCA) for genotype using an in-house program based on the algorithm implemented by smartpca (ref. 49), and we excluded outliers from the East Asian cluster. Finally, we calculated the Z-score for height by linear regression using age, sex, status of 47 diseases, and the top 10 principal components (PCs) and excluded individuals outside of ±4 s.d. for the purpose of quality control of the phenotype data.
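To make two of the steps quoted above concrete, here is a rough NumPy sketch of the sample call-rate filter and the residual Z-score filter. It is not the BBJ pipeline: the function names, the `-1` encoding for missing genotypes, and the choice to fit the regression only on samples that pass the call-rate filter are all assumptions made for illustration. The IBS relatedness and PCA outlier steps are omitted because the paper describes them as visual or in-house procedures.

```python
import numpy as np


def sample_call_rate(genotypes):
    """Fraction of non-missing calls per sample.

    `genotypes` is a (samples x variants) integer array; missing calls are
    assumed to be encoded as -1 (a convention chosen for this sketch).
    """
    return (genotypes >= 0).mean(axis=1)


def residual_z_scores(phenotype, covariates):
    """Standardized residuals of an ordinary least-squares fit.

    `covariates` is a (samples x k) array, e.g. age, sex, disease status
    indicators and top principal components.
    """
    # Add an intercept column, then fit by least squares.
    X = np.column_stack([np.ones(len(phenotype)), covariates])
    beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
    resid = phenotype - X @ beta
    # Scale by the residual standard deviation, using n - p degrees of freedom.
    return resid / resid.std(ddof=X.shape[1])


def qc_keep_mask(genotypes, phenotype, covariates,
                 min_call_rate=0.98, max_abs_z=4.0):
    """Boolean mask of samples that survive both filters."""
    # The quoted text excludes call rate <= 0.98, so keep strictly greater.
    keep = sample_call_rate(genotypes) > min_call_rate

    # Fit the phenotype model only on samples that passed the first filter.
    z = np.full(len(phenotype), np.nan)
    z[keep] = residual_z_scores(phenotype[keep], covariates[keep])

    # Samples with NaN z (already excluded) compare False and stay excluded.
    keep &= np.abs(z) <= max_abs_z
    return keep
```

A mask like this could then be used to subset the sample axis of a genotype matrix before association testing; the thresholds are parameters so that cohort-specific values discovered through exploration can be swapped in.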
