Sub-pooling suggestions #33

Open
alexis-sedg opened this issue Oct 9, 2024 · 3 comments

@alexis-sedg

Hello,

I'm interested in using this pipeline for pooled data. However, when we designed the study, we used the sub-pooling method recommended by CRISP (variant caller). So instead of having a single BAM file for a given population, I have multiple. I assume I could use the downstream VCF or mpileup as input to your pipeline, but I'd prefer to use the BAMs. Is there a way to use multiple BAMs for a given sample population in the Grenedalf pipeline?

Thank you,
Alexis

@lczech
Owner

lczech commented Oct 10, 2024

Hi Alexis @alexis-sedg,

interesting approach! Do you have a reference or link to that sub-pooling procedure? Why is that the recommendation for that variant caller?

Yes, grenedalf can do that, using the --sample-group-merge-table option that is provided for most of the commands. You could also merge the bam files into one bam per sample, e.g., with samtools merge if you want that instead. As you say, working on downstream VCFs or mpileups is not the best approach - VCFs are not well suited for pooled data in the first place, and pileup is just a waste of disk space as far as grenedalf is concerned.
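For illustration, a minimal sketch of both routes. The `samtools merge` call is standard; the layout shown for the group table is only an assumed simple sample-to-group mapping (and the file names are made up), so check the grenedalf documentation of `--sample-group-merge-table` for the exact format it expects:

```sh
# Route 1: merge the sub-pool BAMs into one BAM per population, then index it.
samtools merge population_A.bam subpool_A1.bam subpool_A2.bam subpool_A3.bam
samtools index population_A.bam

# Route 2: keep the sub-pool BAMs separate and let grenedalf group them.
# Assumed table layout (groups.tsv): one sample per line, mapped to its group.
#   subpool_A1    population_A
#   subpool_A2    population_A
#   subpool_B1    population_B
# Then pass the table to the command you are running, e.g. (other option
# names here are from memory, please verify against the docs):
#   grenedalf diversity --sam-path subpools/ --sample-group-merge-table groups.tsv ...
```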

Hope that helps, so long
Lucas

@alexis-sedg
Author

Hi Lucas,

Yeah, happily! The paper it's from is "A statistical method for the detection of variants from next-generation resequencing of DNA pools"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881398/
or alternatively the GitHub
https://github.com/vibansal/crisp/tree/master

My understanding is that they use the comparison between multiple replicate pools of the same population to distinguish sequence errors from rare alleles. The explanation they provided was:
"In the absence of a variant, the frequency of the reads with a nucleotide different from the reference base at a particular position should be similar across multiple pools. The intuition being that sequencing errors, especially those that depend upon the local sequence context, are likely to be shared across reads in multiple pools. In contrast, presence of a rare variant in a pool is expected to result in an excess of reads with the alternate allele as compared with the other pools. We use a contingency table approach to compute a P-value for the null hypothesis in the absence of a SNP (see [Fig. 1] for an illustration of this idea)."

Excellent, I'll give the merge function a go! Read quality and quantity are variable across my data, even between samples from the same groups. Are there any additional considerations or recommendations for dealing with that variability, or is it alright to run the merge function as is?

Thanks for your time,
Alexis

@lczech
Owner

lczech commented Oct 21, 2024

Hi Alexis,

thanks for the details! Hm, interesting approach - however, I am not quite sure that the following is true in general:

The intuition being that sequencing errors, especially those that depend upon the local sequence context, are likely to be shared across reads in multiple pools.

That seems to be highly dependent on the sequencing technology being used, in the sense that (as far as I am aware as a non-wet-lab person) not all instruments have error profiles with dependencies on local sequence context. But well, if that method makes sense for your data, it seems reasonable :-)

We did in fact also work out the math for accounting for base qualities in the grenedalf statistical supplement; see the first of the two supplemental PDFs here. This is, however, currently not implemented, but at least in theory it can be done. Just wanted to point it out to you in case that helps.

Also, CRISP seems to produce VCFs, but you said you are using BAMs? What is the process there?

As for the merge option: There is a caveat if you are using the diversity command, as its runtime depends on the read depth of the samples. When merging samples, the depths can of course be way higher than what you would get from single sequencing runs, so it can make sense to use subsampling as described here. If you subsample to a few thousand reads, that should not significantly change the results, but it speeds up the run quite a bit. But maybe play around with that a bit to be sure that it does what you expect!
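As a sketch of what that could look like (all option names and values below are assumptions from memory, so verify them against the linked documentation before relying on them):

```sh
# Cap the per-sample read depth before computing diversity, so the runtime
# does not blow up on the merged, high-depth samples.
# NOTE: option names here are assumptions; check the grenedalf docs.
grenedalf diversity \
    --sam-path merged_bams/ \
    --subsample-max-read-depth 2000 \
    --subsample-method subsample-without-replacement \
    --out-dir diversity_out
```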

You also asked about considerations wrt the variability - what do you mean there?

Hope that helps, so long
Lucas
