Sub-pooling suggestions #33

Open
alexis-sedg opened this issue Oct 9, 2024 · 3 comments

@alexis-sedg

Hello,

I'm interested in using this pipeline for pooled data. However, when we designed the study, we used the sub-pooling method recommended by CRISP (variant caller). So instead of having a single BAM file for a given population, I have multiple. I assume I could use the downstream VCF or mpileup as input to your pipeline, but I'd prefer to use the BAMs. Is there a way to use multiple BAMs for a given sample population in the Grenedalf pipeline?

Thank you,
Alexis

@lczech
Owner

lczech commented Oct 10, 2024

Hi Alexis @alexis-sedg,

interesting approach! Do you have a reference or link to that sub-pooling procedure? Why is that the recommendation for that variant caller?

Yes, grenedalf can do that, using the --sample-group-merge-table option that is provided for most of the commands. You could also merge the bam files into one bam per sample, e.g., with samtools merge if you want that instead. As you say, working on downstream VCFs or mpileups is not the best approach - VCFs are not well suited for pooled data in the first place, and pileup is just a waste of disk space as far as grenedalf is concerned.
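For illustration, a minimal sketch of both routes. The `samtools merge` call is standard; the layout shown for the group table is only an assumed simple sample-to-group mapping (and the file names are made up), so check the grenedalf documentation of `--sample-group-merge-table` for the exact format it expects:

```sh
# Route 1: merge the sub-pool BAMs into one BAM per population, then index it.
samtools merge population_A.bam subpool_A1.bam subpool_A2.bam subpool_A3.bam
samtools index population_A.bam

# Route 2: keep the sub-pool BAMs separate and let grenedalf group them.
# Assumed table layout (groups.tsv): one sample per line, mapped to its group.
#   subpool_A1    population_A
#   subpool_A2    population_A
#   subpool_B1    population_B
# Then pass the table to the command you are running, e.g. (other option
# names here are from memory, please verify against the docs):
#   grenedalf diversity --sam-path subpools/ --sample-group-merge-table groups.tsv ...
```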

Hope that helps, so long
Lucas

@alexis-sedg
Author

Hi Lucas,

Yeah, happily! The paper it's from is "A statistical method for the detection of variants from next-generation resequencing of DNA pools"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881398/
or alternatively the GitHub
https://github.com/vibansal/crisp/tree/master

My understanding is that they use the comparison between multiple replicate pools of the same population to distinguish sequence errors from rare alleles. The explanation they provided was:
"In the absence of a variant, the frequency of the reads with a nucleotide different from the reference base at a particular position should be similar across multiple pools. The intuition being that sequencing errors, especially those that depend upon the local sequence context, are likely to be shared across reads in multiple pools. In contrast, presence of a rare variant in a pool is expected to result in an excess of reads with the alternate allele as compared with the other pools. We use a contingency table approach to compute a P-value for the null hypothesis in the absence of a SNP (see [Fig. 1] for an illustration of this idea)."

Excellent, I'll give the merge function a go! Read quality and quantity are variable across my data, even between samples from the same groups. Are there any additional considerations or recommendations for dealing with that variability, or is it alright to run the merge function as is?

Thanks for your time,
Alexis

@lczech
Owner

lczech commented Oct 21, 2024

Hi Alexis,

thanks for the details! Hm, interesting approach - however, I am not quite sure that the following is true in general:

The intuition being that sequencing errors, especially those that depend upon the local sequence context, are likely to be shared across reads in multiple pools.

That seems to be highly dependent on the sequencing technology being used, in the sense that (as far as I am aware as a non-wet-lab person) not all instruments have error profiles with dependencies on local sequence context. But well, if that method makes sense for your data, it seems reasonable :-)

We did in fact also work out the math for accounting for base qualities in the grenedalf statistical supplement; see the first of the two supplemental PDFs here. This is, however, currently not implemented, but at least in theory it can be done. Just wanted to point it out to you in case that helps.

Also, CRISP seems to produce VCFs, but you said you are using BAMs? What is the process there?

As for the merge option: There is a caveat if you are using the diversity command, as its runtime depends on the read depth of the samples. When merging samples, the depths can of course be way higher than what you would get from single sequencing runs, so it can make sense to use subsampling as described here. If you subsample to a few thousand reads, that should not significantly change the results, but it speeds up the run quite a bit. But maybe play around with that a bit to be sure that it does what you expect!
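As a sketch of what that could look like (all option names and values below are assumptions from memory, so verify them against the linked documentation before relying on them):

```sh
# Cap the per-sample read depth before computing diversity, so the runtime
# does not blow up on the merged, high-depth samples.
# NOTE: option names here are assumptions; check the grenedalf docs.
grenedalf diversity \
    --sam-path merged_bams/ \
    --subsample-max-read-depth 2000 \
    --subsample-method subsample-without-replacement \
    --out-dir diversity_out
```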

You also asked about considerations wrt the variability - what do you mean there?

Hope that helps, so long
Lucas
