
filter-sample-max-read-depth options #36

Open
ribeirots opened this issue Oct 18, 2024 · 3 comments

Comments

@ribeirots

Would it be possible to have the max read depth be a percentile? For example, set it to 0.98 to exclude reads with depth in the top 2% highest coverage. Perhaps, if the value passed to the parameter is between 0 and 1 it could take it as percentile by default. Alternatively, could I pass different integers for each sample, similar to how I can pass different pool sizes? This way I could calculate the percentile myself before running the program.

@lczech
Owner

lczech commented Oct 21, 2024

Hi @ribeirots,

thanks for the suggestion - let's see if I am getting this correctly:

Would it be possible to have the max read depth be a percentile? For example, set it to 0.98 to exclude reads with depth in the top 2% highest coverage. Perhaps, if the value passed to the parameter is between 0 and 1 it could take it as percentile by default.

I am not sure that I understand exactly what you mean by "exclude reads with depth in the top 2% highest coverage" - top 2% of what exactly? As far as I can make sense of this, you would like a threshold per sample, across all its positions. This would require a complete pass over the data first to figure out the max read depth per sample, setting a 98% threshold based on that, and then reading the data again to apply the filter and compute the actual statistics of interest. That double reading seems like a rather large downside, but I can see this generally being a useful way to specify filters.

Alternatively, could I pass different integers for each sample, similar to how I can pass different pool sizes? This way I could calculate the percentile myself before running the program.

That is definitely more efficient, as it would not require two passes through the data - well, it would once initially to calculate the percentiles for each sample, but not on every subsequent run of grenedalf, for instance when experimenting with the effect of other settings. So that definitely sounds interesting, and might also be applicable to the other per-sample filter settings. I can see some potential for accidental issues though, as the filtering would then not be consistent between samples. Maybe @jeff_spence has some thoughts on this?

Please let me know if this fits with what you had in mind. Unfortunately, I am a bit short on time for working on grenedalf at the moment, but we can keep this issue open for when I get back to it. So I am not sure that this will be available any time soon. In the meantime, you can achieve the same result by pre-filtering your data manually: use the frequency command to get the per-sample read depths, and then run the sync command for each sample to apply a per-sample max threshold.
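As a rough sketch of the compute step in that workaround (assuming the per-sample read depths have already been exported to a TSV; the file name and the ".DEPTH" column suffix are placeholders, not the actual output format of the frequency command):

```python
# Sketch only: compute a per-sample 98th percentile read depth threshold
# from an exported depth table. The file name and the ".DEPTH" column
# suffix are placeholders and will differ from the actual frequency output.
import pandas as pd

df = pd.read_csv("frequency.tsv", sep="\t")
depth_cols = [c for c in df.columns if c.endswith(".DEPTH")]

for col in depth_cols:
    sample = col[: -len(".DEPTH")]
    threshold = int(df[col].quantile(0.98))
    # Use this value as the max read depth cutoff for that sample,
    # e.g. when running sync with a per-sample filter afterwards.
    print(f"{sample}\t{threshold}")
```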

Hope that helps, cheers
Lucas

@ribeirots
Author

Thank you, Lucas. Yes, I think you understood my suggestion correctly. It is a feature that is also implemented in PoPoolation - that is where I first saw this kind of filter. For now, I am pre-filtering manually to achieve the same effect.
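In case it is useful, my manual pre-filter is roughly the following (a sketch only; the "A:T:C:G:N:del" count format and the threshold values are placeholders for illustration):

```python
# Sketch of a manual pre-filter: for each sample, mask positions whose
# read depth exceeds that sample's precomputed 98th percentile threshold.
# The sync-style count field and the thresholds below are placeholders.
thresholds = {"sample1": 412, "sample2": 389}  # e.g. 98th percentiles computed beforehand

def mask_high_depth(count_field: str, threshold: int) -> str:
    counts = [int(x) for x in count_field.split(":")]
    depth = sum(counts)
    # Zero out the sample's counts if the position is deeper than the threshold.
    return count_field if depth <= threshold else ":".join("0" for _ in counts)

print(mask_high_depth("10:200:5:300:0:0", thresholds["sample1"]))  # masked (depth 515)
print(mask_high_depth("10:20:5:30:0:0", thresholds["sample1"]))    # kept (depth 65)
```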

@lczech
Owner

lczech commented Oct 21, 2024

Hi @ribeirots,

all right, I just checked how PoPoolation2 is doing this - they indeed read the whole input data just to obtain those thresholds when a percentage is given for --max-coverage. As said, that is rather wasteful when large files have to be processed multiple times, so I think the way to go is an option to specify separate values for each sample, along with a command to compute them.

Also, PoPoolation2 has a percentage option for --min-count, but that seems to operate on a per-position basis, so it would not need two passes through the data. The --min-coverage, however, is always applied to all samples, and without percentages. I am not quite sure whether that is a statistical choice to achieve consistency across samples though.
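Per position, such a fractional check would just be something along these lines (a sketch of the general idea only, not necessarily PoPoolation2's exact rule):

```python
# Sketch of a per-position fractional min-count check: the allele count is
# compared against a fraction of the read depth at that same position, so
# no second pass over the data is needed.
def passes_min_count(allele_count: int, depth: int, min_count_fraction: float) -> bool:
    return allele_count >= min_count_fraction * depth

print(passes_min_count(allele_count=3, depth=100, min_count_fraction=0.05))  # False
print(passes_min_count(allele_count=8, depth=100, min_count_fraction=0.05))  # True
```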

Let's see. Right now I am a bit short on time, but I'll hopefully be able to implement this eventually. It seems useful!

Thanks again
Lucas
