Masking sample-wise #25

Closed
capoony opened this issue Jun 28, 2024 · 2 comments

Comments


capoony commented Jun 28, 2024

Hi Lucas,

Me again. Another feature that would be very useful is sample-specific masking. Global masking is useful for masking TEs etc. However, individual libraries may differ in read depth, which may require more specific masking for the individual samples.

In our case, we have individual BED/MASK files (in FASTA format) for each sample, and currently I need to split the input by sample and run grenedalf for each sample separately, which is quite an effort for >700 samples.

Is there a more elegant way to do that, e.g. by reading the BED files for each sample first and creating a matrix with window-wise masks for each sample, which can then be used to calculate averages?
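
To make the matrix idea concrete, here is a minimal sketch of how per-sample BED intervals could be turned into window-wise counts of masked bases. It is only an illustration, not how grenedalf works internally: the function name `masked_per_window`, the fixed window size, and the example intervals are all hypothetical, and the intervals are assumed to be non-overlapping and in the usual BED convention (0-based, half-open).

```python
def masked_per_window(bed_intervals, chrom_length, window_size):
    """Count masked bases per fixed-size window on one chromosome.

    bed_intervals: iterable of (start, end) pairs, 0-based and half-open,
    assumed not to overlap each other.
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for start, end in bed_intervals:
        end = min(end, chrom_length)
        pos = start
        while pos < end:
            window = pos // window_size
            # Stop at the window boundary or the interval end, whichever is first.
            chunk_end = min((window + 1) * window_size, end)
            counts[window] += chunk_end - pos
            pos = chunk_end
    return counts


# Hypothetical per-sample masks on a 30 bp chromosome, 10 bp windows.
sample_intervals = {
    "S1": [(5, 15)],
    "S2": [(0, 3), (25, 30)],
}
mask_matrix = {
    sample: masked_per_window(intervals, chrom_length=30, window_size=10)
    for sample, intervals in sample_intervals.items()
}
```

Each row of `mask_matrix` then holds, per window, how many positions that sample's mask removes.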

Cheers, Martin

lczech (Owner) commented Jul 22, 2024

Hi Martin,

thanks for your patience, now getting back to working on grenedalf.

Great suggestion, and I'll get to implementing this soon. I think a potential solution could be as follows: The masking as it is right now is merely another filter, where masked positions are not used in the downstream statistics computation. Any non-masked position, however, also undergoes any additional filters first (numerical etc., whatever the user provided), and then whatever remains after that is used for the statistic. That logic can easily be extended to per-sample masking as well, by simply having the mask do the same as the current global mask does, but on a per-sample basis. Any positions in a sample for which the sample mask tells us not to use the position are filtered out, and any that are not masked will then undergo all additional filters, and again, whatever remains after that will be used for the statistic. I think that would solve this feature request, right?
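
The order of operations described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not grenedalf's actual code: `SampleCounts`, `passes_filters`, and the `min_depth` threshold are hypothetical names, and the mask is represented as a plain set of (chromosome, position) pairs per sample.

```python
from dataclasses import dataclass


@dataclass
class SampleCounts:
    """Read depth of one sample at one position (hypothetical, simplified)."""
    depth: int


def passes_filters(sample, chrom, pos, counts, sample_masks, min_depth=10):
    """Return True if this sample's position should enter the statistic."""
    # Step 1: the per-sample mask acts like any other filter.
    # Masked positions are dropped before anything else is checked.
    if (chrom, pos) in sample_masks.get(sample, set()):
        return False
    # Step 2: unmasked positions still have to pass the remaining filters
    # (here only a hypothetical minimum read depth).
    return counts.depth >= min_depth


# Position 2L:100 is masked in S1 only.
sample_masks = {"S1": {("2L", 100)}, "S2": set()}

s1_used = passes_filters("S1", "2L", 100, SampleCounts(50), sample_masks)
s2_used = passes_filters("S2", "2L", 100, SampleCounts(50), sample_masks)
s2_low_depth = passes_filters("S2", "2L", 100, SampleCounts(5), sample_masks)
```

Here the same position is excluded for S1 because of its mask, kept for S2, and excluded for S2 only when its depth falls below the threshold.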

As for how to provide that: How about a simple two-column table file, mapping from sample name to mask file? That seems a bit easier than having users construct a matrix from their masks first.
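
For illustration, such a table could be read along these lines. The tab delimiter, the `#` comment handling, the file paths, and the function name `read_sample_mask_table` are assumptions made for this sketch, not the format grenedalf actually expects.

```python
import csv
import io


def read_sample_mask_table(handle):
    """Parse a two-column table: sample name, then path to its mask file."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        sample, mask_path = row
        if sample in mapping:
            raise ValueError(f"duplicate sample name: {sample}")
        mapping[sample] = mask_path
    return mapping


# Hypothetical table contents; in practice this would be an opened file.
example = io.StringIO("S1\tmasks/S1.bed\nS2\tmasks/S2.bed\n")
sample_to_mask = read_sample_mask_table(example)
```

Rejecting duplicate sample names early avoids silently applying the wrong mask when the table was assembled by script for hundreds of samples.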

Lastly, I think all of that is independent of your other request (#24), which is about the window averaging. So, any masking per sample can be applied here first, and then the window average will be done on a global basis, so that the same denominator is used to get the window average for all samples. Or do you think having a per-sample denominator is needed as well? That would make it considerably more complex though, as in the case of FST, that would need to be a per-sample-pair denominator instead of a per-sample one.
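
To illustrate the difference between the three kinds of denominator, here is a small sketch on a made-up 10 bp window. All positions and masks are hypothetical, and taking the global denominator as the number of positions not removed by the global mask is an assumption of this example.

```python
from itertools import combinations

window = range(1, 11)              # positions 1..10 of one window
global_valid = set(window) - {3}   # position 3 is removed by the global mask

# Positions that survive each sample's own mask and filters.
valid = {
    "S1": global_valid - {4, 5},
    "S2": global_valid - {5, 6, 7},
}

# One denominator shared by all samples.
global_denominator = len(global_valid)

# One denominator per sample, e.g. for a single-sample statistic.
per_sample_denominator = {s: len(v) for s, v in valid.items()}

# For a two-sample statistic such as FST, a position only counts
# if it is valid in both samples of the pair.
per_pair_denominator = {
    (a, b): len(valid[a] & valid[b])
    for a, b in combinations(sorted(valid), 2)
}
```

With many samples, the per-pair variant needs one denominator for every pair of samples in every window, which is where the extra complexity comes from.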

So long
Lucas

lczech added a commit that referenced this issue Aug 1, 2024

lczech (Owner) commented Aug 2, 2024

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers
Lucas

lczech closed this as completed Aug 2, 2024