Masking sample-wise #25

capoony · 2024-06-28T10:43:24Z

Hi Lucas,

me again. Another feature that would be very useful is sample-specific masking. Global masking is useful for masking TEs etc. However, individual libraries may differ in read depth, which may require more specific masking for the individual samples.

In our case, we have individual BED/MASK files (in FASTA format) for each sample. Currently, I need to split the input sample-wise and run grenedalf for each sample separately, which is quite an effort for >700 samples.

Is there a more elegant way to do that, e.g. by reading the BED files for each sample first and creating a matrix with window-wise masks for each sample, which can then be used to calculate averages?

Cheers, Martin

lczech · 2024-07-22T10:31:02Z

Hi Martin,

thanks for your patience, now getting back to working on grenedalf.

Great suggestion, and I'll get to implementing this soon. I think a potential solution could be as follows: The masking as it is right now is merely another filter, where masked positions are not used in the downstream statistics computation. Any non-masked position, however, also undergoes any additional filters first (numerical etc., whatever the user provided), and then whatever remains after that is used for the statistic. That logic can easily be extended to per-sample masking as well, by simply having the mask do the same as the current global mask does, but on a per-sample basis. Any positions in a sample for which the sample mask tells us not to use the position are filtered out, and any that are not masked will then undergo all additional filters, and again, whatever remains after that will be used for the statistic. I think that would solve this feature request, right?
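
To make the order of operations concrete, here is a minimal Python sketch of that logic. It is only an illustration and not grenedalf's code: the function name `positions_used`, the representation of a mask as a set of positions, and the `passes_filters` callback are all hypothetical.

```python
from typing import Callable, Dict, List, Set


def positions_used(
    positions: List[int],
    sample_masks: Dict[str, Set[int]],
    passes_filters: Callable[[str, int], bool],
) -> Dict[str, List[int]]:
    """Return, per sample, the positions that survive mask and filters.

    Hypothetical sketch: a mask is modelled as the set of masked positions
    for a sample, and `passes_filters` stands in for all additional
    (e.g. numerical) filters that the user provided.
    """
    used: Dict[str, List[int]] = {}
    for sample, masked in sample_masks.items():
        used[sample] = [
            pos
            for pos in positions
            # Step 1: the per-sample mask removes the position outright.
            if pos not in masked
            # Step 2: only non-masked positions reach the other filters.
            and passes_filters(sample, pos)
        ]
    return used


# Made-up example: two samples with different masks, and a stand-in
# filter that rejects position 5 in every sample.
masks = {"S1": {2, 3}, "S2": {4}}
result = positions_used([1, 2, 3, 4, 5], masks, lambda sample, pos: pos != 5)
```

Whatever ends up in `result` for a sample is what would go into the statistic for that sample.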

As for how to provide that: How about a simple two-column table file, mapping from sample name to mask file? That seems a bit easier than having users construct a matrix from their masks first.
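
As a rough idea of what such a table could look like and how it might be read, here is a small Python sketch. The tab delimiter, the `#` comment lines, the sample names, the file paths, and the function name `parse_sample_mask_table` are assumptions for illustration only; the format that grenedalf ends up accepting may differ.

```python
import csv
import io
from typing import Dict


def parse_sample_mask_table(text: str) -> Dict[str, str]:
    """Parse a two-column table mapping sample name to mask file path.

    Assumed format (hypothetical): tab-separated, one sample per line,
    blank lines and lines starting with '#' are skipped.
    """
    mapping: Dict[str, str] = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        sample, mask_file = row[0].strip(), row[1].strip()
        if sample in mapping:
            raise ValueError(f"duplicate sample name: {sample}")
        mapping[sample] = mask_file
    return mapping


# Made-up sample names and paths.
table = "# sample\tmask\nS1\tmasks/S1.bed\nS2\tmasks/S2.bed\n"
sample_to_mask = parse_sample_mask_table(table)
```

Rejecting duplicate sample names early seems safer than silently letting a later line override an earlier one, in particular with several hundred samples.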

Lastly, I think all of that is independent of your other request (#24), which is about the window averaging. So, any masking per sample can be applied here first, and then the window average will be done on a global basis, so that it's the same denominator to get the window average for all samples. Or do you think having a per-sample denominator is needed as well? That would make it considerably more complex though, as in the case of FST, that would need to be a per-sample-pair denominator, instead of a per-sample one.
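
To illustrate the difference between these denominators, here is a small Python sketch. It assumes, purely for illustration, that the global denominator is the number of positions in the window, that a per-sample denominator counts the positions not masked in that sample, and that a per-pair denominator counts the positions not masked in either sample of the pair. The function name and data layout are hypothetical, and other filters are ignored.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def window_denominators(
    window_positions: Iterable[int],
    sample_masks: Dict[str, Set[int]],
) -> Tuple[int, Dict[str, int], Dict[Tuple[str, str], int]]:
    """Compute global, per-sample, and per-sample-pair denominators.

    Hypothetical sketch: masks are sets of masked positions per sample.
    """
    window = set(window_positions)

    # Global: the same denominator for every sample.
    global_denominator = len(window)

    # Per sample: positions in the window that the sample does not mask.
    per_sample = {
        sample: len(window - masked) for sample, masked in sample_masks.items()
    }

    # Per pair (as FST would need): positions masked in neither sample.
    per_pair = {
        (a, b): len(window - sample_masks[a] - sample_masks[b])
        for a, b in combinations(sorted(sample_masks), 2)
    }
    return global_denominator, per_sample, per_pair


# Made-up window of 10 positions with two per-sample masks.
glob, per_sample, per_pair = window_denominators(
    range(1, 11), {"A": {1, 2}, "B": {2, 3, 4}}
)
```

With n samples, the per-pair variant needs n(n-1)/2 denominators per window instead of a single one, which is where the added complexity comes from.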

So long
Lucas

lczech · 2024-08-02T01:12:01Z

Hi Martin @capoony,

I just released grenedalf v0.6.0 which implements all of the above features. Let me know if this works for you, or if this does not solve your use case :-)

Cheers
Lucas

lczech added a commit that referenced this issue Aug 1, 2024

Add sample mask options #25

fec44b4

lczech closed this as completed Aug 2, 2024
