Using the CRAM readers, is there a way to read only blocks from specific data series #239
Replies: 2 comments 5 replies
-
Selection of data series is not supported in noodles-cram. It's not apparent in your diagram, but the main issue is that data series are not guaranteed to be independent. See § 8.2.1 "Data blocks" (2023-05-22):
This creates a constraint that naturally transposes the data into row-based records. It becomes even more complex if the reads are segmented, as some fields are required to be decoded to fully resolve all the records in a slice. Sorry, I don't have a good solution to this problem.
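To make the dependency concrete, here is a minimal sketch in Python of one such coupling. It assumes a toy slice in which the FN data series (number of read features per record) and the FP data series (feature positions) have already been decoded into flat lists. The function name and the sample values are hypothetical and are not part of noodles-cram or htslib.

```python
from itertools import islice


def group_features_by_record(fn_series, fp_series):
    """Split a flat FP column into per-record lists using the FN column.

    fn_series: number of read features for each record (FN).
    fp_series: flat list of feature positions for the whole slice (FP).
    """
    fp_iter = iter(fp_series)
    records = []
    for count in fn_series:
        chunk = list(islice(fp_iter, count))
        if len(chunk) != count:
            # FN promised more features than FP contains.
            raise ValueError("FP column ended before FN counts were satisfied")
        records.append(chunk)
    leftover = list(fp_iter)
    if leftover:
        # FP holds values that no record claims.
        raise ValueError(f"{len(leftover)} FP values not claimed by any record")
    return records


# Hypothetical decoded columns for a three-record slice.
fn = [2, 0, 1]
fp = [5, 9, 3]
print(group_features_by_record(fn, fp))  # [[5, 9], [], [3]]
```

Read on its own, the FP column is just `[5, 9, 3]`. Attributing those values to records requires FN, so "select only FP" still means decoding at least one other series.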
-
As the author of the htslib (originally io_lib)
This means simply returning a SAM column isn't entirely trivial, but you could fake up a column by decoding only the bits you want and stitching things together. Returning a CRAM column is obviously much easier and something the format could return (provided it isn't multiplexed with something else), but it's sometimes of limited use. So that background aside, how does
All of this is highly complex, but it is implemented in https://github.com/samtools/htslib/blob/develop/cram/cram_decode.c#L542-L868. The original version of this came from Staden io_lib, and at some point at least I had a hacked-up encoder that could do random assignment of encodings and destinations, changing on every container. So I was fairly confident back then at least that it was robust. I'm less confident now that it hasn't bit-rotted and had errors creep in, but the flip side is that all modern CRAMs avoid cross-contamination of data series by using discrete blocks for each one. If I ever do a CRAM v4, I think this ought to be mandated (and the CORE block removed too), so this sort of logic doesn't have to be excessively tortuous.
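As a rough illustration of why discrete blocks make column access simpler, here is a toy model in Python. It is not the CRAM byte layout: the varint below is a LEB128-style stand-in rather than CRAM's ITF8, and the two "series" A and B, the function names, and the sample bytes are all hypothetical.

```python
def read_varint(buf, pos):
    """Decode a LEB128-style unsigned varint (a stand-in, not CRAM's ITF8).

    Returns (value, position of the next unread byte).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def column_b_from_shared_block(block, n_records):
    """Series A (varint) and series B (one byte) interleaved in one block."""
    pos = 0
    b_values = []
    for _ in range(n_records):
        # A must be decoded even though it is discarded: its length is
        # variable, so it is the only way to find where B starts.
        _a, pos = read_varint(block, pos)
        b_values.append(block[pos])
        pos += 1
    return b_values


def column_b_from_own_block(block_b):
    """Series B stored in its own block: nothing else needs decoding."""
    return list(block_b)


# Three records: A = 5, 300, 0 and B = 0x41, 0x42, 0x43.
shared = bytes([0x05, 0x41, 0xAC, 0x02, 0x42, 0x00, 0x43])
separate_b = bytes([0x41, 0x42, 0x43])

assert column_b_from_shared_block(shared, 3) == column_b_from_own_block(separate_b)
```

In the shared layout, reading B alone still costs a full pass over A. With one block per series, B can be fetched and decompressed without touching anything else, which is the property the reply suggests mandating.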