Using the CRAM readers, is there a way to read only blocks from specific data series #239

markjschreiber · 2024-02-29T19:57:25Z

markjschreiber
Feb 29, 2024

I would like to know if there is a way to iterate over the blocks of a CRAM file that belong to a specific data series, so that when I only need information from one or a few data series I can skip reading the blocks that are not relevant. Effectively, I want to read CRAM in a vertical (or columnar) fashion.

Based on this image and my limited understanding of CRAM I think it would be possible to use the information in headers to seek to the appropriate offsets but I didn't see any obvious functions in the CRAM module that might do this.

If this is not currently supported but might be possible, can you suggest some ideas on how I might implement it? I would love to add this kind of functionality.

Answered by zaeleus


zaeleus · 2024-02-29T22:25:21Z

zaeleus
Feb 29, 2024
Maintainer

Selection of data series is not supported in noodles-cram.

It's not apparent in your diagram, but the main issue is that data series are not guaranteed to be independent. See § 8.2.1 "Data blocks" (2023-05-22):

Please note that external blocks can have multiple data series associated with them; in this case the values from these data series will be interleaved.

This creates a constraint that naturally transposes the data into row-based records. It becomes even more complex if the reads are segmented, as some fields are required to be decoded to fully resolve all the records in a slice.

Sorry, I don't have a good solution to this problem.
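To illustrate the interleaving constraint quoted from the spec, here is a minimal sketch. It uses a made-up toy layout (a one-byte "flag" series and a length-prefixed "name" series sharing one block), not the actual CRAM wire format or any noodles API. The point is only that pulling out one series still requires parsing the other to find where the next value starts.

```python
from typing import List, Tuple


def encode_interleaved(records: List[Tuple[int, bytes]]) -> bytes:
    """Write two hypothetical data series into ONE shared block, record by record."""
    block = bytearray()
    for flag, name in records:
        block.append(flag)       # series A: one byte per record
        block.append(len(name))  # series B: one-byte length prefix...
        block += name            # ...followed by the payload
    return bytes(block)


def decode_series_a(block: bytes) -> List[int]:
    """Extract only series A from the shared block.

    Series B cannot simply be skipped by seeking: its length prefix has to be
    read for every record to locate the next series A value.
    """
    values = []
    offset = 0
    while offset < len(block):
        values.append(block[offset])  # series A value
        offset += 1
        name_len = block[offset]      # must parse series B's length
        offset += 1 + name_len        # step over the series B payload
    return values


records = [(1, b"read1"), (2, b"r2"), (3, b"read_three")]
block = encode_interleaved(records)
flags = decode_series_a(block)
```

If each series had its own block, `decode_series_a` could read its block alone and never touch the bytes of the other series.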

4 replies

markjschreiber Feb 29, 2024
Author

Ah, got it. So data in at least some blocks may be sequentially written by record rather than by series?

zaeleus Feb 29, 2024
Maintainer

Yes, most encoders will transform a group of records into data series sequentially by record. The codec selection will typically determine whether or not there will be interleave, but there's nothing stopping an encoder from reusing a particular data series either. This is why I mentioned that data series are not guaranteed to be independent.

You may be more interested in htslib's CRAM implementation. It does have a field selector (named required_fields in samtools), but I am not sure how it works internally or whether it's making assumptions about the data.

markjschreiber Feb 29, 2024
Author

Are there certain data series that are always written into core data blocks vs external data blocks or can that vary from file to file? From what I can see in the spec is that there

markjschreiber Feb 29, 2024
Author

The required_fields doc suggests that it avoids decoding of the fields that are not required which would save on CPU time but the bytes would still be read (and effectively thrown away). Although not exactly what I am looking for this might be a good compromise. In noodles, perhaps that would mean using some kind of Reader that can be configured to only bother decoding certain fields?
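For reference, a small sketch of how a `required_fields` value could be assembled. The bit values below are my understanding of htslib's `sam_fields` enum and should be checked against the htslib version in use; the helper names `required_fields_mask` and `samtools_option` are hypothetical and exist only for this example.

```python
# Assumed bit values, taken from htslib's sam_fields enum as I understand it.
SAM_FIELDS = {
    "QNAME": 0x1, "FLAG": 0x2, "RNAME": 0x4, "POS": 0x8, "MAPQ": 0x10,
    "CIGAR": 0x20, "RNEXT": 0x40, "PNEXT": 0x80, "TLEN": 0x100,
    "SEQ": 0x200, "QUAL": 0x400, "AUX": 0x800, "RGAUX": 0x1000,
}


def required_fields_mask(names):
    """OR together the bits for the SAM columns the caller actually needs."""
    mask = 0
    for name in names:
        mask |= SAM_FIELDS[name]  # raises KeyError on an unknown column name
    return mask


def samtools_option(names):
    """Format the mask as a samtools input format option string."""
    return f"--input-fmt-option required_fields={required_fields_mask(names):#x}"


# Example: a flagstat-like tool that only needs FLAG and POS.
option = samtools_option(["FLAG", "POS"])
```

As noted above, this only limits what is decoded; whether the bytes of the other blocks are still read from disk depends on the implementation.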

jkbonfield · 2024-03-01T09:20:42Z

jkbonfield
Mar 1, 2024

As the author of the htslib (originally io_lib) required_fields logic, I'll give some background.

CRAM can multiplex multiple data series into a single block (originally this was most commonly done in CORE, but it's also true for EXTERNAL blocks). The original implementations in fact worked this way. In some cases this also gives better compression.
Generally, however, most encoders now write each data series to its own block. This permits partial decoding.
There is not, however, a one-to-one correspondence between SAM columns and CRAM data series. For some it's easy (POS, QUAL, etc.), but for others it's highly challenging. SEQ, for example, is spread over a whole slew of different data series, as the sequence is encoded very differently, mixing CIGAR and SEQ together plus an optional delta encoding process.

This means simply returning a SAM column isn't entirely trivial, but you could fake up a column by decoding only bits you want and stitching things together. Returning a CRAM column is obviously much easier and something the format could return (provided they're not multiplexed with something else), but it's sometimes of limited use.

So that background aside, how does required_fields work? The htslib API, if we can even call it that, is to present the user with a structure holding the raw BAM encoded bytes of an alignment record (uncompressed). It's inherently row-by-row serialised. Not efficient, but the API predates CRAM and we work with what we've got. However for strictly read-only operations, we may not need read names, quality scores, aux tags, etc (eg samtools flagstat). So we can direct CRAM to just invent default values for all the fields we're not interested in so the bam structure returned is compliant, but we've done less work.

1. The user asks for a SAM column, and we turn that into a nominal set of CRAM data series. Maybe 1, maybe many for SEQ.
2. We query the container headers to work out which data series are co-located with which other data series. If we asked for POS but for some odd reason that's in the CORE block along with DL (length of deletion), then we now have an additional dependency on decoding DL, as we have to disentangle one from the other. Note this can be encoding dependent too, as some encodings have multiple elements which could be encoded in different places. E.g. BYTE_ARRAY_LEN has a series for the data and another series for the lengths. There is no strict requirement that they are co-located.
3. We also know the logic of the CRAM specification. We have no idea which CRAM records will be decoding DL in isolation. To work that out, we have to iterate over sequence features. That means we now have an additional dependency on FN (number of features) and FC (feature codes).
4. Implementation specific: my piece of code that reads these happens to be the same code that decodes the sequence in general, and I don't have a way to decode some features and not others, as the required_fields granularity is too coarse. So I'll probably add in extra things like BA, BS, etc. to the dependency list, simply because I know I cannot get to the code path that decodes FC, FN, DL without also decoding other series.
5. At the end of this, we have now expanded our list of data series, but we need to check again whether they themselves have file-format-based dependencies due to co-locating with yet more series, so go back to step 2.
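The loop described above is essentially a fixed-point closure over two kinds of dependency: co-location in a block and dependencies implied by the specification's decoding logic. The following is a rough sketch of that idea only, not htslib's actual code. The function name, the `block_of` and `logical_deps` inputs, and the block layout are all assumptions made up for illustration, and it simplifies by giving each series exactly one block (which is not true for encodings such as BYTE_ARRAY_LEN).

```python
from typing import Dict, Set


def expand_dependencies(
    requested: Set[str],
    block_of: Dict[str, int],
    logical_deps: Dict[str, Set[str]],
) -> Set[str]:
    """Grow the requested set of data series until nothing new is added."""
    # Invert series -> block into block -> co-located series.
    series_in_block: Dict[int, Set[str]] = {}
    for series, block_id in block_of.items():
        series_in_block.setdefault(block_id, set()).add(series)

    needed = set(requested)
    changed = True
    while changed:  # repeat until a full pass adds nothing ("go back to step 2")
        changed = False
        for series in list(needed):
            extra: Set[str] = set()
            if series in block_of:
                # Anything sharing a block must be decoded to disentangle it.
                extra |= series_in_block[block_of[series]]
            # Dependencies that come from the decoding logic itself.
            extra |= logical_deps.get(series, set())
            new = extra - needed
            if new:
                needed |= new
                changed = True
    return needed


# Hypothetical container: AP (the position series, i.e. the POS column)
# shares block 0 with DL; everything else has its own block.
block_of = {"AP": 0, "DL": 0, "FN": 1, "FC": 2, "BA": 3, "QS": 4, "RN": 5}
# Decoding DL requires walking the sequence features (FN, FC).
logical_deps = {"DL": {"FN", "FC"}}

result = expand_dependencies({"AP"}, block_of, logical_deps)
```

With this layout a request for the position series alone drags in DL, FN and FC, while QS, RN and BA stay untouched. If every series had its own block, the result would be just the requested set plus its logical dependencies.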

All of this is highly complex, but implemented in https://github.com/samtools/htslib/blob/develop/cram/cram_decode.c#L542-L868

The original version of this came from Staden io_lib, and at some point at least I had a hacked up encoder that could do random assignment of encodings and destinations, changing on every container. So I was fairly confident back then at least that it was robust. I'm less confident now that it hasn't bit-rotted and errors crept in, but the flip side is all modern CRAMs are avoiding cross-contamination of data-series by using discrete blocks for each one.

If I ever do a CRAM v4, I think this ought to be mandated (and remove the CORE block too), so this sort of logic doesn't have to be excessively tortuous.

1 reply

markjschreiber Mar 1, 2024
Author

Wow! That is complicated. I would also vote (as often as allowed) for strictly one block per series. It might make things a little larger in size but decoding would be much simpler.
