Recommendations about chunk sizes #22

jeromekelleher · 2024-05-22T10:49:33Z

We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really does help a lot if call-level arrays all have the same chunking (in the variants and samples dimensions) so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop.
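To illustrate why shared chunking helps, here is a minimal pure-Python sketch. It assumes two call-level arrays (called `call_genotype` and `call_DP` here purely for illustration) are chunked identically along the variants dimension. The `chunk_slices` helper, the toy data and the chunk size are hypothetical and are not part of vcf2zarr or Zarr.

```python
def chunk_slices(length, chunk_size):
    """Yield slices covering range(length) in steps of chunk_size."""
    for start in range(0, length, chunk_size):
        yield slice(start, min(start + chunk_size, length))


# Toy stand-ins for two call-level arrays: 10 variants x 3 samples.
num_variants, num_samples = 10, 3
call_genotype = [[v % 2] * num_samples for v in range(num_variants)]
call_DP = [[10 + v] * num_samples for v in range(num_variants)]

# Assumed shared chunk size along the variants dimension.
variants_chunk_size = 4

# Because both arrays share the same chunk boundaries, a single loop can
# fetch the matching chunk of each array and process them together.
total_depth_at_alt_calls = 0
for s in chunk_slices(num_variants, variants_chunk_size):
    gt_chunk = call_genotype[s]
    dp_chunk = call_DP[s]
    for gt_row, dp_row in zip(gt_chunk, dp_chunk):
        for gt, dp in zip(gt_row, dp_row):
            if gt == 1:
                total_depth_at_alt_calls += dp
```

If the two arrays had different chunk boundaries, each iteration would need to fetch and stitch together partial chunks from one of them, which is the overhead that uniform chunking avoids.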

Currently vcf2zarr enforces a uniform chunk size per dimension across arrays, so that we have a single variants_chunk_size. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., variant_position). See #21 for discussion and some benchmarks on this point.

This would need some feedback from a variety of implementations and use-cases, I think.


tomwhite · 2024-07-19T11:50:13Z

Fields like variant_position and variant_contig are essentially coordinate indexes, so there is a case for storing them in a single chunk since all values need to be accessible at once. (Xarray for example reads all coordinates into memory.)

I don't know of any cases where they need to have the same chunking as other variant fields, but if there are any then it should be straightforward to rechunk into many chunks (easier than the reverse, reading lots of chunks into a single array, as #21 showed).

@jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

jeromekelleher · 2024-07-22T10:09:34Z

> @jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

Very easy. But let's consider the limit cases here first: what size array would we need to store an entire human genome where we've got calls at every base? (We're approaching this limit with large datasets.)

For 3.1 Gb we get a variant_position array of around 12 GB, so reading that in a single chunk just isn't feasible, and certainly not as a low-latency way of getting at small chunks of data. We will have to tackle the proper indexing of the (contig, position) values at some point (#21, #23), and I think it's probably best if we do so now to support vcztools view.
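The arithmetic behind that estimate can be sketched as follows. It assumes positions are stored as 4-byte integers and ignores compression. The chunk size of 10,000 variants is an illustrative value only, not a recommendation.

```python
# One call per base across a ~3.1 Gb human genome.
genome_length = 3_100_000_000

# Assumption: positions stored as 4-byte (int32) values, uncompressed.
bytes_per_position = 4

array_bytes = genome_length * bytes_per_position
array_gb = array_bytes / 1e9  # decimal gigabytes, roughly 12.4

# Illustrative chunk size (hypothetical); ceiling division gives the number
# of chunks that would need reading to load the whole array.
variants_chunk_size = 10_000
num_chunks = -(-genome_length // variants_chunk_size)
```

Under these assumptions, the single-chunk option means pulling about 12 GB to answer any query, while the fully chunked option means touching hundreds of thousands of chunks to load the whole array. That gap is the motivation for a separate index.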

tomwhite · 2024-07-22T10:17:25Z

I agree. Let's use vcztools view to try different implementations - standardization can come later.
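As one possible shape for such an experiment, here is a hedged sketch of a per-chunk position index. It is not the vcztools implementation. The function names are hypothetical, and it assumes a single contig with positions sorted in ascending order. The idea is to store the first and last position of each chunk, so that a region query only reads the chunks that can overlap it.

```python
import bisect


def build_chunk_index(positions, chunk_size):
    """Return a (first, last) position pair for each chunk of a sorted list."""
    index = []
    for start in range(0, len(positions), chunk_size):
        chunk = positions[start:start + chunk_size]
        index.append((chunk[0], chunk[-1]))
    return index


def chunks_overlapping(index, region_start, region_end):
    """Return indices of chunks whose span overlaps [region_start, region_end]."""
    firsts = [first for first, _ in index]
    lasts = [last for _, last in index]
    # First chunk whose last position is >= region_start.
    lo = bisect.bisect_left(lasts, region_start)
    # One past the last chunk whose first position is <= region_end.
    hi = bisect.bisect_right(firsts, region_end)
    return list(range(lo, hi))


# Toy data: nine sorted positions on one contig, three per chunk.
positions = [100, 150, 200, 300, 450, 500, 900, 1000, 1200]
index = build_chunk_index(positions, chunk_size=3)
hits = chunks_overlapping(index, 250, 950)
```

The index holds one pair per chunk rather than one value per variant, so it stays small enough to load in full even when variant_position itself does not.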

jeromekelleher mentioned this issue Jul 8, 2024

Basic regions support sgkit-dev/vcztools#16

Closed

Will-Tyler mentioned this issue Sep 27, 2024

vcztools query: optimize query evaluation sgkit-dev/vcztools#84

Open
