Recommendations about chunk sizes #22

Open
jeromekelleher opened this issue May 22, 2024 · 3 comments

Comments

@jeromekelleher
Contributor

We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really does help a lot if call-level arrays all have the same chunking (in the variants and samples dimensions), so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop.
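As a rough illustration of the "same loop" point, here is a minimal sketch in which NumPy arrays stand in for two call-level Zarr arrays. The array names, shapes, and the chunk size of 4 are assumptions made for the example, not values taken from vcf2zarr.

```python
import numpy as np

# Stand-ins for two call-level arrays; with Zarr these would be arrays opened
# from a store, and the slicing below would work the same way.
num_variants, num_samples = 10, 4
call_genotype = np.zeros((num_variants, num_samples, 2), dtype=np.int8)
call_DP = np.full((num_variants, num_samples), 30, dtype=np.int16)

# Assumed shared chunk size in the variants dimension (hypothetical value).
variants_chunk_size = 4


def iter_variant_chunks(num_variants, chunk_size):
    """Yield (start, stop) bounds of each chunk along the variants dimension."""
    for start in range(0, num_variants, chunk_size):
        yield start, min(start + chunk_size, num_variants)


chunk_bounds = list(iter_variant_chunks(num_variants, variants_chunk_size))
for start, stop in chunk_bounds:
    # Both arrays share the same chunk boundaries, so each slice covers whole
    # chunks in both arrays and no stored chunk has to be read twice.
    gt = call_genotype[start:stop]
    dp = call_DP[start:stop]
    assert gt.shape[0] == dp.shape[0]
```

If the two arrays had different chunk sizes in the variants dimension, some slices would straddle chunk boundaries in one of them, and the same stored chunk would be fetched and decompressed more than once.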

Currently vcf2zarr enforces a uniform chunk size across dimensions, so that we have one variants_chunk_size. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., variant_position). See #21 for discussion and some benchmarks on this point.

This would need some feedback from a variety of implementations and use-cases, I think.

@tomwhite
Collaborator

Fields like variant_position and variant_contig are essentially coordinate indexes, so there is a case for storing them in a single chunk since all values need to be accessible at once. (Xarray for example reads all coordinates into memory.)
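To illustrate why having every position available at once is convenient, here is a minimal sketch of a region lookup. It assumes a single contig with sorted positions already loaded in memory; `region_to_chunks`, the sample positions, and the chunk size of 3 are hypothetical and are not part of vcf2zarr or vcztools.

```python
import numpy as np

# Hypothetical sorted positions for one contig, fully loaded in memory.
variant_position = np.array([100, 250, 250, 900, 1500, 4000, 4100], dtype=np.int32)
variants_chunk_size = 3  # assumed chunk size of the other variant arrays


def region_to_chunks(positions, start, end, chunk_size):
    """Return the variant index range and the chunk indexes covering
    all positions in the closed interval [start, end]."""
    lo = int(np.searchsorted(positions, start, side="left"))
    hi = int(np.searchsorted(positions, end, side="right"))
    if lo == hi:
        # No variants fall inside the region.
        return (lo, hi), []
    first_chunk = lo // chunk_size
    last_chunk = (hi - 1) // chunk_size
    return (lo, hi), list(range(first_chunk, last_chunk + 1))


index_range, chunks = region_to_chunks(variant_position, 250, 1500, variants_chunk_size)
```

With the coordinate array in memory, two binary searches are enough to decide which chunks of the larger arrays need to be read for a region query.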

I don't know of any cases where they need to have the same chunking as other variant fields, but if there are any then it should be straightforward to rechunk to lots of chunks (easier than doing the reverse of reading lots of chunks into a single array, as #21 showed).

@jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

@jeromekelleher
Contributor Author

jeromekelleher commented Jul 22, 2024

> @jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

Very easy. But let's consider the limit cases here first: what size array would we need to store an entire human genome where we've got calls at every base? (We're approaching this limit with large datasets.)

For 3.1Gb we get a variant_position array of around 12GB - so reading that in a single chunk just isn't feasible, and certainly not as a low-latency way of getting at small chunks of data. We will have to tackle the proper indexing of the (contig, position) values at some point (#21, #23), and I think it's probably best if we do so now to support vcztools view.
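The size estimate above can be reproduced with a few lines of arithmetic. This sketch assumes one variant per base and positions stored as 4-byte integers, and it ignores compression. The chunk size of 10,000 is a hypothetical value chosen for illustration.

```python
# Assumptions: a call at every base, positions stored as 4-byte integers,
# and sizes measured before compression.
genome_length = 3_100_000_000
bytes_per_position = 4

total_bytes = genome_length * bytes_per_position
total_gb = total_bytes / 1e9  # roughly 12.4 GB, matching "around 12GB"

# Hypothetical chunk size in the variants dimension.
variants_chunk_size = 10_000
num_chunks = -(-genome_length // variants_chunk_size)  # ceiling division
chunk_bytes = variants_chunk_size * bytes_per_position
```

Under these assumptions, a single-chunk array forces a reader to fetch all of `total_bytes` to look up any position, whereas a chunked array costs `chunk_bytes` per chunk touched but needs some form of index to find the right chunk among `num_chunks`.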

@tomwhite
Collaborator

I agree. Let's use vcztools view to try different implementations - standardization can come later.
