Recommendations about chunk sizes #22
Comments
Fields like these: I don't know of any cases where they need to have the same chunking as other variant fields, but if there are any then it should be straightforward to rechunk into lots of chunks (easier than doing the reverse of reading lots of chunks into a single array, as #21 showed). @jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single-chunk Zarr arrays?
Very easy. But let's consider the limit cases here first: what size array would we need to store an entire human genome where we've got calls at every base? (We're approaching this limit with large datasets.) For 3.1Gb we get a
I agree. Let's use vcztools view to try different implementations - standardization can come later.
We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really helps if call-level arrays all have the same chunking (in the variants and samples dimensions), so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop.
Currently vcf2zarr enforces a uniform chunk size across dimensions, so that we have one `variants_chunk_size`. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., `variant_position`). See #21 for discussion and some benchmarks on this point.

This would need some feedback from a variety of implementations and use-cases, I think.