Add Norway spruce dataset case study #169

jeromekelleher · 2024-09-27T09:05:27Z

@percyfal has applied bio2zarr to a spruce dataset, and it would be an excellent addition to the paper as a further case study.

A notable thing about Norway spruce is the genome size (19Gbp). AFAIK, this means that individual chromosomes overflow the VCF limit of 32-bit integers. As well as the usual discussion about compression etc., it would be nice to show here that we can overcome these limits by updating the variant_position and contig arrays in place, so that the true genome coordinates are used, and that downstream tools can work on this unmodified.

This is also a good place to demonstrate the one-big-whole-genome Zarr, I think?

@percyfal can you update here with details about the dataset, runtimes and the output of vcf2zarr inspect please?

percyfal · 2024-09-27T12:03:44Z

Norway spruce has 12 chromosomes totaling some 19Gbp. The assembly consists of more contigs, but I only focus on the main chromosome-level autosomes here. For reasons related to the overflow Jerome mentioned, the VCF files have been generated in 100Mbp chunks over each chromosome, such that there are 160+ files in total. It would indeed be beneficial if we could work in real chromosome coordinates instead of, as now, having to convert between different systems.

I attach the output of vcf2zarr inspect for one test run for both ICF and VCZ. I will provide more information on runtimes when I process the data. Suffice it to say that using 128 cores (256 threads) it took only a couple of minutes to convert a 50GiB VCF to ICF and VCZ.

PA_chr01_1.vcz.inspect.txt
PA_chr01_1.icf.inspect.txt

jeromekelleher · 2024-09-27T12:11:34Z

That's fantastic @percyfal! Could we try doing all 160 files at once?? That might need distributing over the cluster, or if you're happy to wait, a few days on your big machine.

percyfal · 2024-09-27T12:23:07Z

I'll see if I can get this going before the weekend; the bottleneck is more likely the conversion of gzipped VCFs to bgzip.

percyfal · 2024-09-27T12:26:28Z

Just so I'm clear: when you say all VCFs in one go, you mean making one ICF store that is then encoded as VCZ?

jeromekelleher · 2024-09-27T15:05:39Z

Yeah, so convert all the VCFs to ICF at the same time using vcf2zarr explode [VCF1 VCF2...]. If that's too much work at once, we could just use all the VCFs for one chromosome. It's much easier to stitch the coordinates for a chromosome together if we convert to a single Zarr.

There's no rush at all.

percyfal · 2024-09-30T13:45:32Z

I ran vcf2zarr explode and vcf2zarr encode on 165 input VCF files (7.4T) consisting of variant calls on 100Mbp subsets of 12 chromosomes and >1000 samples. Computations were done on a cluster partition that had 1832960 MB of RAM and 128 cores (256 threads). Commands were submitted with Snakemake, which recorded benchmarks of the runs.

vcf2zarr explode:

s           h:m:s     max_rss    max_vms     max_uss    max_pss    io_in      io_out      mean_load  cpu_time
49544.4538  13:45:44  897779.69  6211026.19  896217.78  896223.65  477032.95  7101523.05  18184.83   9018678.00

vcf2zarr encode:

s           h:m:s     max_rss    max_vms     max_uss    max_pss    io_in       io_out      mean_load  cpu_time
67886.8015  18:51:26  212722.57  5525147.89  208617.44  208634.48  7116127.48  6929164.32  21186.84   14385484.06

vcf2zarr inspect icf:
spruce.icf.inspect.txt
vcf2zarr inspect vcz:
spruce.vcz.inspect.txt
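
For readers who want to sanity-check the two benchmark rows above, here is a minimal sketch of the arithmetic. It assumes the columns follow the usual Snakemake benchmark conventions, with `s` as wall-clock seconds and `cpu_time` as CPU seconds; the helper names `format_hms` and `busy_threads` are made up for illustration and are not part of Snakemake or bio2zarr.

```python
# Values copied from the explode and encode benchmark rows in this comment.
benchmarks = {
    "explode": {"s": 49544.4538, "cpu_time": 9018678.00},
    "encode": {"s": 67886.8015, "cpu_time": 14385484.06},
}


def format_hms(seconds):
    """Render a duration in seconds as h:mm:ss, truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def busy_threads(row):
    """Average number of busy threads, ASSUMING both fields are in seconds."""
    return row["cpu_time"] / row["s"]


total_seconds = sum(row["s"] for row in benchmarks.values())

for step, row in benchmarks.items():
    print(step, format_hms(row["s"]), round(busy_threads(row), 1))
print("total", format_hms(total_seconds))
```

Under those assumptions the two steps kept roughly 182 and 212 of the 256 threads busy on average, which is in line with the reported mean_load values if that column is read as a percentage.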

jeromekelleher · 2024-09-30T13:51:13Z

Awesome! How are you currently dealing with the coordinate overflow problem in VCF? It would be nice to describe how to fix this up, and how long it takes to run (should be a fraction of a second).
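
As a rough illustration of what such a fix-up could look like, here is a sketch that stitches chunk-local records back onto whole-chromosome coordinates. It is not the method used for the spruce data. It makes several assumptions that the thread does not confirm: that contig names look like `PA_chr01_1` (guessed from the attachment names), that the trailing number is a 1-based chunk index, that each chunk covers exactly 100Mbp, and that positions are relative to the start of their chunk. Plain Python lists stand in for the `contig_id`, `variant_contig` and `variant_position` arrays of a VCF Zarr store; the function names are hypothetical.

```python
CHUNK_SIZE = 100_000_000  # 100 Mbp per chunk, as described in the thread


def split_chunk_name(name):
    """Split an ASSUMED 'PA_chr01_3' style name into ('PA_chr01', 3)."""
    chrom, _, index = name.rpartition("_")
    return chrom, int(index)


def stitch_coordinates(contig_id, variant_contig, variant_position):
    """Map chunk-local records onto whole-chromosome coordinates.

    contig_id: chunk names, e.g. ['PA_chr01_1', 'PA_chr01_2', ...]
    variant_contig: per-variant index into contig_id
    variant_position: per-variant position within its chunk (assumed)
    Returns (new_contig_id, new_variant_contig, new_variant_position).
    """
    new_contig_id = []
    chrom_index = {}
    chunk_to_chrom = []  # chunk index -> index into new_contig_id
    chunk_offset = []    # chunk index -> offset to add to positions
    for name in contig_id:
        chrom, k = split_chunk_name(name)
        if chrom not in chrom_index:
            chrom_index[chrom] = len(new_contig_id)
            new_contig_id.append(chrom)
        chunk_to_chrom.append(chrom_index[chrom])
        chunk_offset.append((k - 1) * CHUNK_SIZE)
    new_variant_contig = [chunk_to_chrom[c] for c in variant_contig]
    new_variant_position = [
        chunk_offset[c] + p for c, p in zip(variant_contig, variant_position)
    ]
    return new_contig_id, new_variant_contig, new_variant_position


# Toy data: two chunks of one chromosome and one chunk of another.
contig_id = ["PA_chr01_1", "PA_chr01_2", "PA_chr02_1"]
variant_contig = [0, 1, 1, 2]
variant_position = [10, 5, 99_999_999, 42]
result = stitch_coordinates(contig_id, variant_contig, variant_position)
print(result)
```

In a real store the same remapping would be applied to the on-disk arrays, and the position array would need a 64-bit dtype wherever the stitched values exceed 2**31 - 1. Since the work is one lookup and one addition per variant, it is cheap compared with explode and encode.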

jeromekelleher mentioned this issue Sep 27, 2024

Add Per Unneberg to author list #170

Closed

percyfal mentioned this issue Nov 14, 2024

Add subsection on spruce case study (DRAFT) #176

Merged
