VCF-Zarr spec does not support partial phasing following the VCF 4.4 spec #24

Open
timothymillar opened this issue Jul 1, 2024 · 3 comments

@timothymillar
Collaborator

The VCF 4.4 spec allows a leading symbol that indicates the phasing of the first allele. For example, /0/1 is a valid genotype. This makes partially phased diploid genotypes such as |0/1 possible. The current VCF-Zarr spec encodes phasing using a single bool per call, which implicitly assumes that a genotype is either completely unphased or completely phased.
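To make the information loss concrete, here is a minimal sketch that turns a GT string into one phasing flag per allele. The helper name `phasing_flags` is hypothetical, and the handling of an omitted leading symbol (treated as "/" if any other symbol is "/", otherwise "|") is my reading of the VCF 4.4 text, so please check it against the spec.

```python
import re


def phasing_flags(gt: str) -> list[bool]:
    """Return one flag per allele: True if that allele is marked phased.

    Hypothetical helper for illustration; not part of any VCF-Zarr tooling.
    """
    # Split into (symbol, allele) pairs; the first symbol may be empty.
    tokens = re.findall(r"([/|]?)([^/|]+)", gt)
    symbols = [symbol for symbol, _ in tokens]
    if symbols[0] == "":
        # Assumed VCF 4.4 rule: an omitted leading symbol is "/" if any
        # other symbol in the genotype is "/", and "|" otherwise.
        symbols[0] = "/" if "/" in symbols[1:] else "|"
    return [symbol == "|" for symbol in symbols]


for gt in ["0/1", "|0/1", "/0|1", "0|1"]:
    flags = phasing_flags(gt)
    # Collapsing to a single bool (here: "all alleles phased") cannot
    # distinguish the partially phased cases from the unphased one.
    print(gt, flags, all(flags))
```

With a single bool, `0/1`, `|0/1` and `/0|1` all collapse to the same value under this rule, which is the gap described above.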

This may also have been an issue in earlier versions of the VCF spec, where a partially phased polyploid could have been encoded (e.g., 0/1|1/2). However, this isn't explicitly allowed in the 4.3 spec AFAICT.

@timothymillar timothymillar changed the title from "VCF-Zarr spec does not support partial phasing" to "VCF-Zarr spec does not support partial phasing following the VCF 4.4 spec" Jul 1, 2024
@tomwhite
Collaborator

tomwhite commented Jul 1, 2024

We could change call_genotype_phased to have shape (variants, samples, ploidy) to support partial phasing. We could also support a shape of (variants, samples) for backwards compatibility.
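As a rough sketch of how a reader could accept both shapes, the snippet below uses nested Python lists as stand-ins for the Zarr arrays. The function names are hypothetical, and broadcasting a legacy per-call bool across all alleles is an assumption about how the two forms would be reconciled, not something the spec currently defines.

```python
def is_per_allele(phased):
    """True if entries are per-allele sequences, i.e. (variants, samples, ploidy)."""
    return isinstance(phased[0][0], (list, tuple))


def expand_legacy_phased(phased, ploidy):
    """Broadcast a (variants, samples) grid of bools across the ploidy axis."""
    return [[[bool(flag)] * ploidy for flag in row] for row in phased]


def normalize_phased(phased, ploidy):
    """Return phasing flags in (variants, samples, ploidy) form for either input."""
    if is_per_allele(phased):
        return phased
    return expand_legacy_phased(phased, ploidy)


# Legacy form: one variant, two samples, one bool per call.
legacy = [[True, False]]
# Per-allele form: the first sample is partially phased (|0/1).
per_allele = [[[True, False], [False, False]]]

print(normalize_phased(legacy, ploidy=2))
print(normalize_phased(per_allele, ploidy=2))
```

With real arrays the same check would presumably be on the number of dimensions rather than on the element type.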

@tomwhite
Collaborator

tomwhite commented Jul 1, 2024

BTW I just created a vcf-4.4 label for this - there may be other changes we should track.

@jeromekelleher
Contributor

jeromekelleher commented Jul 1, 2024

I think we should consider adding a call_genotype_phase field of type integer which explicitly assigns a phase (0, ..., ploidy - 1) to each call. This would allow us to add estimated phase to datasets after the fact, rather than requiring a whole new dataset to be created when we run phasing algorithms. Ultimately this is where we want to get to with large biobanks (you could imagine having both call_genotype_phase_beagle and call_genotype_phase_shapeit stored).

There's some complexity here with how to interact with the PS and PSL fields I haven't got my head around, though.
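One possible reading of the integer phase field, sketched in plain Python: the stored genotype stays in its original allele order, and each phase array says which haplotype (0, ..., ploidy - 1) each allele belongs to. The function `apply_phase`, the example values, and the permutation requirement are all assumptions for illustration; the sketch ignores PS and PSL entirely and does not say how an unphased call would be represented.

```python
def apply_phase(alleles, phase):
    """Place allele i on haplotype phase[i] and return the haplotype-ordered call.

    Assumes phase is a permutation of 0..ploidy-1 (hypothetical convention).
    """
    if sorted(phase) != list(range(len(alleles))):
        raise ValueError("phase must be a permutation of 0..ploidy-1")
    haplotypes = [None] * len(alleles)
    for allele, slot in zip(alleles, phase):
        haplotypes[slot] = allele
    return haplotypes


# One stored diploid call, with two made-up phase estimates kept alongside it.
genotype = [0, 1]
phase_beagle = [0, 1]
phase_shapeit = [1, 0]

print(apply_phase(genotype, phase_beagle))   # [0, 1]
print(apply_phase(genotype, phase_shapeit))  # [1, 0]
```

Under this reading the genotype array is never rewritten; adding or replacing a phase estimate only touches the corresponding phase array.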
