Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify status of negative integers in spec #12

Open
jeromekelleher opened this issue Feb 20, 2024 · 5 comments
Open

Clarify status of negative integers in spec #12

jeromekelleher opened this issue Feb 20, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@jeromekelleher
Copy link
Contributor

We use -1 to denote missing data and -2 to denote dimension padding for integer fields. This means that we cannot use these as normal values in the VCF.

Clarify what users and implementations are supposed to do in the event negative integer data.

@jeromekelleher jeromekelleher added the bug Something isn't working label Feb 20, 2024
@jeromekelleher
Copy link
Contributor Author

jeromekelleher commented Feb 20, 2024

A reasonable thing to do may be to say "we don't support negative intetgers use a Float instead". That would probably lead to fewer errors, as I think the assumption that missing data is < 0 is baked in pretty well to sgkit now (and it's a useful assumption!)

@tomwhite
Copy link
Collaborator

tomwhite commented Mar 5, 2024

We've used the float values from BCF for missing and fill, so perhaps we should consider what it would take to align with the BCF encoding for missing and fill values for integers?

These are dependent on the width of the integer, so for 8-bit ints missing is 0x80, whereas for 16-bit ints it is 0x8000, but they still satisfy < 0 (since they are signed ints).

@jeromekelleher
Copy link
Contributor Author

I think the approach we have take is is less likely to result in user-error, because distinguishing between missing and pad values requires knowing the type, and to me that would be a pernicious source of bugs in code that assume (e.g.) that genotypes are always 8 bits. Using < 0 as a blanket missingness mask seems wasteful, and would also lead to subtle bugs where missingness and padding are mixed up (e.g., when calculating the denominator in some stats).

That's how I would argue the point for the paper anyway.

@timothymillar
Copy link
Collaborator

I'm a bit torn about this but I think Jerome is correct regarding user error. Maybe we can eventually move to a custom vcfint dtype (see NEP42). That may allow for a user-safe implementation of the BCF sentinel values and a safe migration path. Depending on compatibility with Zarr.

@jeromekelleher
Copy link
Contributor Author

I think we need to think more widely beyond Numpy and python here as well - zarr is cross language, so we want code that can be compiled with these sentinels and be correct. In this context fixed values across integer types is much safer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants