-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify status of negative integers in spec #12
Comments
A reasonable thing to do may be to say "we don't support negative intetgers use a Float instead". That would probably lead to fewer errors, as I think the assumption that missing data is < 0 is baked in pretty well to sgkit now (and it's a useful assumption!) |
We've used the float values from BCF for missing and fill, so perhaps we should consider what it would take to align with the BCF encoding for missing and fill values for integers? These are dependent on the width of the integer, so for 8-bit ints missing is 0x80, whereas for 16-bit ints it is 0x8000, but they still satisfy < 0 (since they are signed ints). |
I think the approach we have take is is less likely to result in user-error, because distinguishing between missing and pad values requires knowing the type, and to me that would be a pernicious source of bugs in code that assume (e.g.) that genotypes are always 8 bits. Using < 0 as a blanket missingness mask seems wasteful, and would also lead to subtle bugs where missingness and padding are mixed up (e.g., when calculating the denominator in some stats). That's how I would argue the point for the paper anyway. |
I'm a bit torn about this but I think Jerome is correct regarding user error. Maybe we can eventually move to a custom |
I think we need to think more widely beyond Numpy and python here as well - zarr is cross language, so we want code that can be compiled with these sentinels and be correct. In this context fixed values across integer types is much safer. |
We use -1 to denote missing data and -2 to denote dimension padding for integer fields. This means that we cannot use these as normal values in the VCF.
Clarify what users and implementations are supposed to do in the event negative integer data.
The text was updated successfully, but these errors were encountered: