Clarify status of negative integers in spec #12

jeromekelleher · 2024-02-20T12:42:01Z

We use -1 to denote missing data and -2 to denote dimension padding for integer fields. This means that we cannot use these as normal values in the VCF.

Clarify what users and implementations are supposed to do in the event negative integer data.

jeromekelleher · 2024-02-20T12:48:20Z

A reasonable thing to do may be to say "we don't support negative intetgers use a Float instead". That would probably lead to fewer errors, as I think the assumption that missing data is < 0 is baked in pretty well to sgkit now (and it's a useful assumption!)

tomwhite · 2024-03-05T10:00:13Z

We've used the float values from BCF for missing and fill, so perhaps we should consider what it would take to align with the BCF encoding for missing and fill values for integers?

These are dependent on the width of the integer, so for 8-bit ints missing is 0x80, whereas for 16-bit ints it is 0x8000, but they still satisfy < 0 (since they are signed ints).

jeromekelleher · 2024-03-05T10:58:20Z

I think the approach we have take is is less likely to result in user-error, because distinguishing between missing and pad values requires knowing the type, and to me that would be a pernicious source of bugs in code that assume (e.g.) that genotypes are always 8 bits. Using < 0 as a blanket missingness mask seems wasteful, and would also lead to subtle bugs where missingness and padding are mixed up (e.g., when calculating the denominator in some stats).

That's how I would argue the point for the paper anyway.

timothymillar · 2024-03-05T20:05:30Z

I'm a bit torn about this but I think Jerome is correct regarding user error. Maybe we can eventually move to a custom vcfint dtype (see NEP42). That may allow for a user-safe implementation of the BCF sentinel values and a safe migration path. Depending on compatibility with Zarr.

jeromekelleher · 2024-03-06T08:49:30Z

I think we need to think more widely beyond Numpy and python here as well - zarr is cross language, so we want code that can be compiled with these sentinels and be correct. In this context fixed values across integer types is much safer.

jeromekelleher added the bug Something isn't working label Feb 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify status of negative integers in spec #12

Clarify status of negative integers in spec #12

jeromekelleher commented Feb 20, 2024

jeromekelleher commented Feb 20, 2024 •

edited

Loading

tomwhite commented Mar 5, 2024

jeromekelleher commented Mar 5, 2024

timothymillar commented Mar 5, 2024

jeromekelleher commented Mar 6, 2024

Clarify status of negative integers in spec #12

Clarify status of negative integers in spec #12

Comments

jeromekelleher commented Feb 20, 2024

jeromekelleher commented Feb 20, 2024 • edited Loading

tomwhite commented Mar 5, 2024

jeromekelleher commented Mar 5, 2024

timothymillar commented Mar 5, 2024

jeromekelleher commented Mar 6, 2024

jeromekelleher commented Feb 20, 2024 •

edited

Loading