Safer Type System #4069
Replies: 7 comments 12 replies
-
I vote +1. I bumped into a few bugs because of this in the past as well. A few comments/questions:
Thanks for looking into this! |
Beta Was this translation helpful? Give feedback.
-
+1. Kudu (a realtime analytical storage system) has similar logical and physical type abstractions(code link1, 2). Logical type describes how to interpret user data while physical type determines the data encoding on storage. There could be more than one logical data types mapped to to the same physical type like STRING mapped to BINARY and TIMESTAMP/DECIMAL64 mapped to INT64. For logical type such as DECIMAL64 it has additional column attributes like precision and scale specified in table column schema. |
Beta Was this translation helpful? Give feedback.
-
Thanks @mbasmanova and @pedroerp. Agree with your points. Separating the PhysicalType and LogicalType will definitely make the TypeSystem more clearer to understand. The current code uses the TYPEKIND and Type very loosely when they represent different properties, so it does make the code contrived in many places. Making OPAQUE type abstract also helps understand the intent. We were having some internal discussions about if it could be used as a placeholder type if there is a Presto java type not supported in Velox. But maybe it would be cleaner to have a special type in presto_cpp to handle this without bringing that logic into Velox. Adding time and time with timezone types is relatively high-priority at Ahana and there are few engineers who are thinking of taking it on. We could plan some of the refactoring together if you'll are open to it. |
Beta Was this translation helpful? Give feedback.
-
There are a bunch of places where the physical type is used to infer the type right? How will scalar function registrations look with this change for example? |
Beta Was this translation helpful? Give feedback.
-
The LogicalType must also take care of defining max and min limits. For instance, the max limits of int128 and LONG_DECIMAL are not the same. This is another reason, we created UnscaledXXX structs to define the max and min limits of (-9999... to +9999..). |
Beta Was this translation helpful? Give feedback.
-
I migrated INTERVAL_DAY_TIME to logical type in #4406. In the process, I discovered an issue with migrating decimal to logical type. Decimal type is a parametric type with 2 integer parameters: precision and scale. The physical type used to represent decimal type depends on the value of precision: BIGINT (64-bit integer) if precision <= 18 ShortDecimalType is an instance of decimal type backed by BIGINT. Type signature should be the same for all decimal types:
, where p and s are integers. Currently, there are 3 type signatures:
The extract short_decimal and long_decimal type signatures are generated by simple function used for comparisons. A between(a, b) function should have a signature
However, currently, there are 2 signatures:
These signatures correspond to 2 separate implementations: one for short decimals and one for long decimals. These signatures are generated by the SimpleFunctionAdapter framework.
I looked into what happens if we make SimpleFunctionAdapter generate “duplicate” signatures:
I found that SimpleFunctionRegistry uses name + signature as a key in a map of implementations. Therefore, it is not possible to store 2 implementations that have the same signature.
I discussed with @laithsakka whether it is possible to enhance the registry to allow for storing multiple implementations that have the same logical signature (decimal(p, s)), but different physical signatures (short_decimal vs. long_decimal). We felt that this would be quite difficult and not clear whether this can be supported in a generic way. Therefore, we propose to re-write simple functions that use decimal types using VectorFunction API. Thoughts? |
Beta Was this translation helpful? Give feedback.
-
@mbasmanova, @aditi-pandit I came across your discussion over removing support for |
Beta Was this translation helpful? Give feedback.
-
In Velox we distinguish between physical and logical data types. Physical types describe how data is stored in memory: 32-bit integer, 64-bit integer, 32-bit floating point, StringView, etc. Logical types describe how to interpret the data: VARCHAR vs. JSON. TypeKind enum identifies the physical type, while instances of Type class identify the logical types. Each logical type maps to a single physical type, but a single physical type may be used by multiple logical types. The set of physical types is fixed and cannot be extended by a Velox application. Logical types are extensible, i.e. an application can introduce custom logical types using one of the predefined physical types.
At the moment, there are multiple places in the code base that assume a 1:1 mapping between physical and logical types. This assumption causes bugs in the presence of custom (application specific) types, e.g. JSON, HYPERLOGLOG, TIMESTAMP WITH TIME ZONE.
The following APIs are affected:
I suggest to remove these APIs and update the call sites to ensure we do not infer logical types from the physical types.
I would also suggest that we remove some of the TypeKind enum values: VARBINARY, DATE, INTERVAL_DAY_TIME, SHORT_DECIMAL, LONG_DECIMAL. I believe these were added in an attempt to allow for deterministic mapping from physical type to logical type, but that mapping is not possible anyway. I would then suggest to update the corresponding logical types to use the physical types as follows:
CC: @majetideepak @aditi-pandit @karteekmurthys @spershin @kKPulla @pedroerp @bikramSingh91 @laithsakka @kagamiori @xiaoxmeng
Beta Was this translation helpful? Give feedback.
All reactions