Safer Type System #4069

mbasmanova · 2023-02-17T03:24:53Z

mbasmanova
Feb 17, 2023
Collaborator

In Velox we distinguish between physical and logical data types. Physical types describe how data is stored in memory: 32-bit integer, 64-bit integer, 32-bit floating point, StringView, etc. Logical types describe how to interpret the data: VARCHAR vs. JSON. TypeKind enum identifies the physical type, while instances of Type class identify the logical types. Each logical type maps to a single physical type, but a single physical type may be used by multiple logical types. The set of physical types is fixed and cannot be extended by a Velox application. Logical types are extensible, i.e. an application can introduce custom logical types using one of the predefined physical types.
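The distinction can be illustrated with a minimal sketch. The names `PhysicalKind`, `LogicalType`, `varcharType`, and `jsonType` below are hypothetical stand-ins and not the actual Velox classes; the sketch only assumes that a logical type carries a name plus exactly one physical kind.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-ins for illustration only; these are not the Velox classes.
enum class PhysicalKind { kInteger, kBigint, kReal, kStringView };

class LogicalType {
 public:
  LogicalType(std::string name, PhysicalKind kind)
      : name_(std::move(name)), kind_(kind) {}

  const std::string& name() const { return name_; }
  PhysicalKind kind() const { return kind_; }

  // Two logical types are the same only if their names match;
  // sharing a physical kind is not enough.
  bool equivalent(const LogicalType& other) const {
    return name_ == other.name_;
  }

 private:
  std::string name_;
  PhysicalKind kind_;
};

// A built-in and a custom logical type sharing one physical representation.
inline LogicalType varcharType() {
  return {"VARCHAR", PhysicalKind::kStringView};
}
inline LogicalType jsonType() {
  return {"JSON", PhysicalKind::kStringView};
}
```

In this sketch `varcharType()` and `jsonType()` report the same `kind()` but are not `equivalent`, which is the many-to-one relationship described above.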

At the moment, there are multiple places in the code base that assume a 1:1 mapping between physical and logical types. This assumption causes bugs in the presence of custom (application specific) types, e.g. JSON, HYPERLOGLOG, TIMESTAMP WITH TIME ZONE.

The following APIs are affected:

variant::inferType - maps TypeKind (physical type) to an instance of a Type class (logical type)
CppToType template - maps C++ type (physical type) to an instance of a Type class (logical type)
ConstantVector and FlatVector constructors that do not take Type parameter, but rather infer it from the C++ value type using CppToType template.

I suggest removing these APIs and updating the call sites to ensure we do not infer logical types from the physical types.
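A rough sketch of why such inference is lossy, using made-up names (`ToyType`, `inferFromCppType`, `ToyFlatVector`) rather than the real Velox templates: a constructor that guesses the type from the C++ value type cannot tell a JSON column from a VARCHAR column, while a constructor that takes the type explicitly can.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of the problem; none of these names are Velox APIs.
struct ToyType {
  std::string name;  // logical type name, e.g. "VARCHAR" or "JSON"
};

// Inference from the C++ value type alone: every std::string column becomes
// VARCHAR because the C++ type carries no logical information.
template <typename T>
ToyType inferFromCppType();

template <>
inline ToyType inferFromCppType<std::string>() {
  return ToyType{"VARCHAR"};
}

template <typename T>
class ToyFlatVector {
 public:
  // Unsafe: the logical type is guessed from T.
  explicit ToyFlatVector(std::vector<T> values)
      : type_(inferFromCppType<T>()), values_(std::move(values)) {}

  // Safe: the caller states the logical type explicitly.
  ToyFlatVector(ToyType type, std::vector<T> values)
      : type_(std::move(type)), values_(std::move(values)) {}

  const ToyType& type() const { return type_; }
  std::size_t size() const { return values_.size(); }

 private:
  ToyType type_;
  std::vector<T> values_;
};
```

Constructing a `ToyFlatVector<std::string>` of JSON documents through the first constructor silently labels it VARCHAR; only the second constructor preserves the intended logical type.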

I would also suggest that we remove some of the TypeKind enum values: VARBINARY, DATE, INTERVAL_DAY_TIME, SHORT_DECIMAL, LONG_DECIMAL. I believe these were added in an attempt to allow for deterministic mapping from physical type to logical type, but that mapping is not possible anyway. I would then suggest updating the corresponding logical types to use the physical types as follows:

VARBINARY -> VARCHAR (StringView)
DATE -> INTEGER
INTERVAL_DAY_TIME -> BIGINT
SHORT_DECIMAL -> BIGINT
LONG_DECIMAL -> HUGEINT (128-bit integer, a new physical type)
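The proposed mapping can be written down as a small lookup table. This is only a sketch: `PhysicalKind`, `proposedMapping`, and `logicalTypesBackedBy` are illustrative names, and the enum is a toy, not Velox's TypeKind.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative only: a toy enum that mirrors the proposal.
enum class PhysicalKind { kInteger, kBigint, kHugeint, kVarchar };

// Proposed logical-to-physical mapping, copied from the list above.
inline const std::map<std::string, PhysicalKind>& proposedMapping() {
  static const std::map<std::string, PhysicalKind> kMapping = {
      {"VARBINARY", PhysicalKind::kVarchar},
      {"DATE", PhysicalKind::kInteger},
      {"INTERVAL_DAY_TIME", PhysicalKind::kBigint},
      {"SHORT_DECIMAL", PhysicalKind::kBigint},
      {"LONG_DECIMAL", PhysicalKind::kHugeint},
  };
  return kMapping;
}

// Counts how many logical types share a physical kind; any count above one
// means the reverse (physical -> logical) mapping is ambiguous.
inline int logicalTypesBackedBy(PhysicalKind kind) {
  int count = 0;
  for (const auto& entry : proposedMapping()) {
    if (entry.second == kind) {
      ++count;
    }
  }
  return count;
}
```

Here two logical types already share the BIGINT kind, which shows why a physical kind cannot be mapped back to a unique logical type.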

CC: @majetideepak @aditi-pandit @karteekmurthys @spershin @kKPulla @pedroerp @bikramSingh91 @laithsakka @kagamiori @xiaoxmeng

pedroerp · 2023-02-17T16:02:56Z

pedroerp
Feb 17, 2023
Collaborator

I vote +1.

I bumped into a few bugs because of this in the past as well. A few comments/questions:

I wonder if we should also rename TypeKind to something like PhysicalTypeKind to make the intention a bit clearer?
Additionally, since opaque is now a physical-only type, should we also consider making OpaqueType abstract (so that it can't be instantiated without being specialized)?
Should we also consider renaming VARCHAR to something like STRING, BUFFER, or BLOB (since the VARCHAR vs. VARBINARY vs. other string/buffer types are purely logical)?
It would also be useful to extend our documentation on Types to add this information, in addition to more details about the new improved UDT support, plus the idea that users shouldn't think of opaque types as logical types anymore.

Thanks for looking into this!

2 replies

mbasmanova Feb 17, 2023
Collaborator Author

@pedroerp I agree with all the points above. One caveat about renaming TypeKind: it is so widely used that I do not think we'll have the bandwidth to take on this rename.

pedroerp Feb 17, 2023
Collaborator

Maybe we could do it incrementally by creating a type alias and slowly moving the callsites, but agreed that there isn't any urgency on that.
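A minimal sketch of the alias approach, assuming the new name would be `PhysicalTypeKind` as floated earlier in the thread; the enum here is a three-value stand-in for the real one and `isIntegral` is a made-up call site.

```cpp
#include <cassert>
#include <type_traits>

// Stand-in for the existing enum; the real one has many more values.
enum class TypeKind { INTEGER, BIGINT, VARCHAR };

// Hypothetical new name introduced as an alias, so old and new call sites
// compile side by side while the rename proceeds.
using PhysicalTypeKind = TypeKind;

static_assert(
    std::is_same<PhysicalTypeKind, TypeKind>::value,
    "alias must name the same type");

// An already-migrated call site that uses the new spelling.
inline bool isIntegral(PhysicalTypeKind kind) {
  return kind == PhysicalTypeKind::INTEGER || kind == PhysicalTypeKind::BIGINT;
}
```

Because the alias names the same type, callers that still pass `TypeKind::BIGINT` keep compiling, and the old name can be retired once no call site uses it.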

xiaoxmeng · 2023-02-17T18:27:30Z

xiaoxmeng
Feb 17, 2023
Collaborator

+1. Kudu (a realtime analytical storage system) has similar logical and physical type abstractions (code link1, 2). The logical type describes how to interpret user data, while the physical type determines the data encoding on storage. There can be more than one logical data type mapped to the same physical type, like STRING mapped to BINARY and TIMESTAMP/DECIMAL64 mapped to INT64. A logical type such as DECIMAL64 has additional column attributes, like precision and scale, specified in the table column schema.

1 reply

bikramSingh91 Feb 17, 2023
Collaborator

@xiaoxmeng That's a great pointer. I personally had my first exposure to this concept while working on Parquet and feel like their spec is a good point of reference: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

aditi-pandit · 2023-02-17T18:35:31Z

aditi-pandit
Feb 17, 2023
Collaborator

Thanks @mbasmanova and @pedroerp. Agree with your points.

Separating the PhysicalType and LogicalType will definitely make the type system clearer. The current code uses TypeKind and Type very loosely even though they represent different properties, which makes the code contrived in many places.

Making the OPAQUE type abstract also helps clarify the intent. We were having some internal discussions about whether it could be used as a placeholder type if there is a Presto Java type not supported in Velox. But maybe it would be cleaner to have a special type in presto_cpp to handle this without bringing that logic into Velox.

Adding time and time with timezone types is relatively high-priority at Ahana and there are a few engineers who are thinking of taking it on. We could plan some of the refactoring together if you all are open to it.

4 replies

pedroerp Feb 17, 2023
Collaborator

> Adding time and time with timezone types is relatively high-priority at Ahana and there are a few engineers who are thinking of taking it on. We could plan some of the refactoring together if you all are open to it.

@mbasmanova and @laithsakka have made tons of progress recently on improving support for UDTs, so it should be relatively straightforward to create these as actual extended types.

> We were having some internal discussions about whether it could be used as a placeholder type if there is a Presto Java type not supported in Velox. But maybe it would be cleaner to have a special type in presto_cpp to handle this without bringing that logic into Velox.

I think the idea moving forward is to prevent users from using opaque as a logical type, and force them to create actual UDTs. Opaque could still be used by the UDT, but only as a physical type.

mbasmanova Feb 17, 2023
Collaborator Author

> Adding time and time with timezone types is relatively high-priority at Ahana

@aditi-pandit These types can use TypeKind::BIGINT as the physical type.

@mbasmanova, @pedroerp
Yes, for Time we need to store the value, which is ms from midnight of epoch (like DATE, which was the number of days from epoch). So the BIGINT physical type will suffice.

Unlike DATE, where we added a new TypeKind (that needed additional integration), we do indeed have more streamlined approaches now. Looking at the new UDT code, I see two patterns:
i) Like JSONType, which derives from VarcharType and has a custom type factory backing for other usage. We could write Time deriving from BigintType.
ii) Another option is like FancyIntType in https://github.com/facebookincubator/velox/blob/main/velox/expression/tests/CustomTypeTest.cpp, making it a full UDT deriving from OpaqueType.

It seems like i) is the style for a basic scalar type, but ii) seems more flexible for adding functions. So I am leaning toward ii).

What is the recommendation from your side?

majetideepak Feb 17, 2023
Collaborator

CustomTypes are cumbersome since we need to look up the TypeFactories at all the places where types are checked.
My vote is for ii) as well.

mbasmanova Feb 17, 2023
Collaborator Author

@aditi-pandit I suggest going with approach (i) and extending BigintType. OpaqueType should be used in rare cases where you need to store large objects using std::shared_ptr.
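A toy sketch of what extending a BIGINT-backed type might look like. `ToyBigintType` and `ToyTimeType` are hypothetical and do not reproduce the Velox class hierarchy or its custom type factory; the sketch assumes the stored value is a non-negative count of milliseconds since midnight.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Toy base class standing in for a BIGINT-backed logical type.
class ToyBigintType {
 public:
  virtual ~ToyBigintType() = default;
  virtual std::string name() const { return "BIGINT"; }
  // Default rendering: the raw 64-bit value.
  virtual std::string valueToString(int64_t value) const {
    return std::to_string(value);
  }
};

// Hypothetical TIME type: same physical storage, different interpretation.
// The stored value is assumed to be milliseconds since midnight.
class ToyTimeType : public ToyBigintType {
 public:
  std::string name() const override { return "TIME"; }

  std::string valueToString(int64_t millis) const override {
    const long long hours = millis / 3600000;
    const long long minutes = (millis / 60000) % 60;
    const long long seconds = (millis / 1000) % 60;
    const long long ms = millis % 1000;
    char buffer[32];
    std::snprintf(
        buffer, sizeof(buffer), "%02lld:%02lld:%02lld.%03lld",
        hours, minutes, seconds, ms);
    return buffer;
  }
};
```

The same 64-bit value renders as a plain number through the base class and as a clock time through the derived class, so the storage stays BIGINT while the interpretation lives in the logical type.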

majetideepak · 2023-02-17T21:27:29Z

majetideepak
Feb 17, 2023
Collaborator

There are a bunch of places where the physical type is used to infer the type, right? How will scalar function registrations look with this change, for example?

1 reply

mbasmanova Feb 17, 2023
Collaborator Author

@majetideepak See examples in velox/functions/prestosql/registration/JsonFunctionsRegistration.cpp

karteekmurthys · 2023-02-18T06:27:25Z

karteekmurthys
Feb 18, 2023

> SHORT_DECIMAL -> BIGINT
> LONG_DECIMAL -> HUGEINT (128-bit integer, a new physical type)

The LogicalType must also take care of defining max and min limits. For instance, the max limits of int128 and LONG_DECIMAL are not the same. This is another reason we created the UnscaledXXX structs: to define the max and min limits (-9999... to +9999...).
Similarly, much of the semantics-related logic, such as toString and comparisons, might have to be supplied by the LogicalType.
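The point about limits can be sketched as a range check. This is an illustration under assumptions, not the UnscaledXXX implementation: it assumes a decimal of precision p admits unscaled values from -(10^p - 1) to 10^p - 1, and it relies on the `__int128` compiler extension. The helper names are made up.

```cpp
#include <cassert>
#include <cstdint>

// __int128 is a GCC/Clang extension, used here only for illustration.
using int128_t = __int128;

// Largest unscaled value of a decimal with the given precision: 10^p - 1.
inline int128_t maxUnscaledValue(int precision) {
  assert(precision >= 1 && precision <= 38);
  int128_t power = 1;
  for (int i = 0; i < precision; ++i) {
    power *= 10;
  }
  return power - 1;
}

// A logical-type check: the value must fit the declared precision, which is
// stricter than fitting the physical integer.
inline bool fitsPrecision(int128_t unscaled, int precision) {
  const int128_t limit = maxUnscaledValue(precision);
  return unscaled >= -limit && unscaled <= limit;
}
```

For example, the largest 64-bit integer fits the BIGINT physical type but does not fit a decimal of precision 18, so the check has to live with the logical type.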

1 reply

majetideepak Feb 18, 2023
Collaborator

The DecimalType / UnscaledLong/ShortDecimal will stay. The special decimal TYPE_KIND will be removed as it serves no purpose with this change.

mbasmanova · 2023-04-14T13:10:29Z

mbasmanova
Apr 14, 2023
Collaborator Author

I migrated INTERVAL_DAY_TIME to logical type in #4406. In the process, I discovered an issue with migrating decimal to logical type.

Decimal type is a parametric type with 2 integer parameters: precision and scale.

The physical type used to represent decimal type depends on the value of precision:

BIGINT (64-bit integer) if precision <= 18
HUGEINT (128-bit integer) if precision > 18

ShortDecimalType is an instance of decimal type backed by BIGINT.
LongDecimalType is an instance of decimal type backed by HUGEINT.
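The precision rule can be sketched as a small selector. `PhysicalKind` and `physicalKindForDecimal` are illustrative names, and the upper bound of 38 digits is an assumption about what 128-bit storage can hold, not something stated in this thread.

```cpp
#include <cassert>
#include <stdexcept>

// Toy enum for illustration; not the Velox TypeKind.
enum class PhysicalKind { kBigint, kHugeint };

constexpr int kMaxShortPrecision = 18;
constexpr int kMaxLongPrecision = 38;  // assumed bound for 128-bit storage

// Picks the backing integer width from the decimal's precision.
inline PhysicalKind physicalKindForDecimal(int precision) {
  if (precision < 1 || precision > kMaxLongPrecision) {
    throw std::invalid_argument("precision out of range");
  }
  return precision <= kMaxShortPrecision ? PhysicalKind::kBigint
                                         : PhysicalKind::kHugeint;
}
```

Because the choice depends on a type parameter rather than on the type name, decimal(18, 2) and decimal(19, 2) share one logical signature but land on different physical types.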

Type signature should be the same for all decimal types:

decimal(p, s)

where p and s are integers.

Currently, there are 3 type signatures:

decimal(p, s)
short_decimal(p, s)
long_decimal(p, s)

The extra short_decimal and long_decimal type signatures are generated by simple functions used for comparisons.

A between(a, b) function should have a signature

(decimal(p, s), decimal(p, s)) -> boolean

However, currently, there are 2 signatures:

(short_decimal(p, s), short_decimal(p, s)) -> boolean
(long_decimal(p, s), long_decimal(p, s)) -> boolean

These signatures correspond to 2 separate implementations: one for short decimals and one for long decimals. These signatures are generated by the SimpleFunctionAdapter framework.

template <typename T>
struct BetweenFunction {
  template <typename TInput>
  FOLLY_ALWAYS_INLINE void call(
      bool& result,
      const TInput& value,
      const TInput& low,
      const TInput& high) {
    result = value >= low && value <= high;
  }
};

  registerFunction<
      BetweenFunction,
      bool,
      UnscaledShortDecimal,
      UnscaledShortDecimal,
      UnscaledShortDecimal>({prefix + "between"});

  registerFunction<
      BetweenFunction,
      bool,
      UnscaledLongDecimal,
      UnscaledLongDecimal,
      UnscaledLongDecimal>({prefix + "between"});

I looked into what happens if we make SimpleFunctionAdapter generate “duplicate” signatures:

(decimal(p, s), decimal(p, s)) -> boolean

I found that SimpleFunctionRegistry uses name + signature as a key in a map of implementations. Therefore, it is not possible to store 2 implementations that have the same signature.

  void registerFunctionInternal(
      const std::string& name,
      const std::shared_ptr<const Metadata>& metadata,
      const typename FunctionEntry<Function, Metadata>::FunctionFactory&
          factory) {
    const auto sanitizedName = sanitizeName(name);
    SignatureMap& signatureMap = registeredFunctions_[sanitizedName];
    signatureMap[*metadata->signature()] =
        std::make_unique<const FunctionEntry<Function, Metadata>>(
            metadata, factory);
  }
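The overwrite behavior can be reproduced with a toy registry. `ToyRegistry` is loosely modeled on the snippet above and is not the Velox class; signatures are plain strings here, and each implementation just returns a label so the surviving one can be identified.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy registry for illustration; not the Velox SimpleFunctionRegistry.
class ToyRegistry {
 public:
  using Impl = std::function<std::string()>;

  void registerFunction(
      const std::string& name, const std::string& signature, Impl impl) {
    // operator[] assignment replaces any entry stored under the same key.
    functions_[name][signature] = std::move(impl);
  }

  std::size_t numImplementations(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? 0 : it->second.size();
  }

  std::string invoke(
      const std::string& name, const std::string& signature) const {
    return functions_.at(name).at(signature)();
  }

 private:
  std::map<std::string, std::map<std::string, Impl>> functions_;
};
```

Registering a "short" and then a "long" implementation under the same name and the same signature string leaves a single entry, and invoking it runs the one registered last.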

I discussed with @laithsakka whether it is possible to enhance the registry to allow for storing multiple implementations that have the same logical signature (decimal(p, s)), but different physical signatures (short_decimal vs. long_decimal). We felt that this would be quite difficult and not clear whether this can be supported in a generic way. Therefore, we propose to re-write simple functions that use decimal types using VectorFunction API.
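One way to picture the proposal is a single entry point that is registered once and chooses the physical width at runtime. The sketch below is hypothetical and does not use the VectorFunction API; `decimalBetween` and `betweenImpl` are made-up names, values are passed as 128-bit integers for brevity, and `__int128` is a compiler extension.

```cpp
#include <cassert>
#include <cstdint>

using int128_t = __int128;  // GCC/Clang extension

// Shared comparison kernel, templated on the physical representation.
template <typename T>
bool betweenImpl(T value, T low, T high) {
  return value >= low && value <= high;
}

// Hypothetical single entry point for one logical signature. It inspects the
// precision at runtime and picks the physical width itself, instead of
// relying on the registry to hold one implementation per width.
inline bool decimalBetween(
    int precision, int128_t value, int128_t low, int128_t high) {
  if (precision <= 18) {
    // Short decimals fit in 64 bits, so the narrow kernel is safe.
    return betweenImpl<int64_t>(
        static_cast<int64_t>(value),
        static_cast<int64_t>(low),
        static_cast<int64_t>(high));
  }
  return betweenImpl<int128_t>(value, low, high);
}
```

With this shape the registry needs only one entry per logical signature, and the short versus long split becomes an implementation detail of the function body.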

Thoughts?

1 reply

majetideepak Apr 14, 2023
Collaborator

I spent some time trying to improve the simple function registration/binding and agree with the difficulty level.
We could probably re-use the existing Decimal VectorFunction API to implement these simple functions as well.
CC @karteekmurthys

majetideepak · 2023-07-19T12:50:45Z

majetideepak
Jul 19, 2023
Collaborator

@mbasmanova, @aditi-pandit I came across your discussion about removing support for variant::toJson(TypePtr) in #5585 (comment).
I have been working with @Krishna-Prasad-P-V on fixing issue #5338 in #5406.
Variant only stores the physical type value.
But there is a use case where Velox users might want to interpret a variant string by layering a logical type on top.
We can support both toJson(), toJson(TypePtr) APIs and clarify the semantics. I opened a PR here https://github.com/facebookincubator/velox/pull/5720/files
What are your thoughts?

2 replies

mbasmanova Jul 19, 2023
Collaborator Author

@majetideepak Thank you for looking into this. toJson(TypePtr) API sounds useful. I wonder if we could remove toJson() API because I feel it is error prone. Users will not remember how to use it properly.

majetideepak Jul 19, 2023
Collaborator

Makes sense to keep only toJson(TypePtr).
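A small sketch of why the logical type matters for serialization. `ToyVariant`, `ToyLogicalType`, and this `toJson` are illustrative only and do not mirror the real variant API or its output format; the sketch assumes a VARCHAR value is emitted as a quoted JSON string while a JSON value is embedded unchanged.

```cpp
#include <cassert>
#include <string>

// Toy variant that stores only a physical string value.
struct ToyVariant {
  std::string value;
};

// Hypothetical logical types that share the string physical type.
enum class ToyLogicalType { kVarchar, kJson };

// Serializes the value according to the logical type supplied by the caller.
// Escaping is limited to quotes and backslashes to keep the sketch small.
inline std::string toJson(const ToyVariant& v, ToyLogicalType type) {
  if (type == ToyLogicalType::kJson) {
    return v.value;  // already a JSON document; embed as-is
  }
  std::string out = "\"";
  for (char c : v.value) {
    if (c == '"' || c == '\\') {
      out += '\\';
    }
    out += c;
  }
  out += '"';
  return out;
}
```

The same stored bytes produce two different outputs depending on the type passed in, which is the ambiguity a type-less toJson() would have to resolve by guessing.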
