Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#61 Add infinity support to Dates and Numerics #63

Merged
merged 5 commits into from
Oct 31, 2024

Conversation

Zejnilovic
Copy link
Collaborator

Closes #61

Adds support or infinity replacements on Dates, Timestamps and Numeric fields.

It works before the "spark built-in" infinity replacements that use and has no dependency on allowInfinity metadata field.

New metadata fields are now supported are:

  • minus_infinity_symbol
  • minus_infinity_value
  • plus_infinity_symbol
  • plus_infinity_value

Copy link

github-actions bot commented Oct 9, 2024

JaCoCo code coverage report - scala 2.12.12

File Coverage [94.51%] 🍏
InfinitySupport.scala 100% 🍏
MetadataKeys.scala 95.73% 🍏
TypeParser.scala 94.42% 🍏
Total Project Coverage 87.65% 🍏

@Zejnilovic Zejnilovic self-assigned this Oct 9, 2024
@@ -681,4 +695,27 @@ object TypeParser {
}
}

sealed trait InfinitySupport {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sealed trait?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could and perhaps should be in its own file even.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason is that this is a library and there are not supposed to be used outside. It is supposed to be only used with our defined parsers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code of this trait is written in such a way, that outside usage doesn't really causes issue, if somebody finds it useful - unlikely 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO the "hiding" is important, when the class/trait have special handling needs or has access to more sensitive parts of the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved out

@@ -1 +1 @@
sbt.version=1.6.2
sbt.version=1.10.2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Collaborator Author

@Zejnilovic Zejnilovic Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not. Is it an issue? There was a warning when debuging JaCoCo

@@ -681,4 +695,27 @@ object TypeParser {
}
}

sealed trait InfinitySupport {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could and perhaps should be in its own file even.

@yruslan
Copy link
Contributor

yruslan commented Oct 9, 2024

Here is a problematic use case.

Config:

{
 "pattern": "yyyyMMdd", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "00010101", 
 "plus_infinity_symbol": "99999999", 
 "plus_infinity_value": "99991231"
}

When "minus_infinity_value": "10000101" this works fine. But when "minus_infinity_value": "00010101" it fails to properly convery minus infinity.

I believe this is because this implementation tries to replace all '0' with value 00010101 using the original type, which converts '0' to 10101 since the original type is integer.

In order to have pure solution that covers this and similar use cases maybe we should do it like this (pseudocode):

IF col == plus_inf_symbol THEN plus_inf_value
ELSE IF  col ==  minus_inf_symbol THEN minus_inf_value
ELSE convert(col, pattern)

where plus_inf_value and minus_inf_value are precomputed dates.

It is more complicated, but covers edge cases like this.

Alternatively

You can convert the input column to a string first, then do the replacement, and then do a conversion

In addition (optional, just thinking)

Maybe we can consider specifying minus_infinity_value and plus_infinity_value always in some standard pattern, so that it can represent values that are outside of parsing pattern.

Say, this can solve weird cases like this:

{
 "pattern": "yyyy-MM", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "0001-01-15", 
 "plus_infinity_symbol": "999999", 
 "plus_infinity_value": "9999-12-31"
}

@Zejnilovic
Copy link
Collaborator Author

Here is a problematic use case.

Config:

{
 "pattern": "yyyyMMdd", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "00010101", 
 "plus_infinity_symbol": "99999999", 
 "plus_infinity_value": "99991231"
}

When "minus_infinity_value": "10000101" this works fine. But when "minus_infinity_value": "00010101" it fails to properly convery minus infinity.

I believe this is because this implementation tries to replace all '0' with value 00010101 using the original type, which converts '0' to 10101 since the original type is integer.

In order to have pure solution that covers this and similar use cases maybe we should do it like this (pseudocode):

IF col == plus_inf_symbol THEN plus_inf_value
ELSE IF  col ==  minus_inf_symbol THEN minus_inf_value
ELSE convert(col, pattern)

where plus_inf_value and minus_inf_value are precomputed dates.

It is more complicated, but covers edge cases like this.

Alternatively

You can convert the input column to a string first, then do the replacement, and then do a conversion

In addition (optional, just thinking)

Maybe we can consider specifying minus_infinity_value and plus_infinity_value always in some standard pattern, so that it can represent values that are outside of parsing pattern.

Say, this can solve weird cases like this:

{
 "pattern": "yyyy-MM", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "0001-01-15", 
 "plus_infinity_symbol": "999999", 
 "plus_infinity_value": "9999-12-31"
}

My thoughts were that an integer or numeric columns in general would never come with zeros in the front (i.e. 01010001 or 00010101) irrespective of the infinity support. Infinity support with a solution like that would solve for something that could never happen on simple input data and create an edge case. Thus did not want to include it.

As per @yruslan we will quickly confirm if this is need feature and think of a solution or if this is good enough.

@yruslan
Copy link
Contributor

yruslan commented Oct 21, 2024

I'm okay with the current implementation.

@kevinwallimann
Copy link

Here is a problematic use case.
Config:

{
 "pattern": "yyyyMMdd", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "00010101", 
 "plus_infinity_symbol": "99999999", 
 "plus_infinity_value": "99991231"
}

When "minus_infinity_value": "10000101" this works fine. But when "minus_infinity_value": "00010101" it fails to properly convery minus infinity.
I believe this is because this implementation tries to replace all '0' with value 00010101 using the original type, which converts '0' to 10101 since the original type is integer.
In order to have pure solution that covers this and similar use cases maybe we should do it like this (pseudocode):

IF col == plus_inf_symbol THEN plus_inf_value
ELSE IF  col ==  minus_inf_symbol THEN minus_inf_value
ELSE convert(col, pattern)

where plus_inf_value and minus_inf_value are precomputed dates.
It is more complicated, but covers edge cases like this.

Alternatively

You can convert the input column to a string first, then do the replacement, and then do a conversion

In addition (optional, just thinking)

Maybe we can consider specifying minus_infinity_value and plus_infinity_value always in some standard pattern, so that it can represent values that are outside of parsing pattern.
Say, this can solve weird cases like this:

{
 "pattern": "yyyy-MM", 
 "minus_infinity_symbol": "0", 
 "minus_infinity_value": "0001-01-15", 
 "plus_infinity_symbol": "999999", 
 "plus_infinity_value": "9999-12-31"
}

My thoughts were that an integer or numeric columns in general would never come with zeros in the front (i.e. 01010001 or 00010101) irrespective of the infinity support. Infinity support with a solution like that would solve for something that could never happen on simple input data and create an edge case. Thus did not want to include it.

As per @yruslan we will quickly confirm if this is need feature and think of a solution or if this is good enough.

Tbh, I think the argument doesn't make sense from a user's perspective. Why should a minus infinity symbol of 0 be standardized to a date 1000-01-01 and not 0001-01-01, or even 0000-01-01 for that matter. But we can tell users that the minimal supported date is 1000-01-01, so I'm also ok with the functionality.

Copy link
Contributor

@benedeki benedeki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Zejnilovic Zejnilovic merged commit 8e36e5f into master Oct 31, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ability to set values for infinite numbers for dates and timestamps
5 participants