[FEA] New kernel to support parsing dates/timestamps string with a timezone parameter. #1655

Closed
3 tasks
res-life opened this issue Dec 15, 2023 · 3 comments

@res-life
Collaborator

res-life commented Dec 15, 2023

Is your feature request related to a problem? Please describe.
ToUnixTimestamp, GetTimestamp, and other operators take a format parameter and are time-zone-aware. See the linked Spark code:

val formatter = formatterOption.getOrElse(getFormatter(fmt.toString))
formatter.parse()

I think our GPU implementation currently does not support non-UTC time zones:

def parseStringAsTimestamp(

When the time zone is Asia/Shanghai, to_timestamp("1970-01-01 00:00:00", "yyyy-MM-dd HH:mm:ss") gets negative 8 hours instead of zero.
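For reference, here is a minimal CPU-side sketch of how the session time zone changes the parsed instant, using only `java.time`. The class and method names (`SessionZoneParseDemo`, `epochSeconds`) are made up for illustration and are not part of Spark or spark-rapids. It only shows that the same string maps to different epoch values depending on the zone in which the wall-clock fields are interpreted.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class, for illustration only.
class SessionZoneParseDemo {
    // Parse `text` with `pattern`, interpret the wall-clock fields in `zone`,
    // and return seconds since 1970-01-01T00:00:00Z.
    static long epochSeconds(String text, String pattern, ZoneId zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        LocalDateTime local = LocalDateTime.parse(text, fmt);
        return local.atZone(zone).toInstant().getEpochSecond();
    }

    public static void main(String[] args) {
        String text = "1970-01-01 00:00:00";
        String pattern = "yyyy-MM-dd HH:mm:ss";
        // Interpreted as UTC, the string is exactly the epoch.
        System.out.println(epochSeconds(text, pattern, ZoneId.of("UTC")));           // 0
        // Interpreted as Asia/Shanghai (UTC+8), it is 8 hours before the epoch.
        System.out.println(epochSeconds(text, pattern, ZoneId.of("Asia/Shanghai"))); // -28800
    }
}
```

A parser that ignores the zone produces the first value for every session time zone, which is the 8-hour discrepancy discussed here.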

Describe the solution you'd like
Expose the timeparts structure (see the code below).
Then rebase the local time in a given time zone to UTC time. Alfred is working on this.

https://github.com/rapidsai/cudf/blob/v24.02.00a/cpp/src/strings/convert/convert_datetime.cu#L399-L401

    auto const timeparts = parse_into_parts(d_str);

    return T{T::duration(timestamp_from_parts(timeparts))};

  • We can update the cudf code.
  • We can copy the cudf code. I prefer this option.
    We only support a limited set of formats:
yyyy-MM-dd
yyyy/MM/dd
yyyy-MM
yyyy/MM
dd/MM/yyyy
yyyy-MM-dd HH:mm:ss
MM-dd
MM/dd
dd-MM
dd/MM
MM/yyyy
MM-yyyy
MM/dd/yyyy
MM-dd-yyyy
MMyyyy

We can add more supported formats in the future.

@res-life
Collaborator Author

As Haoyang said, we can handle this issue together with NVIDIA/spark-rapids#10032

@res-life
Collaborator Author

res-life commented Jan 4, 2024

Update:
After syncing up with @NVnavkumar, we may have a simple solution that uses GpuTimeZoneDB directly.
We first parse the string as if it were in UTC to get the microseconds since 1970-01-01 00:00:00 UTC (the instant).
Then we use GpuTimeZoneDB to rebase those microseconds according to the time zone.

For example: Parse("1970-01-01 00:00:00", 'yyyy-MM-dd HH:mm:ss') when the session time zone is Asia/Shanghai.

  • First get the microseconds by treating "1970-01-01 00:00:00" as a UTC time. Here the result is 0.
  • Then use GpuTimeZoneDB.fromUTC(cv, tz) or GpuTimeZoneDB.toUTC(cv, tz) to rebase the microseconds to the value we want.
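The two steps above can be sketched on the CPU with `java.time`. This is only an analogue under stated assumptions, not the GpuTimeZoneDB implementation: the class `TwoStepRebaseSketch` and the helpers `utcMicros` and `rebaseLocalToUtc` are hypothetical names, and the sketch works on a single value rather than a column vector. It does not claim which of `fromUTC` or `toUTC` corresponds to the rebase step.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical CPU-side analogue of the two-step approach, for illustration only.
class TwoStepRebaseSketch {
    // Step 1: parse the string as if it were a UTC time and return
    // microseconds since 1970-01-01 00:00:00 UTC.
    static long utcMicros(String text, String pattern) {
        LocalDateTime local = LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern));
        return ChronoUnit.MICROS.between(Instant.EPOCH, local.toInstant(ZoneOffset.UTC));
    }

    // Step 2: reinterpret those microseconds as a wall-clock time in `zone`
    // and return the microseconds of the corresponding UTC instant.
    static long rebaseLocalToUtc(long localMicros, ZoneId zone) {
        // floorDiv/floorMod keep the split correct for values before the epoch.
        long secs = Math.floorDiv(localMicros, 1_000_000L);
        int nanos = (int) Math.floorMod(localMicros, 1_000_000L) * 1_000;
        LocalDateTime wallClock = LocalDateTime.ofEpochSecond(secs, nanos, ZoneOffset.UTC);
        Instant instant = wallClock.atZone(zone).toInstant();
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    public static void main(String[] args) {
        long step1 = utcMicros("1970-01-01 00:00:00", "yyyy-MM-dd HH:mm:ss");
        long step2 = rebaseLocalToUtc(step1, ZoneId.of("Asia/Shanghai"));
        System.out.println(step1); // 0
        System.out.println(step2); // -28800000000
    }
}
```

Splitting the work this way keeps the string parser time-zone-agnostic, so all zone rules (offsets and their historical transitions) live in one place.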

I think NVIDIA/spark-rapids#10100 will close this; we only need to double-check.
@thirtiseven If the above works, then we only need to handle NVIDIA/spark-rapids#10032

@thirtiseven
Collaborator

The simple solution works. Closing this and letting NVIDIA/spark-rapids#10032 track the remaining parts.

@thirtiseven closed this as not planned on Jan 25, 2024