[BUG] test_date_format_for_time and test_date_format_maybe_incompat failed in non-utc job #10083

Open
2 tasks
abellina opened this issue Dec 20, 2023 · 10 comments
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@abellina
Collaborator

abellina commented Dec 20, 2023

11:47:30  =========================== short test summary info ============================
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_for_time[Timestamp-yyyy-MM-dd][DATAGEN_SEED=1703089300, INJECT_OOM] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy-MM-dd)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_for_time[Timestamp-yyyy-MM][DATAGEN_SEED=1703089300] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy-MM)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_for_time[Timestamp-yyyy/MM/dd][DATAGEN_SEED=1703089300, INJECT_OOM] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy/MM/dd)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_for_time[Timestamp-yyyy/MM][DATAGEN_SEED=1703089300] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy/MM)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_for_time[Timestamp-dd/MM/yyyy][DATAGEN_SEED=1703089300] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, dd/MM/yyyy)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_maybe_incompat[Timestamp-dd-MM-yyyy][DATAGEN_SEED=1703089300, INJECT_OOM, ALLOW_NON_GPU(ProjectExec,FilterExec,FileSourceScanExec,BatchScanExec,CollectLimitExec,DeserializeToObjectExec,DataWritingCommandExec,WriteFilesExec,ShuffleExchangeExec,ExecutedCommandExec)] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, dd-MM-yyyy)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_maybe_incompat[Timestamp-yyyy-MM-dd HH:mm:ss.SSS][DATAGEN_SEED=1703089300, ALLOW_NON_GPU(ProjectExec,FilterExec,FileSourceScanExec,BatchScanExec,CollectLimitExec,DeserializeToObjectExec,DataWritingCommandExec,WriteFilesExec,ShuffleExchangeExec,ExecutedCommandExec)] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy-MM-dd HH:mm:ss.SSS)']
11:47:30  FAILED ../../src/main/python/date_time_test.py::test_date_format_maybe_incompat[Timestamp-yyyy-MM-dd HH:mm:ss.SSSSSS][DATAGEN_SEED=1703089300, ALLOW_NON_GPU(ProjectExec,FilterExec,FileSourceScanExec,BatchScanExec,CollectLimitExec,DeserializeToObjectExec,DataWritingCommandExec,WriteFilesExec,ShuffleExchangeExec,ExecutedCommandExec)] - AssertionError: GPU and CPU string values are different at [6, 'date_format(a, yyyy-MM-dd HH:mm:ss.SSSSSS)']
11:47:30  = 8 failed, 19084 passed, 2566 skipped, 380 xfailed, 412 xpassed, 883 warnings in 5149.70s

It looks like the formatted row is getting an extra + sign:

11:47:30  -Row(date_format(a, yyyy-MM-dd HH:mm:ss.SSSSSS)='+10000-01-01 03:29:59.999999')
11:47:30  +Row(date_format(a, yyyy-MM-dd HH:mm:ss.SSSSSS)='0000-01-01 03:29:59.999999')

After #10032 is fixed, please update the following tests to use the full time range. The max time is currently 9999-12-30, not 9999-12-31.

  • test_date_format
    Change the time range to the full range.

  • test_from_unixtime
    Update it to use 9999-12-31 so that the time range is the full range. It is currently parametrized as:

@pytest.mark.parametrize('data_gen', [LongGen(min_val=int(datetime(1, 2, 1).timestamp()), max_val=int(datetime(9999, 12, 30).timestamp()))], ids=idfn)
test_from_unixtime
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Dec 20, 2023
@NVnavkumar NVnavkumar self-assigned this Dec 20, 2023
@NVnavkumar
Collaborator

Confirmed. This reproduces via:

TZ=Iran ./run_pyspark_from_build.sh -k 'test_date_format_for_time or test_date_format_maybe_incompat'

The seed value doesn't seem to matter.

@NVnavkumar NVnavkumar assigned NVnavkumar and res-life and unassigned NVnavkumar Dec 20, 2023
@ttnghia
Collaborator

ttnghia commented Dec 25, 2023

This looks like an overflow of the year value. Year values should not exceed 9999.

11:47:30  -Row(date_format(a, yyyy-MM-dd HH:mm:ss.SSSSSS)='+10000-01-01 03:29:59.999999')
11:47:30  +Row(date_format(a, yyyy-MM-dd HH:mm:ss.SSSSSS)='0000-01-01 03:29:59.999999')
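A minimal sketch of why only the non-UTC job trips this, assuming TZ=Iran behaves as a fixed +03:30 offset (consistent with the 03:29:59.999999 in the diff above). The helper name `local_year_overflows` is hypothetical and not part of the test suite. It only checks whether the largest four-digit-year timestamp, shifted to local wall-clock time, lands in year 10000.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MICRO = timedelta(microseconds=1)

# Largest timestamp whose year still has four digits, expressed in UTC.
MAX_UTC = datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=timezone.utc)
max_micros = (MAX_UTC - EPOCH) // MICRO

# First microsecond of year 10000, as microseconds since the epoch.
YEAR_10000_MICROS = max_micros + 1


def local_year_overflows(utc_micros: int, utc_offset: timedelta) -> bool:
    """True if the instant, shifted to local wall-clock time, falls in year 10000 or later."""
    return utc_micros + utc_offset // MICRO >= YEAR_10000_MICROS


iran_offset = timedelta(hours=3, minutes=30)  # assumption: fixed offset for TZ=Iran

print(local_year_overflows(max_micros, timedelta(0)))  # False: fine in UTC
print(local_year_overflows(max_micros, iran_offset))   # True: local year is 10000
```

Under the same assumption, a maximum one full day earlier stays in year 9999 for any offset smaller than 24 hours.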

@res-life
Collaborator

res-life commented Dec 26, 2023

cuDF cannot handle years greater than 9999.

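  // Imports needed by the test body:
  import java.time.Instant
  import java.util.concurrent.TimeUnit
  import ai.rapids.cudf.ColumnVector

  test("test year 9999") {
    println("my debug: begin: ")
    // Last second of year 9999 in UTC, as epoch seconds.
    val maxSecond = Instant.parse("9999-12-31T23:59:59Z").getEpochSecond
    // Add 8 hours (in microseconds) to push the timestamp into year 10000.
    val cv = ColumnVector.timestampMicroSecondsFromBoxedLongs(
      maxSecond * TimeUnit.SECONDS.toMicros(1) + TimeUnit.HOURS.toMicros(8))
    // Format on the GPU with the strftime-style pattern.
    val ret = cv.asStrings("%Y-%m-%d")
    val host = ret.copyToHost()
    println(host.getJavaString(0))
    println("my debug: end")
  }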

Output:

my debug: begin: 
0000-01-01
my debug: end

I guess cuDF expects %Y to print exactly 4 digits, so it truncates 10000 to 0000.

Spark output is:

+10000-01-01
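The two behaviors can be contrasted with a small sketch. `format_year_fixed_width` is a hypothetical stand-in for a formatter that always emits exactly four digits. It is a guess at the observable effect, not cuDF's actual code. `format_year_exceeds_pad` mimics what java.time does for the `yyyy` pattern on non-negative years: pad to four digits and prefix `+` once the value is wider than four digits.

```python
def format_year_fixed_width(year: int) -> str:
    """Hypothetical fixed-width formatter: keep only the last four digits."""
    return f"{year % 10000:04d}"


def format_year_exceeds_pad(year: int) -> str:
    """Mimic java.time 'yyyy' for non-negative years: '+' prefix when wider than four digits."""
    return f"+{year}" if year > 9999 else f"{year:04d}"


print(format_year_fixed_width(10000))  # 0000
print(format_year_exceeds_pad(10000))  # +10000
```

Both functions agree for years up to 9999, so the mismatch only shows up once a timestamp crosses into year 10000.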

@res-life res-life added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Dec 26, 2023
@res-life
Collaborator

res-life commented Dec 26, 2023

This depends on cuDF handling years greater than 9999:

cv.asStrings("%Y-%m-%d")

The long value in cv is 253402329599000000L:
253402329599000000L = micros of 9999-12-31T23:59:59 + micros of 8 hours.
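The arithmetic can be checked with a short sketch using only the standard library. The variable names are illustrative, and 9999-12-31T23:59:59 is taken to be UTC, as in the earlier test.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Epoch seconds of 9999-12-31T23:59:59 UTC.
max_second = (datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc) - EPOCH) // timedelta(seconds=1)

# Convert to microseconds and add 8 hours' worth of microseconds.
micros = max_second * 1_000_000 + 8 * 3600 * 1_000_000

print(max_second)  # 253402300799
print(micros)      # 253402329599000000
```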

@ttnghia
Collaborator

ttnghia commented Dec 26, 2023

This is not trivially supported in cuDF, since it requires fixed-width input strings for each field. For example, %Y requires four digits, like 0001.

Ref: https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/src/strings/convert/convert_datetime.cu#L114

@res-life
Collaborator

Thanks @ttnghia

Related to the following issues:
NVIDIA/spark-rapids-jni#1655
#10032

@res-life
Collaborator

The Java API returns +10000-01-01:

    val p = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    val s = p.format(Instant.ofEpochSecond(253402329599L).atZone(ZoneId.of("Asia/Shanghai")).toLocalDate)
    println(s)

@res-life
Collaborator

res-life commented Dec 27, 2023

Temporary fix: #10095
Final fix depends on: #10032

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 27, 2023
@NVnavkumar
Collaborator

@res-life I think from_unixtime (GpuFromUnixTime) will also have the same overflow issue for non-UTC time zones, since it also relies on asStrings(strfFormat), like GpuDateFormatClass.

@res-life
Collaborator

Thanks @NVnavkumar

@pytest.mark.parametrize('data_gen', [LongGen(min_val=int(datetime(1, 2, 1).timestamp()), max_val=int(datetime(9999, 12, 30).timestamp()))], ids=idfn)
test_from_unixtime

Currently, from_unixtime is not tested over the full range, so it does not fail.

I added a sub-task in this issue to track this.
