Support fine grained timezone checker instead of type based [databricks] #9719
Conversation
You can use the checker in this PR instead of #9482
build
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala (outdated review thread, resolved)
Tried locally using #9773. Will come up with a fix.
=========================== short test summary info ============================
184 failed, 8690 passed, 994 skipped, 1520 xfailed, 10569 xpassed, 95 warnings in 14961.58s (4:09:21)
build
build
build
Fix for 341
build
I fixed those issues locally. So far, Spark 311, 320, 330, 340, 341, and 350 have been verified locally and passed. Please help take a further look.
@pytest.mark.skipif(is_before_spark_341(), reason='`TIMESTAMP_NTZ` is only supported in PySpark 341+')
@pytest.mark.parametrize('date_format', csv_supported_date_formats)
@pytest.mark.parametrize('ts_part', csv_supported_ts_parts)
@pytest.mark.parametrize("timestamp_type", [
    pytest.param('TIMESTAMP_LTZ', marks=pytest.mark.xfail(is_spark_350_or_later(), reason="https://github.com/NVIDIA/spark-rapids/issues/9325")),
    "TIMESTAMP_NTZ"])
@pytest.mark.xfail(is_not_utc(), reason='Timezone is not supported for csv format as https://github.com/NVIDIA/spark-rapids/issues/9653.')
Why is this going back to an xfail instead of allow_non_utc?
This is because the csv_infer_schema_timestamp_ntz method contains a class-existence capture assert, not just an allow_non_utc case. Since other test cases already cover the timezone behavior for CSV, we xfail it instead of making the original test case too complex.
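For illustration only, a minimal runnable sketch of the conditional-xfail pattern under discussion; `is_not_utc` here is a stub standing in for the real integration-test helper, and the test body is a placeholder:

```python
import pytest

def is_not_utc():
    # Stub: the real helper inspects the test session's timezone setting.
    return False

# Marking the whole test xfail under non-UTC is coarse-grained: it also
# bypasses the class-existence capture assert inside the test body, which
# is the trade-off being discussed in this thread.
@pytest.mark.xfail(is_not_utc(),
                   reason='Timezone is not supported for csv format as '
                          'https://github.com/NVIDIA/spark-rapids/issues/9653.')
def test_csv_infer_schema_ntz_sketch():
    assert True  # placeholder for the real capture-based assertion
```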
@pytest.mark.skipif(is_before_spark_340(), reason='`preferDate` is only supported in Spark 340+')
@pytest.mark.xfail(is_not_utc(), reason='Timezone is not supported for csv format as https://github.com/NVIDIA/spark-rapids/issues/9653.')
Same here. Why xfail?
Same reason as above: the csv_infer_schema_timestamp_ntz method contains a class-existence capture assert rather than an allow_non_utc case, and other test cases already cover the timezone behavior for CSV.
So then how does #9653 cover that? I don't see CSV or JSON mentioned in there at all, and I don't see any mention of timestamp_ntz either. I just want to be 100% sure that we don't end up dropping something and getting data corruption that we missed because of an xfail. I would rather see a complex test, or at a minimum an issue filed to do the test correctly.
Updated by refining the test cases.
@@ -204,20 +207,20 @@ def test_json_ts_formats_round_trip(spark_tmp_path, date_format, ts_part, v1_ena
 conf=updated_conf)

 @allow_non_gpu('FileSourceScanExec', 'ProjectExec')
-@pytest.mark.skipif(is_before_spark_341(), reason='`TIMESTAMP_NTZ` is only supported in PySpark 341+')
+@pytest.mark.skipif(is_before_spark_341(), reason='`TIMESTAMP_NTZ` is only supported in PySpark 341+.')
+@pytest.mark.xfail(is_not_utc(), reason='Timezone is not supported for json format as https://github.com/NVIDIA/spark-rapids/issues/9653.')
Here too, why is there still an xfail? And below for the other JSON tests?
Similar reason as above: the json_ts_formats_round_trip_ntz method uses assert_cpu_and_gpu_are_equal_collect_with_capture.
build
build
exist_classes = 'Gpu' + cpu_scan_class,
non_exist_classes = cpu_scan_class,
conf = conf)
if is_not_utc(): # non-UTC is not supported for csv, skip capture check
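To make the partial fragment above concrete, here is a self-contained sketch of the branch being discussed, with stand-in stubs for the integration-test helpers (the real helper names and signatures in the spark-rapids integration_tests framework may differ):

```python
import os

def is_not_utc():
    # Stub: assume the session timezone comes from the TZ environment variable.
    return os.environ.get('TZ', 'UTC') != 'UTC'

def assert_gpu_and_cpu_are_equal_collect(do_read, conf):
    # Stub: only compares CPU and GPU results.
    print('comparing CPU vs GPU results')

def assert_cpu_and_gpu_are_equal_collect_with_capture(
        do_read, exist_classes, non_exist_classes, conf):
    # Stub: compares results AND asserts which scan class appeared in the plan.
    print('comparing results; expecting %s, forbidding %s'
          % (exist_classes, non_exist_classes))

def check_scan(do_read, cpu_scan_class, conf):
    if is_not_utc():
        # Non-UTC: the scan falls back to CPU, so skip the plan-capture check
        # and only compare results (hence the request for a follow-on issue).
        assert_gpu_and_cpu_are_equal_collect(do_read, conf=conf)
    else:
        assert_cpu_and_gpu_are_equal_collect_with_capture(
            do_read,
            exist_classes='Gpu' + cpu_scan_class,
            non_exist_classes=cpu_scan_class,
            conf=conf)

check_scan(lambda spark: None, 'FileSourceScanExec', conf={})
```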
Can we please file a follow-on issue to come back and look at this, and the JSON case too?
build
Pre-merge failed due to #9916. Trying to rerun it.
build
build
build
The failed case is irrelevant. Filed another ticket: #9923.
Changes include:
(1) Provide a fine-grained checker instead of a type-based one. It closes #6832.
(2) Verify the new check using the current_date function. It closes #9805, which contributes to #6841.
For RapidsMeta, there are 6 types. Scans such as Parquet/Avro are controlled by spark.sql.parquet.datetimeRebaseModeInRead, and in the non-UTC case we will throw an exception ([FEA] Take into account org.apache.spark.timeZone in Parquet/Avro from Spark 3.2, #9632).

For the ExprMeta timezone check, it follows the workflow below (a rough sketch follows the list):
1. Check needTimezoneCheck: whether the expression needs a timezone check at all.
2. Check the toggle flag spark.rapids.sql.nonUtc.enabled. If yes, we pass to the next-level timezone check. If not, we only pass the UTC case as before.
3. Check the toggle flag isTimezoneSupported for this expression. If false, fall back to the UTC-only check as before. If yes, move to the next-level check. When we add timezone support for a related function, isTimezoneSupported should be overridden as true. This check happens within …
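A rough, Python-flavored sketch of that three-level flow (the real check lives in Scala, e.g. in TypeChecks.scala; the names below mirror needTimezoneCheck, spark.rapids.sql.nonUtc.enabled, and isTimezoneSupported but are illustrative stand-ins, not the plugin's actual API):

```python
UTC = 'UTC'

def timezone_check_passes(need_timezone_check, non_utc_enabled,
                          is_timezone_supported, session_timezone):
    # Level 1: expressions that never touch timezone semantics skip the check.
    if not need_timezone_check:
        return True
    # Level 2: with spark.rapids.sql.nonUtc.enabled off, only UTC passes,
    # preserving the old type-based behavior.
    if not non_utc_enabled:
        return session_timezone == UTC
    # Level 3: per-expression opt-in; overridden to True once timezone
    # support lands for that expression.
    if not is_timezone_supported:
        return session_timezone == UTC
    return True  # hand off to the next-level, fine-grained check

# Example: an expression with support added, toggle on, non-UTC session.
print(timezone_check_passes(True, True, True, 'Asia/Shanghai'))   # True
print(timezone_check_passes(True, False, True, 'Asia/Shanghai'))  # False
```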
TODOs: