[BUG] [DB 14.3] `tightBounds` stat in Delta Lake tables is set incorrectly (#12027)
DB 14.3 might be the first time we've run on a platform that supports deletion vectors. From tracing, it appears that this doesn't have to do with the conf value set for …; rather, the failure appears to be the result of the shim not being detected as capable of supporting deletion vectors. I'm still investigating.
I'm seeing some truly baffling behaviour, and discrepancies between successive runs of the following test code:

```scala
spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save("/tmp/gpu_delta_out")
```

I'll hit this afresh tomorrow morning.
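One way to compare those successive runs is to look at the per-file stats recorded in each commit. The following is a minimal sketch (not from the original comments) that reads every commit in the table's `_delta_log` and prints the `stats` of the `add` actions, so the presence or absence of `tightBounds` can be checked run by run; the table path matches the snippet above.

```scala
// Minimal sketch: list the stats recorded by each commit of the table written
// above, so successive runs can be compared for the tightBounds field.
import org.apache.spark.sql.functions.{col, input_file_name}

spark.read.json("/tmp/gpu_delta_out/_delta_log/*.json")
  .where(col("add").isNotNull)
  .select(input_file_name().as("commit"), col("add.path"), col("add.stats"))
  .show(truncate = false)
```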
The documentation in the code wasn't the easiest to follow. This looks like an internal configuration for …

The non-deterministic behaviour I was referring to: on some GPU runs, I found that …

For completeness, here is the definition of …
A little bit of progress. The problem is seen when a delta table file is written as follows:

```scala
spark.range(0, 10).toDF("id").coalesce(1).write.format("delta").save("/tmp/delta_test")
```

The reason for this is what happens in the statistics collection code:

```scala
// On file initialization/stat recomputation TIGHT_BOUNDS is always set to true
val tightBoundsColOpt = if (deletionVectorsSupported &&
    !RapidsDeltaUtils.getTightBoundColumnOnFileInitDisabled(spark)) {
  Some(lit(true).as("tightBounds"))
} else {
  None
}
```
The check depends on `deletionVectorsSupported`:

```scala
override val deletionVectorsSupported =
  protocol.isFeatureSupported(DeletionVectorsTableFeature)
```

At some point, we find that the Delta Log protocol ends up being switched. It turns out that this happens seemingly as a side-effect of calling the following constructor:

```scala
def this(deltaLog: DeltaLog, rapidsConf: RapidsConf)(implicit clock: Clock) = {
  this(deltaLog, deltaLog.update(), rapidsConf)
}
```

So shouldn't this failure also happen when running this test on Delta IO (2.4) and vanilla Apache Spark (say 3.4.3)? No, because that platform does not support deletion vectors by default. I have verified that when deletion vectors are enabled, the same problem occurs on Spark 3.4.3, i.e. the stat goes missing when we write with:

```scala
spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", true)
```

The conclusion is that a fix here for Databricks 14.3 should apply uniformly to Spark 3.4.3 as well. We'd have to detect that deletion vectors are supported/enabled on the platform, and write the `tightBounds` stat accordingly.
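To make the Spark 3.4.3 check concrete, here is a minimal sketch (not taken from the issue) that enables deletion vectors by default, writes a small table, and then asks the table's protocol whether the deletion-vectors feature is supported, using the same Delta internals (`DeltaLog`, `DeletionVectorsTableFeature`) referenced in the snippets above. It assumes Delta 2.4 on the classpath; the output path is a placeholder.

```scala
// Sketch of the open-source repro plus a protocol check.
import org.apache.spark.sql.delta.{DeletionVectorsTableFeature, DeltaLog}

// Make newly-created tables default to deletion vectors, as in the comment above.
spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", "true")

spark.range(0, 10).toDF("id").coalesce(1)
  .write.format("delta").save("/tmp/delta_dv_test")

// If this prints true, the writer is expected to also emit tightBounds in the
// per-file stats; the bug is that the GPU write path does not.
val snapshot = DeltaLog.forTable(spark, "/tmp/delta_dv_test").update()
println(snapshot.protocol.isFeatureSupported(DeletionVectorsTableFeature))
```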
Another bit of information: when the same test is run on …

@razajafri, I think we might have come to an incorrect conclusion about the behaviour: we thought that the …
This has to do with the following failing tests, described in #11541.

In all these tests, one sees that a delta-lake table written with `spark-rapids` does not correctly set the `tightBounds` stat in the delta-lake meta files.

Repro

Write a simple delta table out on Databricks 14.3. In the table's `_delta_log/00000000000000000000.json`, one sees the per-file stats, including a `tightBounds` field. The same stats, for a table written from `spark-rapids`, do not include it: the `tightBounds` stat goes missing. Without this stat, the tables can't be deemed equivalent, and the tests fail.
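The original repro snippet and stats JSON are not reproduced above, so here is a hedged reconstruction: it writes the same small table once with the RAPIDS plugin disabled and once enabled, then prints the `stats` strings from each table's first Delta commit so the presence or absence of `tightBounds` can be compared. Toggling `spark.rapids.sql.enabled` to switch between CPU and GPU writes is an assumption on my part, and the paths are placeholders.

```scala
// Hedged reconstruction of the repro: compare CPU-written vs GPU-written stats.
import org.apache.spark.sql.functions.col

def writeAndReadStats(path: String, gpu: Boolean): Array[String] = {
  spark.conf.set("spark.rapids.sql.enabled", gpu.toString)
  spark.range(0, 10).toDF("id").coalesce(1)
    .write.mode("overwrite").format("delta").save(path)
  // Each line of the first commit is a JSON action; "add" actions carry the
  // per-file stats as a JSON-encoded string.
  spark.read.json(s"$path/_delta_log/00000000000000000000.json")
    .where(col("add").isNotNull)
    .select(col("add.stats"))
    .collect()
    .map(_.getString(0))
}

// On DB 14.3 the CPU stats are expected to contain "tightBounds":true,
// while the GPU stats do not.
println(writeAndReadStats("/tmp/delta_cpu_repro", gpu = false).mkString("\n"))
println(writeAndReadStats("/tmp/delta_gpu_repro", gpu = true).mkString("\n"))
```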