[BUG] [DB 14.3] tightBounds stat in Delta Lake tables is set incorrectly #12027

mythrocks opened this issue Jan 25, 2025 · 6 comments

This has to do with the following failing tests, described in #11541:

1. delta_lake_write_test.py::test_delta_write_round_trip_unmanaged
2. delta_lake_write_test.py::test_delta_overwrite_round_trip_unmanaged
3. delta_lake_write_test.py::test_delta_append_round_trip_unmanaged
4. delta_lake_write_test.py::test_delta_write_round_trip_cdf_write_opt
5. delta_lake_write_test.py::test_delta_write_aqe_join

In all these tests, one sees that a Delta Lake table written with spark-rapids does not set the tightBounds stat correctly in the Delta Lake metadata files.

Repro

Write a simple delta table out on Databricks 14.3:

spark.range(0, 10).toDF("id").coalesce(1).write.format("delta").save("/tmp/delta_test")

In the table's _delta_log/00000000000000000000.json, one sees:

"stats": "{\"numRecords\":10,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":9},\"nullCount\":{\"id\":0},\"tightBounds\":true}"

The same stats, for a table written with spark-rapids, look like:

"stats": "{\"numRecords\":10,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":9},\"nullCount\":{\"id\":0}}"

Note that the tightBounds stat goes missing.

Without this stat, the tables can't be deemed equivalent, and the tests fail.
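
For a quick check, the stats strings can be pulled straight out of the commit file with something like the following (a rough sketch in the Scala shell; the path matches the repro above):

    // Read the first commit as JSON-lines and print the per-file stats strings.
    // Each "add" action carries a stats string; look for "tightBounds" in the output.
    val commit = spark.read.json("/tmp/delta_test/_delta_log/00000000000000000000.json")
    commit.where("add is not null")
      .select("add.stats")
      .collect()
      .foreach(row => println(row.getString(0)))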

mythrocks commented Jan 25, 2025

DB 14.3 might be the first platform we've run on that supports deletion vectors. From tracing, this does not appear to be related to the conf value set for deletionVectors.disableTightBoundOnFileCreationForDevOnly. (Aside: this isn't an exemplary config name; it returns "true" if the feature is disabled.)

It appears that the failure is the result of the shim not being detected as capable of supporting deletion vectors. I'm still investigating.

mythrocks commented:

I'm seeing some truly baffling behaviour, and discrepancies between successive runs of the following test code:

spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save("/tmp/gpu_delta_out")
  1. On the first run (when the output directory does not exist), deletionVectorSupported == false.
  2. On the very next run, deletionVectorSupported == true. Thereafter, tightBounds is either set or unset, depending on the value of deletionVectors.disableTightBoundOnFileCreationForDevOnly. I've seen both, and I'm not sure where the setting comes from.

I'll hit this afresh tomorrow morning.

revans2 commented Jan 28, 2025

If tightBounds is set, it indicates that the min/max values stored in the parquet file and/or the metadata are exact, meaning the max value recorded is truly the maximum value in the file/row group. If a row is deleted using deletion vectors, then technically the min/max bounds are no longer tight: they might be wider than what would be seen if the file were actually read. tightBounds only makes sense in a context where something can be added or removed without actually modifying the file. Not sure why it might be different for different runs. It should always be set if the table is being created, but I don't understand the disable...ForDevOnly config or how we would be using it in the tests.
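
To make that concrete, here is a rough sketch (table name, path, and the expected stats behaviour are illustrative assumptions, not verified output) of how deletion vectors widen the bounds without rewriting the file:

    // Create a small table with deletion vectors enabled, then delete one row.
    // The data file is kept and a deletion vector is attached to it; the file's
    // recorded maxValues.id stays 9 even though 9 is no longer visible, so the
    // bounds are no longer tight (the stats are expected to carry tightBounds = false).
    spark.sql("""
      CREATE TABLE dv_demo (id BIGINT) USING delta
      LOCATION '/tmp/dv_demo'
      TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
    """)
    spark.range(0, 10).toDF("id").write.format("delta").mode("append").save("/tmp/dv_demo")
    spark.sql("DELETE FROM delta.`/tmp/dv_demo` WHERE id = 9")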

mythrocks commented Jan 28, 2025

It should always be set if the table is being created, but I don't understand the disable...ForDevOnly config or how we would be using it in the tests.

The documentation in the code wasn't the easiest to follow. This looks like an internal configuration for spark-rapids testing.

The non-deterministic behaviour I was referring to: On some GPU runs, I found that deletionVectors.disableTightBoundOnFileCreationForDevOnly was being set implicitly. I couldn't determine where, last night. I'm debugging now.

For completeness, here is the definition of deletionVectors.disableTightBoundOnFileCreationForDevOnly:
https://github.com/delta-io/delta/blob/a6db4e05d943fec019522c9c08ff0b64125cbc5d/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSQLConf.scala#L2118
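
For reference, its current value can be checked at runtime roughly like this (a sketch; the fully qualified key is an assumption based on DeltaSQLConf's usual spark.databricks.delta. prefix):

    // Check the dev-only conf; it defaults to false, i.e. tight bounds are
    // NOT disabled on file creation.
    val key = "spark.databricks.delta.deletionVectors.disableTightBoundOnFileCreationForDevOnly"
    println(spark.conf.get(key, "false"))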

mythrocks commented Jan 30, 2025

A little bit of progress.

The problem is seen when a delta table file is written as follows:

spark.range(0, 10).toDF("id").coalesce(1).write.format("delta").save("/tmp/delta_test")

The _delta_log/*.json metadata does not contain the tightBounds stat when written by spark-rapids, while Databricks does write this stat.

The reason lies in GpuStatisticsCollection.scala:

    // On file initialization/stat recomputation TIGHT_BOUNDS is always set to true
    val tightBoundsColOpt = if (deletionVectorsSupported &&
        !RapidsDeltaUtils.getTightBoundColumnOnFileInitDisabled(spark)) {
      Some(lit(true).as("tightBounds"))
    } else {
      None
    }

deletionVectorsSupported is deemed false for a newly created file.

deletionVectorsSupported is set in GpuOptimisticTransaction.scala:

override val deletionVectorsSupported =
  protocol.isFeatureSupported(DeletionVectorsTableFeature)

At some point, the Delta log protocol ends up being switched from (3,7) (i.e. supporting deletion vectors) to (1,2) (i.e. no deletion vector support).

It turns out that this happens, seemingly as a side effect of calling update() on the DeltaLog:

  def this(deltaLog: DeltaLog, rapidsConf: RapidsConf)(implicit clock: Clock) = {
    this(deltaLog, deltaLog.update(), rapidsConf)
  }
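
For debugging, the protocol of the latest snapshot can be inspected directly (a sketch using the open-source DeltaLog API; the path matches the repro above):

    import org.apache.spark.sql.delta.DeltaLog

    // Print the reader/writer protocol versions of the latest snapshot.
    // (3,7) means table features (deletion vectors possible); (1,2) is the legacy protocol.
    val snapshot = DeltaLog.forTable(spark, "/tmp/delta_test").update()
    val p = snapshot.protocol
    println(s"protocol: (${p.minReaderVersion}, ${p.minWriterVersion})")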

So shouldn't this failure also happen when running this test on Delta IO (2.4) and vanilla Apache Spark (say 3.4.3)? No, because that platform does not support deletion vectors by default. I have verified that when deletion vectors are enabled, the same problem occurs on Spark 3.4.3, i.e. the stat goes missing when we write with spark-rapids, but not when we write on the CPU.
Note: Deletion vectors can be enabled on Delta IO via the following:

spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", true)

The conclusion is that a fix here for Databricks 14.3 should apply uniformly to Spark 3.4.3 as well. We'd have to detect that deletion vectors are supported/enabled on the platform, and write the tightBounds stat out.
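
In other words, the decision probably needs to key off the protocol of the snapshot the transaction is actually writing against, rather than anything platform-specific. A rough sketch of the intended check (names borrowed from the snippets above; not the actual fix):

    import org.apache.spark.sql.delta.{DeletionVectorsTableFeature, DeltaLog}
    import org.apache.spark.sql.functions.lit

    // Decide whether the tightBounds column should be written, based on whether
    // the table's protocol supports the deletion-vectors table feature.
    val snapshot = DeltaLog.forTable(spark, "/tmp/delta_test").update()
    val deletionVectorsSupported = snapshot.protocol.isFeatureSupported(DeletionVectorsTableFeature)
    val tightBoundsColOpt =
      if (deletionVectorsSupported) Some(lit(true).as("tightBounds")) else None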

mythrocks commented Feb 4, 2025

I have verified that when deletion vectors are enabled, the same problem occurs on Spark 3.4.3, i.e. the stat goes missing when we write with spark-rapids, but not when we write on the CPU.

Another bit of information: when the same test is run on Spark 3.4.3 + Delta.io 2.4, but with no setting specified for enableDeletionVectors, the snapshot protocol indicates versions (1,2) instead of (3,7), and tightBounds ends up being skipped.

@razajafri, I think we might have come to an incorrect conclusion about the behaviour: we thought that DeltaLog.update() was causing the protocol version to revert from (3,7) to (1,2). That isn't the case on the CPU: depending on the value of enableDeletionVectors, the protocol starts off as either (3,7) or (1,2), and then remains unchanged.
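
This is easy to double-check on the CPU with two fresh tables (a sketch; the paths are arbitrary):

    import org.apache.spark.sql.delta.DeltaLog

    // Create one table with the deletion-vector default enabled and one without,
    // then compare the protocol each table starts with.
    spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", "true")
    spark.range(0, 5).toDF("id").write.format("delta").save("/tmp/dv_on")

    spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", "false")
    spark.range(0, 5).toDF("id").write.format("delta").save("/tmp/dv_off")

    Seq("/tmp/dv_on", "/tmp/dv_off").foreach { path =>
      val p = DeltaLog.forTable(spark, path).update().protocol
      println(s"$path -> (${p.minReaderVersion}, ${p.minWriterVersion})")   // expect (3,7) vs (1,2)
    }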
