[BUG] [DB 14.3] tightBounds stat in Delta Lake tables is set incorrectly #12027

mythrocks opened this issue Jan 25, 2025 · 6 comments

This has to do with the following failing tests, described in #11541:

1. delta_lake_write_test.py::test_delta_write_round_trip_unmanaged
2. delta_lake_write_test.py::test_delta_overwrite_round_trip_unmanaged
3. delta_lake_write_test.py::test_delta_append_round_trip_unmanaged
4. delta_lake_write_test.py::test_delta_write_round_trip_cdf_write_opt
5. delta_lake_write_test.py::test_delta_write_aqe_join

In all these tests, one sees that a Delta Lake table written with spark-rapids does not set the tightBounds stat correctly in the Delta Lake metadata files.

Repro

Write a simple delta table out on Databricks 14.3:

spark.range(0, 10).toDF("id").coalesce(1).write.format("delta").save("/tmp/delta_test")

In the table's _delta_log/00000000000000000000.json, one sees:

"stats": "{\"numRecords\":10,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":9},\"nullCount\":{\"id\":0},\"tightBounds\":true}"

The same stats, for a table written with spark-rapids, look like:

"stats": "{\"numRecords\":10,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":9},\"nullCount\":{\"id\":0}}"

Note that the tightBounds stat goes missing.

Without this stat, the tables can't be deemed equivalent, and the tests fail.
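
For a quick check, the stats strings can be pulled straight out of the commit file with something like the following (a rough sketch in the Scala shell; the path matches the repro above):

    // Read the first commit as JSON-lines and print the per-file stats strings.
    // Each "add" action carries a stats string; look for "tightBounds" in the output.
    val commit = spark.read.json("/tmp/delta_test/_delta_log/00000000000000000000.json")
    commit.where("add is not null")
      .select("add.stats")
      .collect()
      .foreach(row => println(row.getString(0)))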

mythrocks commented Jan 25, 2025

DB 14.3 might be the first platform we've run on that supports deletion vectors. From tracing, this does not appear to be related to the conf value set for deletionVectors.disableTightBoundOnFileCreationForDevOnly. (Aside: this isn't an exemplary config name; it returns "true" if the feature is disabled.)

It appears that the failure is the result of the shim not being detected as capable of supporting deletion vectors. I'm still investigating.

mythrocks commented:

I'm seeing some truly baffling behaviour, and discrepancies between successive runs of the following test code:

spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save("/tmp/gpu_delta_out")
  1. On the first run (when the output directory does not exist), deletionVectorSupported == false.
  2. On the very next run, deletionVectorSupported == true. Thereafter, tightBounds is either set or unset, depending on the value of deletionVectors.disableTightBoundOnFileCreationForDevOnly. I've seen both, and I'm not sure where the setting comes from.

I'll hit this afresh tomorrow morning.

revans2 commented Jan 28, 2025

If tightBounds is set, it indicates that the min/max values stored in the parquet file and/or the metadata are exact, meaning the max value recorded is truly the maximum value in the file/row group. If a row is deleted using deletion vectors, then technically the min/max bounds are no longer tight: they might be wider than what would be seen if the file were actually read. tightBounds only makes sense in a context where something can be added or removed without actually modifying the file. Not sure why it might be different for different runs. It should always be set if the table is being created, but I don't understand the disable...ForDevOnly config or how we would be using it in the tests.
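
To make that concrete, here is a rough sketch (table name, path, and the expected stats behaviour are illustrative assumptions, not verified output) of how deletion vectors widen the bounds without rewriting the file:

    // Create a small table with deletion vectors enabled, then delete one row.
    // The data file is kept and a deletion vector is attached to it; the file's
    // recorded maxValues.id stays 9 even though 9 is no longer visible, so the
    // bounds are no longer tight (the stats are expected to carry tightBounds = false).
    spark.sql("""
      CREATE TABLE dv_demo (id BIGINT) USING delta
      LOCATION '/tmp/dv_demo'
      TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
    """)
    spark.range(0, 10).toDF("id").write.format("delta").mode("append").save("/tmp/dv_demo")
    spark.sql("DELETE FROM delta.`/tmp/dv_demo` WHERE id = 9")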

mythrocks commented Jan 28, 2025

It should always be set if the table is being created, but I don't understand the disable...ForDevOnly config or how we would be using it in the tests.

The documentation in the code wasn't the easiest to follow. This looks like an internal configuration for spark-rapids testing.

The non-deterministic behaviour I was referring to: On some GPU runs, I found that deletionVectors.disableTightBoundOnFileCreationForDevOnly was being set implicitly. I couldn't determine where, last night. I'm debugging now.

For completeness, here is the definition of deletionVectors.disableTightBoundOnFileCreationForDevOnly:
https://github.com/delta-io/delta/blob/a6db4e05d943fec019522c9c08ff0b64125cbc5d/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSQLConf.scala#L2118
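
For reference, its current value can be checked at runtime roughly like this (a sketch; the fully qualified key is an assumption based on DeltaSQLConf's usual spark.databricks.delta. prefix):

    // Check the dev-only conf; it defaults to false, i.e. tight bounds are
    // NOT disabled on file creation.
    val key = "spark.databricks.delta.deletionVectors.disableTightBoundOnFileCreationForDevOnly"
    println(spark.conf.get(key, "false"))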

mythrocks commented Jan 30, 2025

A little bit of progress.

The problem is seen when a delta table file is written as follows:

spark.range(0, 10).toDF("id").coalesce(1).write.format("delta").save("/tmp/delta_test")

The _delta_log/*.json metadata does not contain the tightBounds stat when written by spark-rapids, while Databricks does write this stat.

The reason lies in GpuStatisticsCollection.scala:

    // On file initialization/stat recomputation TIGHT_BOUNDS is always set to true
    val tightBoundsColOpt = if (deletionVectorsSupported &&
        !RapidsDeltaUtils.getTightBoundColumnOnFileInitDisabled(spark)) {
      Some(lit(true).as("tightBounds"))
    } else {
      None
    }

deletionVectorsSupported is deemed false for a newly created file.

deletionVectorsSupported is set in GpuOptimisticTransaction.scala:

override val deletionVectorsSupported =
  protocol.isFeatureSupported(DeletionVectorsTableFeature)

At some point, the Delta log protocol ends up being switched from (3,7) (i.e. supporting deletion vectors) to (1,2) (i.e. no deletion vector support).

It turns out that this happens, seemingly as a side effect of calling update() on the DeltaLog:

  def this(deltaLog: DeltaLog, rapidsConf: RapidsConf)(implicit clock: Clock) = {
    this(deltaLog, deltaLog.update(), rapidsConf)
  }
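
For debugging, the protocol of the latest snapshot can be inspected directly (a sketch using the open-source DeltaLog API; the path matches the repro above):

    import org.apache.spark.sql.delta.DeltaLog

    // Print the reader/writer protocol versions of the latest snapshot.
    // (3,7) means table features (deletion vectors possible); (1,2) is the legacy protocol.
    val snapshot = DeltaLog.forTable(spark, "/tmp/delta_test").update()
    val p = snapshot.protocol
    println(s"protocol: (${p.minReaderVersion}, ${p.minWriterVersion})")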

So shouldn't this failure also happen when running this test on Delta IO (2.4) and vanilla Apache Spark (say 3.4.3)? No, because that platform does not support deletion vectors by default. I have verified that when deletion vectors are enabled, the same problem occurs on Spark 3.4.3, i.e. the stat goes missing when we write with spark-rapids, but not when we write on the CPU.
Note: Deletion vectors can be enabled on Delta IO via the following:

spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", true)

The conclusion is that a fix here for Databricks 14.3 should apply uniformly to Spark 3.4.3 as well. We'd have to detect that deletion vectors are supported/enabled on the platform, and write the tightBounds stat out.
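
In other words, the decision probably needs to key off the protocol of the snapshot the transaction is actually writing against, rather than anything platform-specific. A rough sketch of the intended check (names borrowed from the snippets above; not the actual fix):

    import org.apache.spark.sql.delta.{DeletionVectorsTableFeature, DeltaLog}
    import org.apache.spark.sql.functions.lit

    // Decide whether the tightBounds column should be written, based on whether
    // the table's protocol supports the deletion-vectors table feature.
    val snapshot = DeltaLog.forTable(spark, "/tmp/delta_test").update()
    val deletionVectorsSupported = snapshot.protocol.isFeatureSupported(DeletionVectorsTableFeature)
    val tightBoundsColOpt =
      if (deletionVectorsSupported) Some(lit(true).as("tightBounds")) else None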

mythrocks commented Feb 4, 2025

I have verified that when deletion vectors are enabled, the same problem occurs on Spark 3.4.3, i.e. the stat goes missing when we write with spark-rapids, but not when we write on the CPU.

Another bit of information: when the same test is run on Spark 3.4.3 + Delta.io 2.4, but with no setting specified for enableDeletionVectors, the snapshot protocol indicates versions (1,2) instead of (3,7), and tightBounds ends up being skipped.

@razajafri, I think we might have come to an incorrect conclusion about the behaviour: we thought that DeltaLog.update() was causing the protocol version to revert from (3,7) to (1,2). That isn't the case on the CPU: depending on the value of enableDeletionVectors, the protocol starts off as either (3,7) or (1,2), and then remains unchanged.
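
This is easy to double-check on the CPU with two fresh tables (a sketch; the paths are arbitrary):

    import org.apache.spark.sql.delta.DeltaLog

    // Create one table with the deletion-vector default enabled and one without,
    // then compare the protocol each table starts with.
    spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", "true")
    spark.range(0, 5).toDF("id").write.format("delta").save("/tmp/dv_on")

    spark.conf.set("spark.databricks.delta.properties.defaults.enableDeletionVectors", "false")
    spark.range(0, 5).toDF("id").write.format("delta").save("/tmp/dv_off")

    Seq("/tmp/dv_on", "/tmp/dv_off").foreach { path =>
      val p = DeltaLog.forTable(spark, path).update().protocol
      println(s"$path -> (${p.minReaderVersion}, ${p.minWriterVersion})")   // expect (3,7) vs (1,2)
    }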
