
v0.7.0

@osopardo1 released this 28 Aug 12:50

We are glad to announce the availability of our next release, 0.7.0, with new performance gains, new APIs, and some bug fixes 🐛

What's changed

Reduce metadata memory footprint in #335

Processing the Qbeast metadata in the Delta Log through the driver's memory had a very high cost compared to using DataFrames. The log is now read the minimum number of times and processed in a distributed manner, leading to 10x faster queries with sampling and more efficient writes (less information broadcast and collected).
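As an illustration, a sampling read like the following is where the speedup shows up (a minimal sketch; the path and fraction are ours, not from the release):

// Assumes an active SparkSession `spark` with the qbeast format available.
// Sampling is pushed down to the Qbeast index, so only the data matching
// the requested fraction is read; the path and fraction are illustrative.
val qbeastDf = spark.read.format("qbeast").load("/tmp/qbeast-table")
qbeastDf.sample(0.1).count()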

Updated packages: removal of qbeast-core as a side effect of #335

Since we need the DataFrame and Spark APIs to process Qbeast metadata efficiently, keeping the interfaces in a separate module only added overhead.

Everything is unified into one single artifact: qbeast-spark. This does not change anything in your Spark configuration, but it simplifies the software architecture and dependency tree. 😄
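For reference, a minimal session setup now only needs the single qbeast-spark artifact. The class names below follow the project README; treat the version numbers as illustrative and use the ones matching your Spark installation:

import org.apache.spark.sql.SparkSession

// Single-artifact setup: qbeast-core is gone, only qbeast-spark is needed.
// Versions are illustrative; pick the ones matching your Spark installation.
val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.jars.packages", "io.qbeast:qbeast-spark_2.12:0.7.0")
  .config("spark.sql.extensions", "io.qbeast.spark.internal.QbeastSparkSessionExtension")
  .config(
    "spark.sql.catalog.spark_catalog",
    "io.qbeast.spark.internal.sources.catalog.QbeastCatalog")
  .getOrCreate()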

Pre-commit hook feature #319

A good way to attach extra metadata to exactly-once transactions is to use pre-commit hooks.

We added an interface for running your own code between the writing of the files and the commit of the transaction. The mechanism also adds a reference to the hook's output in the CommitInfo.

Example:

  1. Create your custom hook
import io.qbeast.spark.delta.hook.PreCommitHook
import io.qbeast.spark.delta.hook.PreCommitHook.PreCommitHookOutput
import org.apache.spark.sql.delta.actions.Action

class SimpleHook extends PreCommitHook {

  override val name: String = "SimpleHook"

  override def run(actions: Seq[Action]): PreCommitHookOutput = {
    Map("clsName" -> "SimpleHook")
  }

}
  2. Write data with a SimpleHook
import io.test.hook.SimpleHook
import spark.implicits._

val tmpDir = "/tmp/test"
val df = spark.sparkContext.range(0, 100).toDF()

(df
	.write
	.mode("append")
	.format("qbeast")
	.option("columnsToIndex", "value")
	.option("qbeastPreCommitHook.hook", classOf[SimpleHook].getCanonicalName)
	.save(tmpDir)
)
  3. Check results
cat /tmp/test/_delta_log/00000000000000000000.json | jq
{
  "commitInfo": {
    ...,
    "tags": {
      "clsName": "SimpleHook"
    },
    ...
  }
}

You can find all the details in Advanced Configuration.

New APIs

  • Merged #330: New IndexMetrics statistics to account for multi-block files (see the retrieval sketch after this list).
OTree Index Metrics:
revisionId: 1
elementCount: 309008
dimensionCount: 2
desiredCubeSize: 3000
indexingColumns: price:linear,user_id:linear
height: 8 (4)
avgFanout: 3.94 (4.0)
cubeCount: 230
blockCount: 848
fileCount: 61
bytes: 11605372

Multi-block files stats:
cubeElementCountStats: (count: 230, avg: 1343, std: 1186, quartiles: (1,292,858,2966,3215))
blockElementCountStats: (count: 848, avg: 364, std: 829, quartiles: (1,6,21,42,3120))
fileBytesStats: (count: 61, avg: 190252, std: 57583, quartiles: (113168,139261,182180,215136,332851))
blockCountPerCubeStats: (count: 230, avg: 3, std: 1, quartiles: (1,4,4,4,4))
blockCountPerFileStats: (count: 61, avg: 13, std: 43, quartiles: (1,3,5,5,207))

...
  • Merged #356: Retrieve Revision information from QbeastTable
import io.qbeast.spark.QbeastTable

// Init QbeastTable
val qbeastTable = QbeastTable.forPath(spark, "/tmp/dir")

qbeastTable.allRevisions() // List of Revision
qbeastTable.revision(1) // Information of Revision with ID=1
qbeastTable.latestRevision // Information of latest available revision
  • Merged #332: New utility method to compute a histogram for a column.
import io.qbeast.spark.utils.QbeastUtils

val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
val statsStr = s"""{"brand_histogram":$brandStats}"""

(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath))
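
Regarding #330 above, the IndexMetrics report is obtained from QbeastTable (a minimal sketch; the path is illustrative):

import io.qbeast.spark.QbeastTable

// Build the table handle and print the OTree index metrics shown above.
val qbeastTable = QbeastTable.forPath(spark, "/tmp/qbeast-table")
val metrics = qbeastTable.getIndexMetrics()
println(metrics)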

Bug fixes

  • Issue #337: Remove compact() operation; use optimize() instead (see the sketch after this list).
  • Issue #333: Broadcast cube weights during optimization file writing
  • Issue #339: Persist property changes from SET TBLPROPERTIES operations in the _delta_log
  • Issue #340: ConvertToQbeast should work for table paths containing namespaces
  • Issue #344: Log sampling filtering stats only in debug mode.
  • Issue #352: Sampling Error when explicit type is set in columnsToIndex
  • Update .md qb-spark files
  • Issue #360: Update README.md
  • Issue #363: Update markdown file links
  • Issue #365: Fix broken link in README
  • Issue #366: TBLPROPERTIES consistency between log, catalog, and Qbeast internals
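
As noted for #337 above, compact() is gone and optimization now goes through QbeastTable (a minimal sketch; the path is illustrative):

import io.qbeast.spark.QbeastTable

// optimize() rewrites data to improve the index layout; it replaces
// the removed compact() operation.
val qbeastTable = QbeastTable.forPath(spark, "/tmp/qbeast-table")
qbeastTable.optimize()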

New contributors

Full changelog: v0.6.0...v0.7.0