v0.7.0
We are glad to announce the availability of our next release, 0.7.0! It ships new performance gains, new APIs, and some bug fixes 🐛
What's changed
Reduce metadata memory footprint (#335)
Processing the Qbeast metadata in the Delta Log through the driver's memory had a much higher impact than using DataFrames. We now ensure the Log is read a minimal number of times and processed in a distributed manner, leading to 10x faster queries with Sampling and more efficient writes (less information broadcast and collected).
Updated packages: removal of qbeast-core
As a side effect of #335: since we need the DataFrame and Spark APIs to process Qbeast metadata efficiently, we removed the overhead of keeping the interfaces in a separate module. Everything is now unified into a single artifact: `qbeast-spark`. This does not change anything in your Spark configuration, but it simplifies the software architecture and the dependency tree. 😄
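For reference, pulling in the unified artifact now takes a single dependency. A minimal sbt sketch (double-check the exact coordinates and Spark/Scala compatibility on Maven Central):

```scala
// build.sbt — one artifact instead of qbeast-spark + qbeast-core
libraryDependencies += "io.qbeast" %% "qbeast-spark" % "0.7.0"
```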
Pre-commit hook feature (#319)
A good way to ensure exactly-once transactions with extra metadata is to use pre-commit hooks. We added an interface for implementing your own code to run between the writing of the files and the committing of the information. The mechanism also adds a reference to the hook in the `CommitInfo`.
Example:
- Create your custom hook:

```scala
import io.qbeast.spark.delta.hook.PreCommitHook
import io.qbeast.spark.delta.hook.PreCommitHook.PreCommitHookOutput
import org.apache.spark.sql.delta.actions.Action

class SimpleHook extends PreCommitHook {

  override val name: String = "SimpleHook"

  override def run(actions: Seq[Action]): PreCommitHookOutput = {
    Map("clsName" -> "SimpleHook")
  }
}
```
- Write data with the `SimpleHook`:

```scala
import io.test.hook.SimpleHook
import spark.implicits._

val tmpDir = "/tmp/test"
val df = spark.sparkContext.range(0, 100).toDF()

df.write
  .mode("append")
  .format("qbeast")
  .option("columnsToIndex", "value")
  .option("qbeastPreCommitHook.hook", classOf[SimpleHook].getCanonicalName)
  .save(tmpDir)
```
- Check the results:

```bash
cat /tmp/test/_delta_log/00000000000000000000.json | jq
```

```json
{
  "commitInfo": {
    ...,
    "tags": {
      "clsName": "SimpleHook"
    },
    ...
  }
}
```
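Since a hook receives the full list of transaction actions, it can derive its tags from them. A minimal sketch of that pattern with a hypothetical `FileCountHook` (the class name and tag key are ours; it assumes Delta's `AddFile` action type is available, as in the imports above):

```scala
import io.qbeast.spark.delta.hook.PreCommitHook
import io.qbeast.spark.delta.hook.PreCommitHook.PreCommitHookOutput
import org.apache.spark.sql.delta.actions.{Action, AddFile}

// Hypothetical hook: tags the commit with the number of files it adds
class FileCountHook extends PreCommitHook {

  override val name: String = "FileCountHook"

  override def run(actions: Seq[Action]): PreCommitHookOutput = {
    val addedFiles = actions.count(_.isInstanceOf[AddFile])
    Map("addedFiles" -> addedFiles.toString)
  }
}
```

Registering it works exactly like `SimpleHook`, via the `qbeastPreCommitHook.hook` option.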
You can find all the details in the Advanced Configuration section.
New APIs
- Merged #330: New `IndexMetrics` statistics to account for multi-block files.
```
OTree Index Metrics:
revisionId: 1
elementCount: 309008
dimensionCount: 2
desiredCubeSize: 3000
indexingColumns: price:linear,user_id:linear
height: 8 (4)
avgFanout: 3.94 (4.0)
cubeCount: 230
blockCount: 848
fileCount: 61
bytes: 11605372

Multi-block files stats:
cubeElementCountStats: (count: 230, avg: 1343, std: 1186, quartiles: (1,292,858,2966,3215))
blockElementCountStats: (count: 848, avg: 364, std: 829, quartiles: (1,6,21,42,3120))
fileBytesStats: (count: 61, avg: 190252, std: 57583, quartiles: (113168,139261,182180,215136,332851))
blockCountPerCubeStats: (count: 230, avg: 3, std: 1, quartiles: (1,4,4,4,4))
blockCountPerFileStats: (count: 61, avg: 13, std: 43, quartiles: (1,3,5,5,207))
...
```
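The summary above comes from `QbeastTable`. A short sketch of how to print it, assuming the `getIndexMetrics()` accessor:

```scala
import io.qbeast.spark.QbeastTable

val qbeastTable = QbeastTable.forPath(spark, "/tmp/test")
// Computes the OTree metrics (height, fanout, multi-block stats, ...)
// and prints the summary shown above
println(qbeastTable.getIndexMetrics())
```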
- Merged #356: Retrieve Revision information from `QbeastTable`:
```scala
import io.qbeast.spark.QbeastTable

// Init QbeastTable
val qbeastTable = QbeastTable.forPath(spark, "/tmp/dir")

qbeastTable.allRevisions() // List of Revisions
qbeastTable.revision(1) // Information of the Revision with ID=1
qbeastTable.latestRevision // Information of the latest available Revision
```
- Merged #332: New Compute Histogram utility method.
```scala
import io.qbeast.spark.utils.QbeastUtils

val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
val statsStr = s"""{"brand_histogram":$brandStats}"""

df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath)
```
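Once written with a histogram-indexed column, the table can be read back through the qbeast source, where sampling is pushed down to the index. A sketch of the read side (reusing the `targetPath` from the example above):

```scala
// Read through the qbeast format; .sample() is resolved against the index,
// so only the relevant blocks are scanned instead of the whole table
val fraction = 0.1
val sampleDF = spark.read.format("qbeast").load(targetPath).sample(fraction)
sampleDF.show()
```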
Bug fixes
- Issue #337: Remove the `compact()` operation. Use `optimize()` instead.
- Issue #333: Broadcast cube weights during optimization file writing
- Issue #339: Persist property changes from SET TBLPROPERTIES operations in the _delta_log
- Issue #340: ConvertToQbeast should work for table paths containing namespaces
- Issue #344: Log sampling filtering stats only in debug mode.
- Issue #352: Sampling Error when explicit type is set in columnsToIndex
- Update qbeast-spark .md files
- Issue #360: Update README.md
- Issue #363: Update markdown file links
- Issue #365: Fix broken link in README
- Issue #366: TBLPROPERTIES consistency between log, catalog, and Qbeast internals
New contributors
- @alinagrebenkina made their first contribution in #353
- @JosepSampe made their first contribution in #362
- @jorgeMarin1 made their first contribution in #358
Full changelog: v0.6.0...v0.7.0