Replies: 1 comment
-
Thank you to everyone who has been paying attention to the changes on qbeast-spark. 😄 We finally decided to develop #98: Compaction of small files. You can see the changes in #110 :) There have been other enhancements and bug fixes during this summer as well, and we are happy to announce the upcoming release. Hope you all have a nice summer vacation! We have great things coming ahead 🙌
-
Thank you and Welcome 😄
Hello everyone!
With the 0.2.0 release, we reached a great milestone. We want to take a moment to appreciate all the people who reached out to us or downloaded our software to start playing with the Qbeast Format. Thank you for supporting the qbeast-spark project!
Now it’s time to move forward, and we want to do it together :)
Overview & Purpose
The purpose of this discussion is to elaborate on a tentative roadmap for this summer. We will cover and explain the different features we want to include, and the discussion is open regarding timing and priorities.
Objectives
There are different development objectives we should achieve to make qbeast-spark more production-ready.
There’s no stupid question/suggestion, so feel free to add comments based on what’s more interesting for you or works best for your data environment :)
Compaction
Reference issue #98
In order to be compatible with future versions of Delta or other Table Formats such as Iceberg or Hudi, we should think about how to address the compaction of small files.
We can think again about different approaches:
1. Use the cube as a partition key. We treat all the files that belong to a single cube as a partition. Delta Lake's OPTIMIZE operation lets us specify a predicate for which to do the compaction, e.g.:
   OPTIMIZE table WHERE cube = root
   That way, we can optimize only certain cubes whose data is spread across different files.
2. Directly re-implement the method to compact all small files, whether or not they belong to the same cube.
In both cases, we have to make sure that the correctness of the metadata is maintained. For example, if we collapse file F1 with maxWeight = 0.3 and file F2 with maxWeight = 0.4, the resulting file F3 should have maxWeight = 0.4. The same applies to all the other existing Qbeast metadata.
ConvertToQbeast
Reference issue #102
We can think of different approaches (which can co-exist):
Doubts/things we need to figure out:
convertToQbeast(path, oldFormat, columnsToIndex, cubeSize)
Suggestion: We can handle partitioning by using the partition columns as columnsToIndex (see the example below).
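As a rough illustration, a conversion call could look like this. The method name and parameters come from the proposed signature above; the path, format, and values are made-up examples:

```scala
// Hypothetical usage of the proposed conversion API; the signature comes
// from the discussion above, everything else is an assumed example.
// A Delta table partitioned by `date` is converted to Qbeast format,
// reusing the partition column as one of the columns to index.
convertToQbeast(
  path = "s3://my-bucket/tables/events",    // assumed example location
  oldFormat = "delta",                      // the source table format
  columnsToIndex = Seq("date", "user_id"),  // partition column + a frequently queried column
  cubeSize = 500000                         // assumed target cube size
)
```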
Open Contribution
Feel free to comment on this discussion and share other possible ideas!