Replies: 1 comment
-
Thank you to everyone who has been paying attention to the changes on qbeast-spark. 😄 We finally decided to develop #98: Compaction of small files. You can see the changes in #110 :) There have been other enhancements and bug fixes during this summer as well, and we are happy to announce the upcoming release. Hope you all have a nice summer vacation! We have great things coming ahead 🙌
-
Thank you and Welcome 😄
Hello everyone!
With the 0.2.0 release, we reached a great milestone. We want to take a moment to appreciate all the people who reached out to us or downloaded our software to start playing with the Qbeast Format. Thank you for supporting the qbeast-spark project!
Now it’s time to move forward, and we want to do it together :)
Overview & Purpose
The purpose of this discussion is to elaborate on a tentative roadmap for this summer. We will cover and explain the different features we want to include, and the discussion is open regarding timing and priorities.
Objectives
There are different development objectives we should achieve to make qbeast-spark more production-ready.
There’s no stupid question/suggestion, so feel free to add comments based on what’s more interesting for you or works best for your data environment :)
Compaction
Reference issue #98
In order to be compatible with future versions of Delta or other Table Formats such as Iceberg or Hudi, we should think about how to address the compaction of small files.
We can think again about different approaches:
1. Use the cube as a partition key. We treat all the files that belong to a single cube as a partition. Delta Lake's OPTIMIZE operation lets us specify a predicate for which to do the compaction, e.g.:
   OPTIMIZE table WHERE cube = root
   That way, we can optimize only certain cubes whose data is spread across different files.
2. Directly re-implement the method to compact all small files, whether or not they belong to the same cube.
In both cases, we have to make sure that the correctness of the metadata is maintained. For example, if we collapse file F1 with maxWeight = 0.3 and file F2 with maxWeight = 0.4, the resulting file F3 should have maxWeight = 0.4. The same applies to all the other existing Qbeast metadata.
ConvertToQbeast
Reference issue #102
We can think of different approaches (which can co-exist):
Doubts/things we need to figure out:
convertToQbeast(path, oldFormat, columnsToIndex, cubeSize)
Suggestion: We can handle partitioning by using the partition columns as columnsToIndex (see the example below).
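As a rough illustration, a conversion call could look like this. The method name and parameters come from the proposed signature above; the path, format, and values are made-up examples:

```scala
// Hypothetical usage of the proposed conversion API; the signature comes
// from the discussion above, everything else is an assumed example.
// A Delta table partitioned by `date` is converted to Qbeast format,
// reusing the partition column as one of the columns to index.
convertToQbeast(
  path = "s3://my-bucket/tables/events",    // assumed example location
  oldFormat = "delta",                      // the source table format
  columnsToIndex = Seq("date", "user_id"),  // partition column + a frequently queried column
  cubeSize = 500000                         // assumed target cube size
)
```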
Open Contribution
Feel free to comment on this discussion and share other possible ideas!