[Don't Merge] Setting to control delta job count for each delta write #2031
Conversation
this is not solving the root problem of a large file being loaded at once.
AFAIK delta-rs accepts arrow batches, and it certainly accepts data frames.
IMO:
- keep the original code with a dataset
- add a setting, i.e. a batch size, to split the arrow dataset into batches. we should possibly use multiples of row groups, as when we batch in the normalizer
- feed the batches into delta (a sketch follows below)

this way we conserve memory. tbh we were sure that by using a dataset we let the library do efficient batching internally; it seems the opposite is true.
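A minimal sketch of that idea, assuming pyarrow and the deltalake (delta-rs) Python bindings; `parquet_dir`, `table_uri`, and `batch_size` are illustrative names, not existing dlt configuration:

```python
import pyarrow.dataset as ds
from deltalake import write_deltalake

def append_in_batches(parquet_dir: str, table_uri: str, batch_size: int = 100_000) -> None:
    """Stream an Arrow dataset into a Delta table in bounded batches."""
    dataset = ds.dataset(parquet_dir, format="parquet")
    # to_batches() yields RecordBatches lazily, so only roughly one batch
    # needs to be materialized in memory at a time.
    for batch in dataset.to_batches(batch_size=batch_size):
        write_deltalake(table_uri, batch, mode="append")
```

Note that each `write_deltalake` call above produces its own Delta commit, which is the transactionality concern raised further down in this thread.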
@rudolfix the underlying issue in the rust implementation is being worked on here: delta-io/delta-rs#2289. I'm not sure we should fix this ourselves... My worry when we feed in batches is that if the job somehow aborts for transient reasons, the whole reference job will be marked as unloaded, all rows of all files are loaded again on the next try, and you can end up with double entries. If we split this up into multiple jobs, this will not happen, or if it does, the number of duplicated rows is smaller.
Regarding the suggested solution:
My take:
As a workaround for now, users can prevent memory issues in the
@Gilbert09 is loading the table in multiple loads an option until the rust implementation is fixed? I tend to agree with @jorritsandbrink that the rust implementation should behave properly. @jorritsandbrink fyi: @Gilbert09 is loading large tables with the merge write_disposition.
No, this doesn't work right now, I'm having to force users to use
It's certainly slower - there's a non-nominal cost in overheads coming from somewhere between each job (likely a delta-rs overhead). In theory, this doesn't actually need to be merged. I've solved half the issue by monkey patching our DLT implementation (see PostHog/posthog#26040) - the real solution here is the rust implementation being fixed, which looks like it has some progress now.
@Gilbert09 alright, then I think what I'll do is add a warning to the docs page for now and not merge this.
right! we'll need to use transactions then, possibly accumulating all data again in memory... waiting for the delta-rs fix is probably the best way. should we close this PR?
closing as won't fix (or rather, it will hopefully be fixed by delta-rs)
Description
NOTE: we'll keep this here in case other users have this problem before the rust implementation is fixed.
See the linked ticket. If we write all jobs at once, the rust implementation apparently loads all jobs into memory. When large tables are loaded, this ends up creating a large memory footprint.
We may want to add a setting that controls the maximum combined size of all jobs in one write operation. For that we would have to read the file size of each job from the filesystem, but that is doable. I am not sure.
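A minimal sketch of what such a setting could look like, assuming the loader has the local paths of the job files; `chunk_jobs_by_size` and `max_bytes_per_write` are hypothetical names, not existing dlt settings:

```python
import os
from typing import Iterator, List

def chunk_jobs_by_size(job_paths: List[str], max_bytes_per_write: int) -> Iterator[List[str]]:
    """Group job files into write chunks whose combined size stays under a limit."""
    chunk: List[str] = []
    chunk_bytes = 0
    for path in job_paths:
        size = os.path.getsize(path)  # read the file size from the filesystem
        if chunk and chunk_bytes + size > max_bytes_per_write:
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(path)
        chunk_bytes += size
    if chunk:
        yield chunk

# Each yielded chunk would then be passed to a single Delta write operation.
```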
ToDo: