This is tracked in #2776. Currently we read the logs twice: once to fetch the metadata and protocol actions, and once more for the add actions. No caching is done yet.
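To illustrate what caching could look like, here is a minimal sketch of a read-through cache that serves both passes from a single remote read per commit file. Everything in it is hypothetical: CachedLog, load_with_cache, and the fetch closure are made-up names standing in for a remote object store GET, and this is not how delta-rs is implemented or how #2776 proposes to solve it.

```rust
use std::collections::HashMap;

/// Hypothetical read-through cache. `fetch` stands in for a remote
/// object store GET; it is not a delta-rs or object_store API.
struct CachedLog<F: FnMut(&str) -> Vec<u8>> {
    fetch: F,
    cache: HashMap<String, Vec<u8>>,
    remote_reads: usize,
}

impl<F: FnMut(&str) -> Vec<u8>> CachedLog<F> {
    fn new(fetch: F) -> Self {
        Self { fetch, cache: HashMap::new(), remote_reads: 0 }
    }

    /// Returns the file contents, calling `fetch` only on the first request for a path.
    fn get(&mut self, path: &str) -> &[u8] {
        if !self.cache.contains_key(path) {
            let bytes = (self.fetch)(path);
            self.remote_reads += 1;
            self.cache.insert(path.to_string(), bytes);
        }
        &self.cache[path]
    }
}

/// Runs two passes over the same commit files and returns the number of remote reads.
fn load_with_cache(commits: &[&str]) -> usize {
    // Assumed placeholder fetcher: fabricates contents instead of doing I/O.
    let mut log = CachedLog::new(|path: &str| format!("contents of {}", path).into_bytes());
    for path in commits {
        let _ = log.get(path); // pass 1: metadata and protocol actions
    }
    for path in commits {
        let _ = log.get(path); // pass 2: add actions, served from the cache
    }
    log.remote_reads
}
```

With two commit files, load_with_cache performs two remote reads instead of four. The trade-off is that the raw bytes of every commit after the checkpoint stay in memory until the second pass is done.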
Environment
Delta-rs version: latest main branch
Bug
What happened:
When reading a delta table via URI (e.g. DeltaTableBuilder::from_uri), all JSON files in the _delta_log directory that come after the current checkpoint are read twice.
What you expected to happen:
When reading a delta table, all JSON files in the _delta_log directory that come after the current checkpoint should be read only once.
Especially when the files are accessed remotely from object store buckets, reading them twice is an issue in terms of both performance and cost.
How to reproduce it:
Run the unit test test_load_table_read_delta_log from my fork: MartinKolbAtWork@e946422
The test uses an adapted ObjectStore implementation, which logs all file access to stdout.
It reads the standard test table from test/tests/data/simple_table, and the output shows that the respective JSON files are read twice.
In my analysis, I found that the two reads are triggered by two consecutive steps in EagerSnapshot::try_new_with_visitor.
The call to Snapshot::try_new triggers the first sequence of reads:
delta-rs/crates/core/src/kernel/snapshot/mod.rs, line 373 in d686336
The call to snapshot.files triggers the second read cascade:
delta-rs/crates/core/src/kernel/snapshot/mod.rs, line 376 in d686336
In my commit containing the test, I augmented the respective lines with println so that these calls show up as reference points in the output.
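For readers who do not want to check out the fork, the idea behind the logging store can be sketched roughly as follows. This is a simplified, std-only illustration and not the code from the commit: LogStore, MemoryStore, CountingStore, and load_with_two_passes are hypothetical names, and the trait is a stand-in for the real ObjectStore trait, which is async and has a much larger surface.

```rust
use std::collections::BTreeMap;

/// Hypothetical, minimal stand-in for an object store (NOT the object_store crate API).
trait LogStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>>;
}

/// In-memory store that maps paths to file contents.
struct MemoryStore {
    files: BTreeMap<String, Vec<u8>>,
}

impl MemoryStore {
    /// Builds a store holding an empty JSON object under each given path.
    fn from_paths(paths: &[&str]) -> Self {
        let files = paths.iter().map(|p| (p.to_string(), b"{}".to_vec())).collect();
        Self { files }
    }
}

impl LogStore for MemoryStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        self.files.get(path).cloned()
    }
}

/// Wrapper that logs every read to stdout and counts reads per path.
struct CountingStore<S: LogStore> {
    inner: S,
    reads: BTreeMap<String, usize>,
}

impl<S: LogStore> CountingStore<S> {
    fn new(inner: S) -> Self {
        Self { inner, reads: BTreeMap::new() }
    }

    /// Paths that were fetched more than once, in sorted order.
    fn duplicate_reads(&self) -> Vec<String> {
        self.reads
            .iter()
            .filter(|(_, count)| **count > 1)
            .map(|(path, _)| path.clone())
            .collect()
    }
}

impl<S: LogStore> LogStore for CountingStore<S> {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        *self.reads.entry(path.to_string()).or_insert(0) += 1;
        println!("GET {}", path);
        self.inner.get(path)
    }
}

/// Simulates the reported behavior: two independent passes over the same commit files.
fn load_with_two_passes<S: LogStore>(store: &mut S, commits: &[&str]) {
    for path in commits {
        let _ = store.get(path); // pass 1: metadata and protocol actions
    }
    for path in commits {
        let _ = store.get(path); // pass 2: add actions
    }
}
```

Wrapping a MemoryStore in a CountingStore, calling load_with_two_passes, and then inspecting duplicate_reads lists every commit file that was fetched more than once, which is the same signal the test output gives via stdout.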