Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double-read of json files from _delta_log #2936

Closed
MartinKolbAtWork opened this issue Oct 10, 2024 · 1 comment
Closed

Double-read of json files from _delta_log #2936

MartinKolbAtWork opened this issue Oct 10, 2024 · 1 comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working

Comments

@MartinKolbAtWork
Copy link

MartinKolbAtWork commented Oct 10, 2024

Environment

Delta-rs version:
latest main branch


Bug

What happened:
When reading a delta table via URI (e.g. DeltaTableBuilder::from_uri), all json files in the _delta_log directory, which are after the current checkpoint are read twice.

What you expected to happen:
When reading a delta table, all json files in the _delta_log directory, which are after the current checkpoint should only be read once.
Especially when the file access is remotely and accesses object store buckets, reading things twice is an issue both in terms of performance and costs.

How to reproduce it:
Start the unit test test_load_table_read_delta_log from my fork:
MartinKolbAtWork@e946422

The test uses an adapted ObjectStore implementation, which logs all file access to stdout.
It reads the standard test table from test/tests/data/simple_table and the output shows that the respective json files are read twice.
In my analysis, I could find out that the two reads are triggered from two subsequent steps in EagerSnapshot::try_new_with_visitor.
The call to Snapshot::try_new triggers the first sequence of reads.

Snapshot::try_new(table_root, store.clone(), config.clone(), version).await?;

The call to snapshot.files triggers the second read cascade.
true => snapshot.files(store, &mut visitors)?.try_collect().await?,

In my commit containing the test, I augmented the respective lines with println to have these calls as reference in the output.

---- test_load_table_read_delta_log stdout ----
Starting Snapshot::try_new
PrintStore::get: location: Path { raw: "_delta_log/_last_checkpoint" }
PrintStore::list_with_offset
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000004.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000003.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000002.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000001.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000000.json" }
Starting snapshot.files
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000004.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000003.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000002.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000001.json" }
PrintStore::get: location: Path { raw: "_delta_log/00000000000000000000.json" }
Finished
@MartinKolbAtWork MartinKolbAtWork added the bug Something isn't working label Oct 10, 2024
@rtyler rtyler added the binding/rust Issues for the Rust crate label Oct 10, 2024
@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Oct 12, 2024

This is tracked here: #2776. Currently we separately read the logs to fetch the metadata and protocol actions, and separately for the add actions, and there is no caching done yet

@ion-elgreco ion-elgreco closed this as not planned Won't fix, can't repro, duplicate, stale Oct 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants