-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor: more clearly delineate between TableParquetOptions
and ParquetWriterOptions
#11444
Refactor: more clearly delineate between TableParquetOptions
and ParquetWriterOptions
#11444
Conversation
…s vs ParquetOptions vs WriterProperties
Possibly related: #11367 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld. I think this PR is a good step towards #11367 (it almost does it all).
I had a few structural / naming suggestions but otherwise I think the basic ideas here
as the ParquetOptions has an incomplete set of all the config needed for the arrow writer (it's missing the kv_metadata & col-specific configurations).
I wonder if we should make ParquetOptions
be consistent with the WriterOptions
? It seems like missing kv_metadata and column overrides might be an oversight rather than an intentional design)
datafusion/common/src/config.rs
Outdated
} | ||
} | ||
|
||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could also add a test here that described in #11367 -- that takes a default ParquetOptions
, creates the ArrowWriterOption
and then ensures the fields are the same as ArrowWriterOptions::default()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this^^ requested test in this commit, but did the equivalency assertion a bit differently. Parquet's WriterProperties
and it's builder do not implement Eq or PartialEq.
Rather than asserting each prop independently based upon the getter, I elected to (a) explicitly define the props that do not match, and then (b) use the same identify function (session_config_from_writer_props
) for the assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make a slight change to avoid a test regression in the sqllogictests.
That^^ change (to avoid test regression) also brings us further from the matching-default-settings between the extern parquet vs datafusion. I added more tests to demonstrate this divergence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the follow up PR, I plan to propose (and open for discussion) which defaults should be changed in datafusion vs parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per request, moving these into a follow up PR. (Make more reviewable.)
The datafusion/datafusion/common/src/config.rs Lines 1395 to 1397 in 8f8df07
Is there any reason to have the global ParquetOptions, without the TableParquetOptions?
|
The SessionState contains multiple copies of the ParquetOptions:
The
. But what is the intended use case for table_options vs file_formats?
|
I am not entirely sure. Given it is not clear from the code, what I would normally do in these circumstances is make a new PR with documentation to those structures and try and explain the differences It would likely take me a while, but the process of trying to document the differences would force me to do the code archeology / spelunking to figure it out, and then once I did I would have a good artifact (new docs) that would
I of course would love to help review such a PR 🎣 |
ok, I'm going to:
Then in separate tickets, and separate followup PRs, will handle these two proposals: cleanup ParquetOptions vs TableParquetOptions, and the proposed cleanup of file_formats. |
Thanks @wiedld A suggestion that I think will help speed up reviews / merging is to try and keep the PRs smaller (like for example make one PR to add docs, and other PRs to add new apis (like conversion from |
…ubmodule, leaving only the config options in the config module.
… column configuration when none exists
a484974
to
9c43625
Compare
…to be left as None
…ired to avoid test regression), result in different default behavior than parquet crate
ab950c5
to
4f43390
Compare
TableParquetOptions
and ParquetWriterOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld
I think this is a step forward and improves the situation
I had a few API suggestions that I think would make it even better but I don't think they are required
Thanks @wiedld |
…arquetWriterOptions` (apache#11444) * refactor: make more explicit the relationship btwn TableParquetOptions vs ParquetOptions vs WriterProperties * test: demonstrate the relationship btwn session configs and writer props * refactor: move parquet-format specific functionality to the parquet submodule, leaving only the config options in the config module. * test: update test fixtures to use the ParquetOptions::default * test: update test helper session_config_from_writer_props, to not add column configuration when none exists * test(11367): write test to demonstrate issue 11367 * fix: existing sqllogictests require specific ParquetOptions settings to be left as None * test(11367): demonstrate how the require bloom filter defaults, (required to avoid test regression), result in different default behavior than parquet crate * chore: make more reviewable, by pulling tests for issue 11367 into followup PR * refactor: move all parquet-associated features into parquet-writer mod * chore: better function naming convention
…arquetWriterOptions` (apache#11444) * refactor: make more explicit the relationship btwn TableParquetOptions vs ParquetOptions vs WriterProperties * test: demonstrate the relationship btwn session configs and writer props * refactor: move parquet-format specific functionality to the parquet submodule, leaving only the config options in the config module. * test: update test fixtures to use the ParquetOptions::default * test: update test helper session_config_from_writer_props, to not add column configuration when none exists * test(11367): write test to demonstrate issue 11367 * fix: existing sqllogictests require specific ParquetOptions settings to be left as None * test(11367): demonstrate how the require bloom filter defaults, (required to avoid test regression), result in different default behavior than parquet crate * chore: make more reviewable, by pulling tests for issue 11367 into followup PR * refactor: move all parquet-associated features into parquet-writer mod * chore: better function naming convention
Which issue does this PR close?
Is a cleanup before doing #11367.
This PR delineates btwn the session-level config vs the per-writer config.
It also adds tests to demonstrate how these two configurations get out-of-sync.Tests in the followup PR.Rationale for this change
TableParquetOptions
andParquetOptions
.WriterProperties
and ParquetWriterOptions (which wrapsWriterProperties
).At first, it looks like we should be able to get the writer props from either session-level config options. (In fact, we made exactly this suggestion ourselves.) But this is misleading; as the
ParquetOptions
has an incomplete set of all the config needed for the arrow writer (it's missing the kv_metadata & col-specific configurations).What changes are included in this PR?
Instead, I've done some cleanup and added documentation to hopefully make the relationship more clear. Additionally, there's a new
ParquetOptions::writer_props_from_global_opts()
that does fulfill our suggestion -- while still making the difference clear.Are these changes tested?
Yes.
Are there any user-facing changes?
A new
ParquetOptions::writer_props_from_global_opts()
.