Regression: DataFrameWriteOptions::with_single_file_output produces a directory #13323
Comments
I agree this is a regression. Thank you for the callout, @sergiimk. I think this is a pretty good first issue for someone, as the description is clear and the need is well defined. |
take |
It seems hard to control the behavior of
|
Did some digging and found this old PR #9041 (cc @yyy1000) that seems to have removed
Looking at v42 code it does indeed seem that
let single_file_output = !base_output_path.is_collection();
which in v43 became:
let single_file_output = !base_output_path.is_collection() && base_output_path.file_extension().is_some();
The
Personally I think that all kinds of extension-based heuristics don't belong in such low level code like
Whichever heuristic version (pre v36, pre v43, or post v43) is the right one - I don't really mind, but I think there should be a way to skip it and specify explicitly. |
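To make the difference between the two quoted conditions concrete, here is a minimal sketch using plain strings. The helpers `is_collection` and `file_extension` below are hypothetical stand-ins, not the DataFusion methods themselves: the sketch assumes that "collection" means the path ends with `/` and that the extension is taken from the last path segment.

```rust
/// Hypothetical stand-in for `is_collection`: assumes a collection is a
/// path that ends with a trailing slash.
fn is_collection(path: &str) -> bool {
    path.ends_with('/')
}

/// Hypothetical stand-in for `file_extension`: the text after the last dot
/// in the final path segment, if that segment has a non-empty stem and suffix.
fn file_extension(path: &str) -> Option<&str> {
    let last_segment = path.rsplit('/').next()?;
    let (stem, ext) = last_segment.rsplit_once('.')?;
    if stem.is_empty() || ext.is_empty() {
        None
    } else {
        Some(ext)
    }
}

/// The v42 condition as quoted in this thread.
fn single_file_v42(path: &str) -> bool {
    !is_collection(path)
}

/// The v43 condition as quoted in this thread.
fn single_file_v43(path: &str) -> bool {
    !is_collection(path) && file_extension(path).is_some()
}
```

Under these assumptions, an extensionless path such as `data` counts as a single file with the v42 condition but not with the v43 one, which matches the behavior change reported in this issue.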
It seems that the previous PR intentionally removed the |
I feel like I reviewed a PR recently related to this issue but could not find it. I wonder if it is still valid.
|
Describe the bug
Consider a snippet like this:
Before v43 this would write a single file called data, but in v43 this creates data as a directory with one or more randomly named files in it.
This seems to be related to #13079 (cc @dhegberg), which added an extension-based heuristic.
I see this as a regression, as single file output is requested explicitly, and I don't want a heuristic to be applied.
We are using Parquet files with a content-addressable file system and our files don't have extensions.
To Reproduce
See above
Expected behavior
Considering the introduction of the extension-based heuristic I would suggest the following behavior:
- with_single_file_output is not called (single_file_output == None) - apply the heuristic
- with_single_file_output(true) - produce a single file at the exact path specified
- with_single_file_output(false) - create a directory under the specified path if it doesn't exist and write one or many files
Additional context
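The three cases proposed under "Expected behavior" can be sketched as a small resolution function. This is only an illustration: `OutputMode`, `resolve_output_mode`, and `heuristic_single_file` are hypothetical names, and the heuristic here is a simplified approximation (no trailing slash and a file extension present) rather than DataFusion's actual code.

```rust
use std::path::Path;

/// Hypothetical result type: what the writer should produce.
#[derive(Debug, PartialEq)]
enum OutputMode {
    SingleFile,
    Directory,
}

/// Simplified stand-in for the extension-based heuristic: treat the path as a
/// single file only if it has no trailing slash and has a file extension.
fn heuristic_single_file(path: &str) -> bool {
    !path.ends_with('/') && Path::new(path).extension().is_some()
}

/// Resolve the output mode from an optional explicit setting.
/// An explicit `Some(..)` always wins; the heuristic applies only to `None`.
fn resolve_output_mode(single_file_output: Option<bool>, path: &str) -> OutputMode {
    match single_file_output {
        Some(true) => OutputMode::SingleFile,
        Some(false) => OutputMode::Directory,
        None => {
            if heuristic_single_file(path) {
                OutputMode::SingleFile
            } else {
                OutputMode::Directory
            }
        }
    }
}
```

With this shape, an extensionless path like `data` resolves to a single file whenever the caller asks for one explicitly, and the heuristic only decides when no preference was given.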