Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: DataFrameWriteOptions::with_single_file_output produces a directory #13323

Open
sergiimk opened this issue Nov 9, 2024 · 6 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers regression Something that used to work no longer does

Comments

@sergiimk
Copy link
Contributor

sergiimk commented Nov 9, 2024

Describe the bug

Consider a snippet like this:

df.write_parquet(
  "dir/data",
  DataFrameWriteOptions::new().with_single_file_output(true),
  None
).await

Before v43 this would write a single file called data, but in v43 this is creating data as a directory with a randomly named file(s) in it.

This seems to be related to #13079 (cc @dhegberg) that added an extension-based heuristic.

I see this as a regression, as single file output is requested explicitly, and I don't want a heuristics to be applied.

We are using Parquet files with a content-addressable file system and our files don't have extensions.

To Reproduce

See above

Expected behavior

Considering the introduction of the extension-based heuristic I would suggest the following behavior:

  • with_single_file_output is not called (single_file_output == None) - apply the heuristic
  • with_single_file_output(true) - produce a single file at the exact path specified
  • with_single_file_output(false) - create directory under specified path if doesn't exist and write one or many files

Additional context

@sergiimk sergiimk added the bug Something isn't working label Nov 9, 2024
@alamb alamb added regression Something that used to work no longer does good first issue Good for newcomers labels Nov 13, 2024
@alamb
Copy link
Contributor

alamb commented Nov 13, 2024

I agree this is a regression. Thank you for the callout @sergiimk

I think this is a pretty good first issue for someone as the description is clear and the need is well defined.

@irenjj
Copy link
Contributor

irenjj commented Nov 13, 2024

take

@irenjj
Copy link
Contributor

irenjj commented Nov 18, 2024

It seems hard to control the behavior of write_parquet by single_file_output(and I've noticed that It's never used), what really controls whether to generate a single file output is determining the suffix(in start_demuxer_task()), there are several methods I can think of to handle this issue:

  1. We can add a suffix like .single to the paths that require generating a single file, and then recognize this suffix in start_demuxer_task().
  2. Give up single_file_output in DataFrameWriteOptions, use FileSinkConfig instead to control single file behavior.

cc @alamb @sergiimk @dhegberg

@sergiimk
Copy link
Contributor Author

sergiimk commented Nov 19, 2024

Did some digging and found this old PR #9041 (cc @yyy1000) that seems to have removed single_file_output flag from FileSinkConfig - worth looking into it to understand the reasoning, not to undo the changes.

Looking at v42 code it does indeed seem that DataFrameWriteOptions::single_file_output is not read anywhere and the logic relies on just this condition in demux:

let single_file_output = !base_output_path.is_collection();

which in v43 became:

let single_file_output = !base_output_path.is_collection() && base_output_path.file_extension().is_some();

The DataFrameWriteOptions::new().with_single_file_output(true) is used in a bunch of tests though, so it's just a lucky coincidence that all tests give file a proper test.parquet name and not just test.

Personally I think that all kinds of extension-based heuristics don't belong in such low level code like start_demuxer_task and perhaps better left at the DataFrame level.

Whichever heuristic version (pre v36, pre v43, or post v43) is the right one - I don't really mind, but I think there should be a way to skip it and specify explicitly.

@irenjj
Copy link
Contributor

irenjj commented Nov 19, 2024

It seems that the previous PR intentionally removed the single_file_output option, but later introduced a heuristic.

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

I feel like i reviewed a PR recently related to this issue but could not find it. I wonder if it is still valid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers regression Something that used to work no longer does
Projects
None yet
Development

No branches or pull requests

3 participants