-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use single file write when an extension is present in the path. #13079
Conversation
Thanks @dhegberg -- I plan to review this later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dhegberg -- this is a really nice PR -- I think the code and tests are well written.
Thank you 🙏
cc @progval
I also tried it locally with datafusion-cli and it works as expected 👌
> copy (values (1), (2)) to '/tmp/foo' STORED AS parquet;
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.030 seconds.
>
\q
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$ ls -ltr /tmp/foo
total 8
-rw-r--r--@ 1 andrewlamb wheel 342B Nov 1 12:31 MrzgxU8HT1fn3wTB_0.parquet
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$
@@ -493,4 +506,54 @@ mod tests { | |||
"path not ends with / - fragment ends with / - not collection", | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_file_extension() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice tests 👏
@@ -2338,6 +2298,140 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice too
Here is a small follow on PR to add some more docs #13216 (really get the great writeup you did on this PR into the code) |
I suspect this introduced a regression - would appreciate your opinion on #13323 |
Which issue does this PR close?
Closes #9684.
Rationale for this change
Dataframe's
write_parquet()
was identified as incorrectly identifying paths without an extension as a single file output.This change updates
start_demuxer_task
to respect the suggested behaviour:What changes are included in this PR?
file_extension()
toListingTableUrl
to return an Optional extensionstart_demuxer_task()
to require the presence of an extension from theListingTableUrl
to setsingle_file_output
to truefile_extension
todefault_extension
to indicate usage will be ignored ifsingle_file_output
is triggered.Are these changes tested?
file_extension()
Dataframe.write_parquet()
for paths with and without extensions.start_demuxer_task
since there was no direct testing originally. I can revise and test this directly if preferred.Testing via
cargo test -- --test-threads=1
Are there any user-facing changes?