Use single file write when an extension is present in the path. #13079

Merged
merged 3 commits into apache:main from write_files_when_extension
Nov 1, 2024

Conversation

dhegberg
Contributor

Which issue does this PR close?

Closes #9684.

Rationale for this change

DataFrame's write_parquet() was found to incorrectly treat paths without an extension as single-file output.

This change updates start_demuxer_task to respect the suggested behaviour:

    tmp/dataset/ -> is a folder since it ends in /
    tmp/dataset -> is still a folder since it does not end in / but has no valid file extension
    tmp/file.parquet -> is a file since it does not end in / and has a valid file extension .parquet
    tmp/file.parquet/ -> is a folder since it ends in /
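As a rough illustration of the four rules above, here is a minimal standalone sketch. The helper name is_single_file_output is hypothetical and is not part of DataFusion; it assumes that "valid file extension" simply means the last path segment has a non-empty extension. The trailing-slash check comes first because std::path::Path ignores a trailing separator when computing the extension.

```rust
use std::path::Path;

/// Hypothetical helper (not DataFusion's API): returns true when a path
/// should be treated as a single output file rather than a folder.
fn is_single_file_output(path: &str) -> bool {
    // Rule 1 and rule 4: anything ending in '/' is a folder,
    // even if the last segment looks like "file.parquet".
    if path.ends_with('/') {
        return false;
    }
    // Rule 2 and rule 3: without a trailing slash, the path is a file
    // only if its last segment carries an extension.
    Path::new(path).extension().is_some()
}
```

Under these assumptions, tmp/dataset/, tmp/dataset, and tmp/file.parquet/ are classified as folders, and only tmp/file.parquet is classified as a file.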

What changes are included in this PR?

  • Add file_extension() to ListingTableUrl to return an optional extension
  • Update start_demuxer_task() to require the presence of an extension from the ListingTableUrl to set single_file_output to true
  • Rename file_extension to default_extension to indicate that it is ignored when single_file_output is triggered.
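To show how these pieces might fit together, here is a simplified sketch. TableUrl and part_path are hypothetical stand-ins, not the actual ListingTableUrl or start_demuxer_task code, and the write_id and part parameters are assumptions used only to illustrate how a generated file name could be built when the output is a folder.

```rust
/// Hypothetical stand-in for a table URL; not DataFusion's `ListingTableUrl`.
struct TableUrl {
    path: String,
}

impl TableUrl {
    /// Returns the extension of the last path segment, if there is one.
    fn file_extension(&self) -> Option<&str> {
        // A trailing '/' leaves an empty last segment, which has no extension.
        let last = self.path.rsplit('/').next()?;
        let (stem, ext) = last.rsplit_once('.')?;
        if stem.is_empty() || ext.is_empty() {
            None
        } else {
            Some(ext)
        }
    }
}

/// Chooses the output path for one written part (illustrative only).
fn part_path(url: &TableUrl, default_extension: &str, write_id: &str, part: usize) -> String {
    if url.file_extension().is_some() {
        // Single-file output: write to the path as given;
        // `default_extension` is ignored in this branch.
        url.path.clone()
    } else {
        // Folder output: generate a file name inside the folder
        // and apply `default_extension` to it.
        let base = url.path.trim_end_matches('/');
        format!("{base}/{write_id}_{part}.{default_extension}")
    }
}
```

The point of the sketch is the branch: the extension taken from the URL decides between single-file and folder output, and default_extension only matters in the folder case.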

Are these changes tested?

  • Unit tests added for file_extension()
  • Unit tests added for DataFrame.write_parquet() for paths with and without extensions.
  • No direct testing for start_demuxer_task since there was no direct testing originally. I can revise and test this directly if preferred.

Tested with cargo test -- --test-threads=1

Are there any user-facing changes?

  • Yes, the file output write behaviour is changing.

@github-actions github-actions bot added the core Core DataFusion crate label Oct 23, 2024
@alamb
Contributor

alamb commented Oct 30, 2024

Thanks @dhegberg -- I plan to review this later today

@alamb alamb left a comment

Thank you @dhegberg -- this is a really nice PR -- I think the code and tests are well written.

Thank you 🙏

cc @progval

I also tried it locally with datafusion-cli and it works as expected 👌

> copy (values (1), (2)) to '/tmp/foo' STORED AS parquet;
+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.030 seconds.

>
\q
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$ ls -ltr /tmp/foo
total 8
-rw-r--r--@ 1 andrewlamb  wheel   342B Nov  1 12:31 MrzgxU8HT1fn3wTB_0.parquet
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/datafusion-cli$

@@ -493,4 +506,54 @@ mod tests {
"path not ends with / - fragment ends with / - not collection",
);
}

#[test]
fn test_file_extension() {

Really nice tests 👏

@@ -2338,6 +2298,140 @@ mod tests {
Ok(())
}

#[tokio::test]

This is really nice too

@alamb alamb merged commit 87f0838 into apache:main Nov 1, 2024
24 checks passed
@alamb
Contributor

alamb commented Nov 1, 2024

Here is a small follow-on PR to add some more docs #13216 (really just to get the great writeup you did on this PR into the code)

@sergiimk
Contributor

sergiimk commented Nov 9, 2024

I suspect this introduced a regression - would appreciate your opinion on #13323

@dhegberg dhegberg deleted the write_files_when_extension branch December 14, 2024 22:50
Successfully merging this pull request may close these issues.

to_parquet with path not ending in a slash writes to a file instead of a directory since v36