fix: repartitioned reads of CSV with custom line terminator #13677

Merged
korowa merged 1 commit into apache:main from fix-custom-term-csv-repart on Dec 7, 2024

Conversation

korowa
Contributor

@korowa korowa commented Dec 6, 2024

Which issue does this PR close?

Closes #12328.

Rationale for this change

At the moment, DataFusion cannot correctly identify byte ranges after file repartitioning for CSV files with a custom line terminator: the bound calculation function treats only \n as a line separator.
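To illustrate the problem, here is a minimal sketch of a boundary search that takes the terminator as a parameter. The function name `next_line_start` and its signature are hypothetical and are not taken from the DataFusion code base; the sketch only shows why a search hardcoded to \n finds no boundary in a CR-terminated file.

```rust
/// Returns the offset just past the first `terminator` byte found at or
/// after `from`, or `None` if the rest of the buffer contains no terminator.
/// (Hypothetical helper for illustration, not DataFusion's actual function.)
fn next_line_start(data: &[u8], from: usize, terminator: u8) -> Option<usize> {
    data.get(from..)? // `None` if `from` is past the end of the buffer
        .iter()
        .position(|&b| b == terminator)
        .map(|pos| from + pos + 1)
}
```

For a buffer such as `b"a,1\rb,2\rc,3\r"`, searching for `b'\n'` returns `None` from any offset, while searching for `b'\r'` returns the start of the next record.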

The test failures are caused by target_partitions = 4 (the default for the DataFusion sqllogictest runner) combined with set repartition_file_min_size = 1, which is executed in csv_files.slt before the failing test. Together these two settings trigger scan repartitioning, which is not supported for files with a custom line terminator.

What changes are included in this PR?

The file scan repartitioning functions used for plain-text files (CSV and NDJSON, although the JSON reader has no line separator option in its API) are now aware of the line terminator character.
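The idea can be sketched as follows, working on an in-memory buffer rather than an object store. All names here (`align_to_line_start`, `line_aligned_ranges`) and the `Option<u8>` terminator parameter are assumptions made for illustration; the actual DataFusion functions and their signatures may differ.

```rust
use std::ops::Range;

/// Moves `offset` forward to the start of the next line, unless it already
/// sits on a line start. Returns `data.len()` if no terminator follows.
fn align_to_line_start(data: &[u8], offset: usize, terminator: u8) -> usize {
    if offset == 0 || offset >= data.len() {
        return offset.min(data.len());
    }
    // Scan from one byte earlier so that an offset directly after a
    // terminator is left unchanged.
    match data[offset - 1..].iter().position(|&b| b == terminator) {
        Some(pos) => offset + pos,
        None => data.len(),
    }
}

/// Splits `data` into at most `n` byte ranges of roughly equal size and
/// aligns both ends of every range to line boundaries. `None` falls back
/// to `\n` as the terminator.
fn line_aligned_ranges(data: &[u8], n: usize, terminator: Option<u8>) -> Vec<Range<usize>> {
    let terminator = terminator.unwrap_or(b'\n');
    let n = n.max(1);
    let chunk = (data.len() + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| {
            let start = align_to_line_start(data, (i * chunk).min(data.len()), terminator);
            let end = align_to_line_start(data, ((i + 1) * chunk).min(data.len()), terminator);
            start..end
        })
        .filter(|r| !r.is_empty()) // drop partitions that contain no complete line
        .collect()
}
```

With `Some(b'\r')`, a CR-terminated buffer is split so that each record falls into exactly one range. With `None`, this sketch finds no boundary and collapses to a single range covering the whole buffer; the failure reported for DataFusion was an invalid range request instead, so the sketch shows the alignment logic only, not the original error path.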

Are these changes tested?

Yes, with sqllogictests covering both single-threaded and repartitioned reads of CSV files with a custom line separator.

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 6, 2024
@korowa korowa force-pushed the fix-custom-term-csv-repart branch from 6a7c341 to 94c41be Compare December 6, 2024 18:30
Contributor

@alamb alamb left a comment

Thank you @korowa -- this makes sense to me

I also ran the tests in this PR without the corresponding code change and verified that the test fails (and thus covers the feature):

Running "csv_files.slt"
External error: query failed: DataFusion error: Object Store error: Generic LocalFileSystem error: Requested range was invalid
[SQL] select * from stored_table_with_cr_terminator order by col1;
at test_files/csv_files.slt:369

Error: Execution("1 failures")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`

Caused by:
  process didn't exit successfully: `/Users/andrewlamb/Software/datafusion/target/debug/deps/sqllogictests-7c41483884b46406 csv_file` (exit status: 1)

@korowa korowa merged commit 1b4c0a4 into apache:main Dec 7, 2024
25 checks passed
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 11, 2024
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 15, 2024
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

The file with non-standard newline character can't be read when sqllogictests testing
2 participants