[BUG] Invalid UTF8 can be read from parquet (and probably other formats) #12177

Open
revans2 opened this issue Feb 19, 2025 · 1 comment
Labels: ? - Needs Triage, bug

revans2 commented Feb 19, 2025

Describe the bug
A customer recently hit a hang after reading a parquet file that contained invalid UTF-8 byte sequences and then processing those strings with a regular expression. CUDF is looking into the hang itself, but we should be validating/normalizing UTF-8 input from parquet files, and possibly other formats, in a way that is compatible with what Java/Spark does. Java finds invalid bytes and replaces them with U+FFFD. Technically, old IBM JDKs behaved differently from Sun JDKs, but for any JVM that can run Spark this is the behavior we should expect to see.
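
For reference, the JVM behavior we would be matching can be seen with a plain `String` decode. This is only a minimal sketch, assuming a modern JDK; the class and method names are made up for illustration and are not part of spark-rapids or cudf.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf8ReplacementDemo {
    // `new String(bytes, UTF_8)` does not throw on malformed input;
    // it substitutes U+FFFD (the replacement character) instead.
    static String decodeLenient(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

For example, the bytes `'a', 0xFF, 'b'` decode to `"a\uFFFDb"`, because `0xFF` can never appear in well-formed UTF-8.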

We probably want to write two custom kernels so this is as fast as possible. Because invalid UTF-8 should be rare, the first should be a simple kernel that does the validation in a byte-parallel way and just flags whether any string in the input was invalid. The fixup kernel can be less concerned about performance, and could possibly even run on the CPU. Either way, we don't want to copy the data for nested types unless we know that the input data needs to be changed in some way.
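
To make the two-pass idea concrete, here is a rough CPU-only sketch in Java. It is not the proposed kernels: the check is sequential rather than byte parallel, and the layout (one `chars` byte buffer plus `n + 1` offsets) is an assumption modeled on a strings column. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names exist in spark-rapids or cudf.
final class Utf8ColumnSketch {
    // Pass 1: cheap check that only reports whether ANY string is malformed.
    // Strings are validated one at a time so that a multi-byte sequence
    // cannot be "completed" by bytes belonging to the next string.
    static boolean anyInvalid(byte[] chars, int[] offsets) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        for (int i = 0; i + 1 < offsets.length; i++) {
            int len = offsets[i + 1] - offsets[i];
            try {
                strict.decode(ByteBuffer.wrap(chars, offsets[i], len));
            } catch (CharacterCodingException e) {
                return true;
            }
        }
        return false;
    }

    // Pass 2: runs only when pass 1 flagged something. When the input is
    // clean, the original array is returned without copying the data.
    // Assumes offsets has n + 1 entries and newOffsets has the same length.
    static byte[] normalize(byte[] chars, int[] offsets, int[] newOffsets) {
        if (!anyInvalid(chars, offsets)) {
            System.arraycopy(offsets, 0, newOffsets, 0, offsets.length);
            return chars;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(chars.length);
        newOffsets[0] = 0;
        for (int i = 0; i + 1 < offsets.length; i++) {
            int len = offsets[i + 1] - offsets[i];
            // Lenient decode substitutes U+FFFD; re-encoding yields valid UTF-8.
            byte[] fixed = new String(chars, offsets[i], len, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
            out.write(fixed, 0, fixed.length);
            newOffsets[i + 1] = out.size();
        }
        return out.toByteArray();
    }
}
```

Note that the fixup can change string lengths (a single bad byte becomes the three-byte encoding of U+FFFD), so the offsets have to be rebuilt along with the data. That is one more reason to skip the second pass entirely when nothing was flagged.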

revans2 added the ? - Needs Triage and bug labels Feb 19, 2025
sameerz commented Feb 19, 2025

Fix for the hang in cudf: rapidsai/cudf#18039
