Proposal: Add unsafe option to disable UTF8 validation on parquet read #6701

Open
alamb opened this issue Nov 7, 2024 · 1 comment
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments


alamb commented Nov 7, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is a pretty common use case for systems (e.g. InfluxDB 3.0) to ingest data, write it as parquet files, and then only read parquet files that we wrote (and thus are well formed).

Doing UTF8 validation in the parquet reader takes a significant amount of time for string data, likely leading to proposals to improve it like

However, if the use case is to speed up reading parquet data that is "trusted", maybe validating at all is unnecessary.

Describe the solution you'd like
Add an unsafe parquet API to disable UTF8 validation during read for users
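
To make the idea concrete, here is a minimal sketch of what an opt-in, `unsafe` switch could look like. It uses only the Rust standard library: `StringDecoder`, `Utf8Validation`, and `with_skip_validation` are hypothetical names made up for illustration and are not part of the parquet crate. The only real APIs used are `std::str::from_utf8` and `std::str::from_utf8_unchecked`.

```rust
/// Hypothetical setting (not a real parquet crate type): whether string
/// bytes are checked for UTF-8 validity when they are decoded.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Utf8Validation {
    /// Validate every value (the safe default).
    Checked,
    /// Trust the writer and skip validation.
    Skip,
}

/// Hypothetical stand-in for the part of a reader that turns raw bytes
/// from a string column into `&str`.
pub struct StringDecoder {
    validation: Utf8Validation,
}

impl StringDecoder {
    /// Create a decoder that validates UTF-8, as readers do by default.
    pub fn new() -> Self {
        Self { validation: Utf8Validation::Checked }
    }

    /// Disable UTF-8 validation.
    ///
    /// # Safety
    /// The caller must guarantee that every byte slice later passed to
    /// `decode` is valid UTF-8 (for example, because the same system
    /// wrote the file). Otherwise `decode` produces an invalid `&str`,
    /// which is undefined behavior.
    pub unsafe fn with_skip_validation(mut self) -> Self {
        self.validation = Utf8Validation::Skip;
        self
    }

    /// Decode one value, validating only if validation is enabled.
    pub fn decode<'a>(&self, bytes: &'a [u8]) -> Result<&'a str, std::str::Utf8Error> {
        match self.validation {
            Utf8Validation::Checked => std::str::from_utf8(bytes),
            // SAFETY: upheld by the contract of `with_skip_validation`.
            Utf8Validation::Skip => Ok(unsafe { std::str::from_utf8_unchecked(bytes) }),
        }
    }
}
```

Marking the opt-out method `unsafe` (rather than `decode`) keeps the obligation in one place: a caller writes `unsafe { StringDecoder::new().with_skip_validation() }` once, at the point where they assert the data is trusted, and the default path stays fully checked. How this would actually be wired into the reader is left open here.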

Describe alternatives you've considered

Additional context

alamb added the enhancement label Nov 7, 2024

alamb commented Nov 7, 2024

We obviously wouldn't use this in DataFusion for ClickBench (as it would basically be cheating), but we might very well do so internally in InfluxDB 3.0 or other places where re-validating the same UTF8 data we already wrote consumes CPU unnecessarily.
