Support data source sampling with TABLESAMPLE #13563
Comments
I looked around in sqlparser-rs briefly and it seems this syntax is not yet supported (though the keyword is): https://github.com/search?q=repo%3Aapache%2Fdatafusion-sqlparser-rs%20tablesample&type=code
I suggest adding support for this clause in sqlparser first, and then we can add a rewrite pass that converts the
datafusion/datafusion/sql/src/statement.rs Lines 1068 to 1089 in 000288c
Thank you for the initial analysis. I will first submit a PR to datafusion-sqlparser-rs with extended grammar.
take
Is your feature request related to a problem or challenge?
It is helpful to have sampling support for queries to ease the exploration of data.
Describe the solution you'd like
It should be supported at the SQL level (SAMPLE or TABLESAMPLE syntax). The sampling construct should be passed to the table source so that the sampling is performed in the scan plan (e.g. in an optimised parquet reader).
This feature could be implemented in three sequential stages:
WHERE RANDOM() < P filter
Describe alternatives you've considered
It is possible to use a WHERE RANDOM() < 0.1 selection (see discussion #13268), but dedicated support in SQL is clearer.
Existing query engines and databases already implement sampling, but it is not in the ANSI standard. There are different flavours, but essentially they allow for specific sampling methods and percentages (or sometimes a number of rows):
TABLESAMPLE [SYSTEM | BERNOULLI] (PERCENTAGE | ROWS)
DuckDB:
PostgreSQL and Trino:
Spark:
ClickHouse is different:
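To make the difference between the two sampling methods in the generic TABLESAMPLE form above more concrete, here is a minimal Python sketch. It is not DataFusion code and does not reflect how any of the engines listed implement sampling. The function names, the `block_size` parameter, and the idea that SYSTEM keeps or drops fixed-size blocks of rows are assumptions made for illustration only. BERNOULLI is modelled as the row-level equivalent of a WHERE RANDOM() < P filter.

```python
import random


def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`.

    Row-level analogue of a WHERE RANDOM() < P filter (hypothetical helper).
    """
    return [row for row in rows if rng.random() < fraction]


def system_sample(rows, fraction, block_size, rng):
    """Keep or drop whole blocks of `block_size` consecutive rows.

    Each block is kept with probability `fraction`. The fixed-size block is an
    assumption standing in for a page, row group, or file (hypothetical helper).
    """
    sampled = []
    for start in range(0, len(rows), block_size):
        if rng.random() < fraction:
            sampled.extend(rows[start:start + block_size])
    return sampled


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs give the same sample
    rows = list(range(1000))
    print(len(bernoulli_sample(rows, 0.1, rng)))
    print(len(system_sample(rows, 0.1, 100, rng)))
```

In this sketch the block-level variant only draws one random number per block, which is why this style of sampling can be cheaper when pushed down to a scan, at the cost of a more clustered sample.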
Additional context
Also requested in #11554. The filter for sampling was refined in #13268.