Filters on `RANDOM()` are applied incorrectly when `pushdown_filters` is enabled. #13268
Comments
random() (and other volatile functions) shouldn't be handed over to TableProvider as a filter, because it's unnecessarily complicated to do the correct thing with them.
I agree something is wrong with volatile expression pushdown -- thank you for the report @adamfaulkner-at
A similar issue with volatile functions was tackled in #13128. I am trying to investigate more.
take
This issue is fixed by no longer pushing down volatile filters, so the sampling is achieved by a manual
@findebi, regarding the sampling support, it would be great to have it supported for select statements.
I think it makes sense to add TABLESAMPLE. @theirix would you want to create an issue about this?
Sure, submitted #13563 for further discussion and implementation plans.
Describe the bug
When running a query like
I get different results depending on the value of `"datafusion.execution.parquet.pushdown_filters"`. When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.
I suspect I'm seeing these results because pushdown with `TableProviderFilterPushDown::Inexact` is applying this filter at both the parquet level and a `FilterExec: random() <= 0.1`. This results in the `RANDOM()` filter being evaluated twice, which causes fewer rows to be sampled.
To Reproduce
This can be reproduced with `datafusion-cli` version 42.2.0:
Without `pushdown_filters`
With `pushdown_filters` (note that you must re-create the table with the updated setting):
Expected behavior
I would expect that a filter on `RANDOM()` would be applied only once, so that `RANDOM() < 0.1` means that only 10% of all rows will be sampled.
It would be acceptable if `RANDOM()` was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.
It feels like the "right" solution is to somehow guarantee that `RANDOM()` always returns the same value for a given row and query evaluation, perhaps by "caching" its values.
Additional context
In my custom TableProvider, I tried using `TableProviderFilterPushDown::Exact` for these filters, and I get the results that I expect. However, it seems that this is only because my filter is really simple.
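To illustrate the arithmetic behind the report, here is a minimal, dependency-free Rust sketch. It is not DataFusion code: every function name below is hypothetical, and a small SplitMix64 generator stands in for `RANDOM()`. It compares two cases, each with the predicate checked twice per row (once at the scan, once in a `FilterExec`-like step): a volatile draw that changes on every call, and a value derived only from an assumed per-query seed and row id, which is one possible reading of the "caching" idea above.

```rust
/// SplitMix64 step: a small pseudo-random generator using only the standard library.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// Map a 64-bit integer to a float in [0, 1).
fn to_unit(x: u64) -> f64 {
    (x >> 11) as f64 / (1u64 << 53) as f64
}

/// Stand-in for a volatile RANDOM(): a fresh value on every call.
fn volatile_random(state: &mut u64) -> f64 {
    to_unit(splitmix64(state))
}

/// Hypothetical "cached" variant: the value depends only on (query_seed, row_id),
/// so evaluating it twice for the same row yields the same number.
fn per_row_random(query_seed: u64, row_id: u64) -> f64 {
    let mut s = query_seed ^ row_id.wrapping_mul(0xD6E8FEB86659FD93);
    to_unit(splitmix64(&mut s))
}

/// Fraction of rows kept when the volatile predicate is evaluated twice per row.
fn fraction_double_volatile(rows: u64, p: f64, seed: u64) -> f64 {
    let mut state = seed;
    let mut kept = 0u64;
    for _ in 0..rows {
        let passes_scan = volatile_random(&mut state) < p; // pushed-down filter
        let passes_filter = volatile_random(&mut state) < p; // second, independent draw
        if passes_scan && passes_filter {
            kept += 1;
        }
    }
    kept as f64 / rows as f64
}

/// Fraction of rows kept when the per-row value is evaluated twice per row.
fn fraction_double_per_row(rows: u64, p: f64, seed: u64) -> f64 {
    let kept = (0..rows)
        .filter(|&row| per_row_random(seed, row) < p && per_row_random(seed, row) < p)
        .count();
    kept as f64 / rows as f64
}

fn main() {
    let (rows, p, seed) = (1_000_000, 0.1, 42);
    println!("volatile, evaluated twice:      {:.4}", fraction_double_volatile(rows, p, seed));
    println!("per-row value, evaluated twice: {:.4}", fraction_double_per_row(rows, p, seed));
}
```

With two independent draws a row survives with probability p × p, so the first fraction should land near 0.01 for p = 0.1, matching the roughly 1% seen with pushdown enabled. The per-row variant should stay near 0.10 because the second check repeats the first. Not pushing volatile filters down at all avoids the double evaluation without needing any such scheme.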