Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filters on RANDOM() are applied incorrectly when pushdown_filters is enabled. #13268

Closed
adamfaulkner-at opened this issue Nov 6, 2024 · 7 comments · Fixed by #13475
Closed
Assignees
Labels
bug Something isn't working

Comments

@adamfaulkner-at
Copy link

Describe the bug

When running a query like

SELECT * FROM table WHERE RANDOM() < 0.1;

I get different results depending on the value of "datafusion.execution.parquet.pushdown_filters". When this setting is turned off, I get the results I expect, roughly 10% of the rows in the table. When it is turned on, I think I'm seeing 1% of the rows in the table.

I suspect I'm seeing these results because pushdown with TableProviderFilterPushDown::Inexact is applying this filter at both the parquet level and a FilterExec: random() <= 0.1. This results in the RANDOM() filter being evaluated twice, which causes fewer rows to be sampled.

To Reproduce

This can be reproduced with datafusion-cli version 42.2.0:

Without pushdown_filters

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 605572   |
+----------+
1 row(s) fetched.
Elapsed 0.043 seconds.

With pushdown_filters (note that you must re-create the table with the updated setting):

> set datafusion.execution.parquet.pushdown_filters=true;
0 row(s) fetched.
Elapsed 0.002 seconds.

> create external table data stored as parquet location '/Users/adam.faulkner/Downloads/parquet_data/';
0 row(s) fetched.
Elapsed 0.007 seconds.

> select COUNT(*) from data WHERE RANDOM() < 0.1;
+----------+
| count(*) |
+----------+
| 60152    |
+----------+
1 row(s) fetched.
Elapsed 0.045 seconds.

Expected behavior

I would expect that a filter on RANDOM() would be applied only once, so that RANDOM() < 0.1 means that only 10% of all rows will be sampled.

It would be acceptable if RANDOM() was no longer eligible for pushdown, though I suspect this leaves a negligible amount of performance on the table compared to the alternative.

It feels like the "right" solution is to somehow guarantee that RANDOM() always returns the same value for a given row and query evaluation, perhaps by "caching" its values.

Additional context

In my custom TableProvider, I tried using ``TableProviderFilterPushDown::Exact` for these filters, and I get the results that I expect. However, it seems that this is only because my filter is really simple.

@adamfaulkner-at adamfaulkner-at added the bug Something isn't working label Nov 6, 2024
@findepi
Copy link
Member

findepi commented Nov 6, 2024

random() (and other violatile functions) shouldn't be handed over to TableProvider as a filter, because it's unnecessarily complicated to do a correct thing with them.
We might need to replace random() < f with a tablesample.

@alamb
Copy link
Contributor

alamb commented Nov 6, 2024

I agree something is wrong with volatile expression pushdown -- thank you for the report @adamfaulkner-at

@theirix
Copy link
Contributor

theirix commented Nov 9, 2024

A similar issue with volatile functions was tackled in #13128. I am trying to investigate more.

@theirix
Copy link
Contributor

theirix commented Nov 9, 2024

take

@theirix
Copy link
Contributor

theirix commented Nov 24, 2024

This issue is fixed by avoiding pushing down volatile filters, so the sampling is achieved by a manual random() < 0.1 expression.

@findebi, regarding the sampling support, it would be great to have it supported for select statements.
DuckDB and PostgreSQL support it in the form of a SELECT * FROM tbl TABLESAMPLE SYSTEM (10 PERCENT), and Clickhouse with SELECT * FROM tbl SAMPLE 0.1. DataFusion could support it by using sampling directly in the parquet reader. I've only seen a single mention of this functionality in #11554. What do you think?

@findepi
Copy link
Member

findepi commented Nov 25, 2024

I think it makes sense to add TABLESAMPLE. @theirix would you want to create an issue about this?
fwiw here is Trino docs on the topic: https://trino.io/docs/current/sql/select.html#tablesample

@theirix
Copy link
Contributor

theirix commented Nov 25, 2024

I think it makes sense to add TABLESAMPLE. @theirix would you want to create an issue about this? fwiw here is Trino docs on the topic: https://trino.io/docs/current/sql/select.html#tablesample

Sure, submitted #13563 for the following discussion and implementation plans

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants