Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further to the performance discussion @alamb - the StringBuilder pattern you suggested in https://github.com/apache/datafusion/pull/11136#discussion_r1657725214 does seem to materially improve performance: #11279

Closed
alamb opened this issue Jul 5, 2024 · 1 comment

Comments

@alamb
Copy link
Contributor

alamb commented Jul 5, 2024

          Further to the performance discussion @alamb - the StringBuilder pattern you suggested in https://github.com/apache/datafusion/pull/11136#discussion_r1657725214 does seem to materially improve performance:
Extract data page statistics for String/extract_statistics/String
                        time:   [15.368 µs 15.405 µs 15.446 µs]
                        change: [-68.672% -68.540% -68.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

So seems like a worthwhile thing to go ahead with? I think there are several places where we can do something similar.

One question - I notice in that ticket that you appended nulls for missing values. However, I think in most cases, missing values are simply omitted because all the None values are removed by flattening. So, in general, users of the data page statistics will need to check whether or not the length of the array matches the number of actual data pages? This is different from how the row group statistics are handled - they will instead have a null value for any missing statistics.

Is this difference in behaviour expected or just a side effect of the implementation.

Originally posted by @efredine in #10922 (comment)

@alamb
Copy link
Contributor Author

alamb commented Jul 5, 2024

Dupe of #11280

@alamb alamb closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant