
Update ClickBench queries to avoid to_timestamp_seconds #15465

Closed
alamb opened this issue Mar 27, 2025 · 6 comments · Fixed by #15475
Labels: enhancement (New feature or request)

Comments

alamb (Contributor) commented Mar 27, 2025

Is your feature request related to a problem or challenge?

For some reason the DataFusion version of the ClickBench queries uses the to_timestamp_seconds function:

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

However, that function performs timestamp validation, which can slow down queries and prevent other optimizations (for example, what @adriangb is doing in #15301).

I checked, and DuckDB simply uses "EventTime":

https://github.com/ClickHouse/ClickBench/blob/bdc6e32589c2785a66ccee98904a322c5e5d3f50/duckdb/queries.sql#L24C1-L25C1

As does ClickHouse:
https://github.com/ClickHouse/ClickBench/blob/bdc6e32589c2785a66ccee98904a322c5e5d3f50/clickhouse/queries.sql#L24

Describe the solution you'd like

Ideally, the queries would be updated so they do not use to_timestamp_seconds. For example:

SELECT * FROM 'hits.parquet' WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10

Describe alternatives you've considered

  1. Update queries in this repo
  2. Update the queries in the clickbench repository as well: https://github.com/ClickHouse/ClickBench/blob/bdc6e32589c2785a66ccee98904a322c5e5d3f50/datafusion/queries.sql

Additional context

No response

acking-you (Contributor) commented:

> However that function does timestamp validation and potentially slows down queries and prevents other optimizations (for example what @adriangb is doing in #15301)

I'm very curious about why this optimization is being blocked.

acking-you (Contributor) commented:

take

alamb (Contributor, Author) commented Mar 27, 2025

> However that function does timestamp validation and potentially slows down queries and prevents other optimizations (for example what @adriangb is doing in #15301)
>
> I'm very curious about why this optimization is being blocked.

I think it is because we don't have the analysis to translate min(to_timestamp_seconds(column)) into column < [current TopK min-heap value].
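The missing analysis can be sketched in SQL. This is illustrative only: the rewritten query and the numeric heap bound are hypothetical, not something DataFusion actually emits.

```sql
-- ORDER BY on the raw column: the TopK operator's current worst heap value
-- can (conceptually) be pushed down as a pruning predicate on "EventTime":
SELECT * FROM hits
WHERE "URL" LIKE '%google%'
  AND "EventTime" < 1372713600   -- hypothetical current TopK heap bound
ORDER BY "EventTime"
LIMIT 10;

-- With ORDER BY to_timestamp_seconds("EventTime"), the heap bound is a
-- timestamp, and without analysis that inverts the function it cannot be
-- translated back into a predicate on the raw int64 column, so no pruning
-- predicate can be derived.
```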

acking-you (Contributor) commented Mar 28, 2025

> I think it is because we don't have the analysis to translate the min(to_timestamp_seconds(column)) into a column < [current topk min heap value]

  1. So, should this situation be optimized as well?

  2. I looked at the EventTime field in the hits dataset and found nothing special about it other than that it’s of type int64. I’m not quite sure why ORDER BY to_timestamp_seconds("EventTime") is used. I know this function converts the field to the timestamp type, but I don’t understand the significance of doing so.

  3. I also saw other statements containing to_timestamp_seconds("EventTime"). Do those statements need to be modified?

Dandandan (Contributor) commented:

Another discrepancy I found in the queries is the "EventDate"::INT::DATE casting. Is this something we could remove as well? Maybe it would be good to look at all the other casts applied to the queries and undo them if possible (or file issues when the planner fails to plan them).
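For illustration, the change would look like this (the exact query shape is an assumption based on the ClickBench queries that group by date):

```sql
-- Current form: round-trips the column through INT to obtain a DATE
SELECT "EventDate"::INT::DATE AS day, COUNT(*) AS c
FROM hits GROUP BY day ORDER BY c DESC;

-- Simpler form, if the planner can handle the column type directly:
SELECT "EventDate" AS day, COUNT(*) AS c
FROM hits GROUP BY day ORDER BY c DESC;
```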

alamb (Contributor, Author) commented Mar 31, 2025

> Another discrepancy I found in the queries is the "EventDate"::INT::DATE casting. Is this something we could remove as well? Maybe it would be good to look at all the other casts applied to the queries and undo them if possible (or file issues when the planner fails to plan them).
