-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Performance degradation by clustering/sorting #127
Comments
Hi @rwang-lyra thank you for bringing this to our attention! I reviewed it, and I am wondering if it would be better to use a different column for clustering, such as the date column - git: https://github.com/fivetran/dbt_netsuite.git
revision: bug/cluster-by-performance
warn-unpinned: false The conclusion could very well be that we should remove clustering, but I'd be curious to try this out first. |
Thanks @fivetran-catfritz . It doesn't seems these models fall under the recommendation of clustering by snowflake :(https://docs.snowflake.com/en/user-guide/tables-clustering-keys#considerations-for-choosing-clustering-for-a-table), and also high cardinality of the column could be expensive.
for learnings and validating, I've just tested with the branch above, the runtime is even much longer (40min) with clustering on timestamp field. I wonder if any reasoning behind clustering - perhaps it is not needed for snowflake? |
@rwang-lyra Thanks for the additional information! I was reading through the doc you linked, and it suggested that clustering on a date column might be a good strategy.
In my test branch,
Let me know what your thoughts are, but it could be beneficial for us to go over this on a live call. If you'd like to have a meeting, you can schedule some time with us via this link! Thanks again for looking into this issue with us. |
Hi @rwang-lyra Just a friendly bump on this—whenever you have a chance to take a look, I’d love to hear your thoughts! |
Hi @rwang-lyra I hope you’re doing well. I’m going to mark this ticket as stale for now since it's been a while since your last response. If you are able to revisit this issue, feel free to reply here to re-open the conversation, and we’ll be happy to help! |
Is there an existing issue for this?
Describe the issue
we have observed much longer runtime with "cluster_by = ['transaction_id']," introduced in v0.13.0 release; the code refence is here: https://github.com/fivetran/dbt_netsuite/pull/117/files#diff-144d46b313d4b1851f6b2a20d16c25a6f41758af76cb008d874c1f61530383f3R7 and is with 3 major models.
seems the sorting takes about 40% of runtime and snowflake warns about the high cardinatlity of transcation_id is causing long runtime.
we have tried locally to compare; on balance_sheet model - 25 minutes with current code, clustering enables vs 14 minutes with clustering removed.
Relevant error log or model output
No response
Expected behavior
shorter execution time
dbt Project configurations
dbt-core==1.7.14
dbt-snowflake==1.7.5
Package versions
version: 1.1.1
version: 2.3.0
version: 0.8.0
version: 0.8.7
What database are you using dbt with?
snowflake
dbt Version
dbt-core==1.7.14
dbt-snowflake==1.7.5
Additional Context
No response
Are you willing to open a PR to help address this issue?
The text was updated successfully, but these errors were encountered: