
[Bug] Performance degradation by clustering/sorting #127

Open
rwang-lyra opened this issue Jul 24, 2024 · 5 comments
Labels
error:unforced · status:stale (issue was blocked or had no user response for more than 30 days) · type:bug (something is broken or incorrect)

Comments

@rwang-lyra
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

We have observed much longer runtimes with the "cluster_by = ['transaction_id']" config introduced in the v0.13.0 release; the code reference is here: https://github.com/fivetran/dbt_netsuite/pull/117/files#diff-144d46b313d4b1851f6b2a20d16c25a6f41758af76cb008d874c1f61530383f3R7 and it applies to 3 major models.

It seems the sorting takes about 40% of the runtime, and Snowflake warns that the high cardinality of transaction_id is causing the long runtime.

We have compared locally: on the balance_sheet model, 25 minutes with the current code (clustering enabled) vs. 14 minutes with clustering removed.
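
For context, the config being discussed is of this general shape (a sketch only, not copied from dbt_netsuite; the exact options may differ). As far as I understand, when cluster_by is set, dbt-snowflake adds a "cluster by" clause to the table and also orders the inserted rows by those columns, which is where the extra sort time shows up:

-- Sketch only: a model config of the shape introduced in v0.13.0,
-- clustering on the high-cardinality transaction_id column.
{{
    config(
        materialized='incremental',
        unique_key='transaction_id',
        cluster_by=['transaction_id']
    )
}}

select *
from {{ ref('stg_netsuite__transactions') }}  -- placeholder upstream model name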

Relevant error log or model output

No response

Expected behavior

Shorter execution time.

dbt Project configurations

dbt-core==1.7.14
dbt-snowflake==1.7.5

Package versions

  • package: dbt-labs/dbt_utils
    version: 1.1.1
  • package: brooklyn-data/dbt_artifacts
    version: 2.3.0
  • package: dbt-labs/dbt_project_evaluator
    version: 0.8.0
  • package: dbt-labs/dbt_external_tables
    version: 0.8.7

What database are you using dbt with?

Snowflake

dbt Version

dbt-core==1.7.14
dbt-snowflake==1.7.5

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@fivetran-catfritz
Contributor

Hi @rwang-lyra, thank you for bringing this to our attention! I reviewed it, and I am wondering if it would be better to use a different column for clustering, such as the date column _fivetran_synced_date used in the incremental logic. Would you be able to try out my test branch below to see if it helps your runtimes? You can install it in place of your normal netsuite package in your packages.yml. You'll also probably need to run a full refresh before your next incremental run, since I've changed the cluster-by column.

- git: https://github.com/fivetran/dbt_netsuite.git
  revision: bug/cluster-by-performance
  warn-unpinned: false

The conclusion could very well be that we should remove clustering, but I'd be curious to try this out first.
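
For clarity, the change in that branch boils down to swapping the clustering column, along these lines (a sketch; the actual config in the branch may differ slightly):

-- Sketch of the idea behind the test branch: cluster on the lower-cardinality
-- date column already used by the incremental logic, instead of transaction_id.
-- Config block only; the rest of the model is unchanged.
{{
    config(
        materialized='incremental',
        cluster_by=['_fivetran_synced_date']
    )
}}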

@fivetran-joemarkiewicz fivetran-joemarkiewicz added type:bug Something is broken or incorrect status:scoping Currently being scoped error:unforced labels Jul 25, 2024
@rwang-lyra
Contributor Author

rwang-lyra commented Jul 25, 2024

Thanks @fivetran-catfritz. It doesn't seem these models fall under Snowflake's recommendations for when to cluster a table (https://docs.snowflake.com/en/user-guide/tables-clustering-keys#considerations-for-choosing-clustering-for-a-table), and the high cardinality of the column can also make clustering expensive.

In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive.

The cost of clustering on a unique key might be more than the benefit of clustering on that key, especially if point lookups are not the primary use case for that table.

If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values. The expression should preserve the original ordering of the column so that the minimum and maximum values in each partition still enable pruning.

For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.
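
For illustration, that recommendation would translate into a dbt model config roughly like this (a sketch only; the column is just an example of a high-cardinality timestamp):

-- Illustrative only: cluster on a date cast of a timestamp column to reduce
-- cardinality, per the Snowflake guidance quoted above.
{{
    config(
        materialized='incremental',
        cluster_by=['to_date(_fivetran_synced)']
    )
}}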

For learning and validation, I've just tested with the branch above; the runtime is even longer (40 min) with clustering on the timestamp field.

I wonder what the reasoning behind the clustering is; perhaps it is not needed for Snowflake?

@fivetran-catfritz
Contributor

fivetran-catfritz commented Jul 26, 2024

@rwang-lyra Thanks for the additional information! I was reading through the doc you linked, and it suggested that clustering on a date column might be a good strategy.

For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.

In my test branch, fivetran_synced_date is a date and not a timestamp (here is the line where that is set up), so I'm wondering why it is causing worse performance than the transaction_id, which would have higher cardinality. Some questions I'm thinking of:

  1. When you are doing your dbt runs, are you using a --full-refresh each time or setting the materialization as a table, or do you also allow the incremental strategy we include?
  2. Are the performance issues occurring during a full-refresh/table run and/or during an incremental run? While the performance may be slower on the first run, the hope is that incremental runs are improved, since the clustering only has to be established once.
  3. We've had other customers request clustering for Snowflake in other packages, for example in this issue for our mixpanel package ([Bug] partition_by config parameter doesn't work with Snowflake, dbt_mixpanel#37), which is why we included it in this release. But since mixpanel deals with events rather than transactions, the data volume could be quite different, so I'm curious: approximately how large is your transactions table?

Let me know what your thoughts are, but it could be beneficial for us to go over this on a live call. If you'd like to have a meeting, you can schedule some time with us via this link! Thanks again for looking into this issue with us.

@fivetran-catfritz
Contributor

Hi @rwang-lyra Just a friendly bump on this—whenever you have a chance to take a look, I’d love to hear your thoughts!

@fivetran-catfritz
Contributor

Hi @rwang-lyra I hope you’re doing well. I’m going to mark this ticket as stale for now since it's been a while since your last response. If you are able to revisit this issue, feel free to reply here to re-open the conversation, and we’ll be happy to help!

@fivetran-catfritz fivetran-catfritz added status:stale Issue was blocked or had no user response for more than 30 days and removed status:scoping Currently being scoped labels Aug 26, 2024