[Bug] Performance degradation by clustering/sorting #127

rwang-lyra · 2024-07-24T22:32:24Z

Is there an existing issue for this?

I have searched the existing issues

Describe the issue

we have observed much longer runtime with "cluster_by = ['transaction_id']," introduced in v0.13.0 release; the code refence is here: https://github.com/fivetran/dbt_netsuite/pull/117/files#diff-144d46b313d4b1851f6b2a20d16c25a6f41758af76cb008d874c1f61530383f3R7 and is with 3 major models.

seems the sorting takes about 40% of runtime and snowflake warns about the high cardinatlity of transcation_id is causing long runtime.

we have tried locally to compare; on balance_sheet model - 25 minutes with current code, clustering enables vs 14 minutes with clustering removed.

Relevant error log or model output

No response

Expected behavior

shorter execution time

dbt Project configurations

dbt-core==1.7.14
dbt-snowflake==1.7.5

Package versions

package: dbt-labs/dbt_utils
version: 1.1.1
package: brooklyn-data/dbt_artifacts
version: 2.3.0
package: dbt-labs/dbt_project_evaluator
version: 0.8.0
package: dbt-labs/dbt_external_tables
version: 0.8.7

What database are you using dbt with?

snowflake

dbt Version

dbt-core==1.7.14
dbt-snowflake==1.7.5

Additional Context

No response

Are you willing to open a PR to help address this issue?

Yes.
Yes, but I will need assistance and will schedule time during our office hours for guidance
No.

fivetran-catfritz · 2024-07-25T14:53:08Z

Hi @rwang-lyra thank you for bringing this to our attention! I reviewed it, and I am wondering if it would be better to use a different column for clustering, such as the date column _fivetran_synced_date used in the incremental logic. Would you be able to try out my test branch below if that helps your runtimes? You can install it in place of your normal netsuite package in your packages.yml. You'll also probably need to run a full refresh first before an incremental run since I've changed the cluster by column.

- git: https://github.com/fivetran/dbt_netsuite.git
  revision: bug/cluster-by-performance
  warn-unpinned: false

The conclusion could very well be that we should remove clustering, but I'd be curious to try this out first.

rwang-lyra · 2024-07-25T18:06:54Z

Thanks @fivetran-catfritz . It doesn't seems these models fall under the recommendation of clustering by snowflake :(https://docs.snowflake.com/en/user-guide/tables-clustering-keys#considerations-for-choosing-clustering-for-a-table), and also high cardinality of the column could be expensive.

In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive.

The cost of clustering on a unique key might be more than the benefit of clustering on that key, especially if point lookups are not the primary use case for that table.

If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values. The expression should preserve the original ordering of the column so that the minimum and maximum values in each partition still enable pruning.

For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.

for learnings and validating, I've just tested with the branch above, the runtime is even much longer (40min) with clustering on timestamp field.

I wonder if any reasoning behind clustering - perhaps it is not needed for snowflake?

fivetran-catfritz · 2024-07-26T17:41:47Z

@rwang-lyra Thanks for the additional information! I was reading through the doc you linked, and it suggested that clustering on a date column might be a good strategy.

For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results.

In my test branch, fivetran_synced_date is a date and not a timestamp (here is the line where that is set up), so I'm wondering why it is causing worse performance than the transaction_id, which would have higher cardinality. Some questions I'm thinking of:

When you are doing your dbt runs, are you using a --full-refresh each time or setting the materialization as a table, or do you also allow the incremental strategy we include?
Are the performance issues occurring during a full-refresh/table run and/or during an incremental run? While the performance may be slower on the first run, the hope is that the incremental runs are improved since that clustering only has to be established once.
We've had other customers request clustering for Snowflake in other packages, for example in [this issue] ([Bug] partition_by config parameter doesn't work with Snowflake dbt_mixpanel#37) for our mixpanel package, which is why we included it in this release. But in the case of mixpanel, which is events rather than transactions, I realize it could be a different volume of data we're dealing with, so curious approximately how large is your transactions table?

Let me know what your thoughts are, but it could be beneficial for us to go over this on a live call. If you'd like to have a meeting, you can schedule some time with us via this link! Thanks again for looking into this issue with us.

fivetran-catfritz · 2024-08-09T15:33:09Z

Hi @rwang-lyra Just a friendly bump on this—whenever you have a chance to take a look, I’d love to hear your thoughts!

fivetran-catfritz · 2024-08-26T19:40:33Z

Hi @rwang-lyra I hope you’re doing well. I’m going to mark this ticket as stale for now since it's been a while since your last response. If you are able to revisit this issue, feel free to reply here to re-open the conversation, and we’ll be happy to help!

fivetran-joemarkiewicz added type:bug Something is broken or incorrect status:scoping Currently being scoped error:unforced labels Jul 25, 2024

fivetran-catfritz added status:stale Issue was blocked or had no user response for more than 30 days and removed status:scoping Currently being scoped labels Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Performance degradation by clustering/sorting #127

[Bug] Performance degradation by clustering/sorting #127

rwang-lyra commented Jul 24, 2024

fivetran-catfritz commented Jul 25, 2024

rwang-lyra commented Jul 25, 2024 •

edited

Loading

fivetran-catfritz commented Jul 26, 2024 •

edited

Loading

fivetran-catfritz commented Aug 9, 2024

fivetran-catfritz commented Aug 26, 2024

[Bug] Performance degradation by clustering/sorting #127

[Bug] Performance degradation by clustering/sorting #127

Comments

rwang-lyra commented Jul 24, 2024

Is there an existing issue for this?

Describe the issue

Relevant error log or model output

Expected behavior

dbt Project configurations

Package versions

What database are you using dbt with?

dbt Version

Additional Context

Are you willing to open a PR to help address this issue?

fivetran-catfritz commented Jul 25, 2024

rwang-lyra commented Jul 25, 2024 • edited Loading

fivetran-catfritz commented Jul 26, 2024 • edited Loading

fivetran-catfritz commented Aug 9, 2024

fivetran-catfritz commented Aug 26, 2024

rwang-lyra commented Jul 25, 2024 •

edited

Loading

fivetran-catfritz commented Jul 26, 2024 •

edited

Loading