Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Update partitioning logic to account for source_relation, empty source tables and union data #77

Closed
2 of 4 tasks
fivetran-catfritz opened this issue Feb 7, 2024 · 2 comments · Fixed by #78
Closed
2 of 4 tasks
Assignees
Labels
error:unforced status:in_review Currently in review type:bug Something is broken or incorrect type:enhancement New functionality or enhancement update_type:models Primary focus requires model updates

Comments

@fivetran-catfritz
Copy link
Contributor

fivetran-catfritz commented Feb 7, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

Sub-issue 1: Add source_relation to all partition by clauses

For models stg_shopify__metafield and stg_shopify__abandoned_checkout_discount_code, the column source_relation should be added to the partition by clauses used to determine is_most_recent_record.

Same update also needs to be added to line 5 of int_shopify__customer_email_rollup in the transform.

Below update from @fivetran-avinash
Sub-issue 2: Identify best approach to handle empty source models for Redshift

In Redshift, an empty source model that contains a partitioning clause will fail at the staging layer with the constant expressions are not supported in partition by clauses error. One customer reported this error arising from the stg_shopify__abandoned_checkout_discount_code model. This particularly comes into play when unioning schemas and databases.

There are several approaches that could be taken, each which have pros and cons. See the comments for more details on the options.

The path forward we chose will be hard coding a case when statement for the situation where a table is empty.

Example code below:


  case when checkout_id is null and code is null and index is null
    then row_number() over(partition by source_relation) as index
    else row_number() over(partition by checkout_id, upper(code), source_relation order by index desc)
  end as index

Relevant error log or model output

No response

Expected behavior

is_most_recent_record will produce correct flag when used with unioned sources.
Null tables will not generate partitioning errors.

dbt Project configurations

n/a

Package versions

v0.10.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.
@fivetran-catfritz fivetran-catfritz added bug Something isn't working type:bug Something is broken or incorrect and removed bug Something isn't working labels Feb 7, 2024
@fivetran-avinash
Copy link
Contributor

fivetran-avinash commented Feb 13, 2024

Options for sub-issue 2, path TBD:

  • Option 1 Apply a variable that allows customers to enable/disable models.
    • Pros: This is the simplest potential implementation that should prevent errors from compiling.
    • Cons: We want to avoid variable bloat, particularly for variables that have an uncertain start date--if customers go from not using discounts (as they usually do) to using discounts, it'll cause disruptions for Quickstart users. And Quickstart customers cannot configure variables. Likely won't do.
  • Option 2 Apply a conditional based on the presence/absence of the union_schemas or union_databases variable.
    • Pros: This would future-proof this issue with regards to union data.
    • Cons: It's likely we run into the same error regardless of whether schemas are unioned or not. This would also require a non-trivial amount of exploration to determine the best implementation. Likely won't do.
  • Option 3 Hard code a case when statement for the situation where a table is empty.
    • Pros: Fairly simple application, even if slightly code heavy.
    • Cons: This would require some investigation as to whether downstream models would be impacted, and verifying row number accuracy.

Example code below:

  case when checkout_id is null and code is null and index is null
    then row_number() over(partition by source_relation) as index
    else row_number() over(partition by checkout_id, upper(code), source_relation order by index desc)
  end as index
  • Option 4 An offshoot of Option 3, involving a coalesce on the row_number.
    • Pros: Easiest to apply at scale.
    • Cons: Similar to Option 3, will require investigation.

Example code below:
row_number() over(partition by coalesce(checkout_id,-9999), coalesce(upper(code),'empty'), source_relation order by coalesce(index,-9999) desc) as index

@fivetran-avinash fivetran-avinash changed the title [Bug] Add source_relation to all partition by clauses [Bug] Update partitioning logic to account for source_relation, empty source tables and union data Feb 13, 2024
@fivetran-joemarkiewicz fivetran-joemarkiewicz added the status:scoping Currently being scoped label Feb 13, 2024
@fivetran-avinash
Copy link
Contributor

After much deliberation, Option 3 it is! Will update the ticket shortly.

@fivetran-avinash fivetran-avinash self-assigned this Mar 4, 2024
@fivetran-avinash fivetran-avinash added status:in_progress Currently being worked on and removed status:scoping Currently being scoped labels Mar 4, 2024
@fivetran-avinash fivetran-avinash linked a pull request Mar 5, 2024 that will close this issue
13 tasks
@fivetran-avinash fivetran-avinash added type:enhancement New functionality or enhancement update_type:models Primary focus requires model updates status:in_review Currently in review and removed status:in_progress Currently being worked on labels Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error:unforced status:in_review Currently in review type:bug Something is broken or incorrect type:enhancement New functionality or enhancement update_type:models Primary focus requires model updates
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants