Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/performance enhancement #41

Merged
merged 42 commits into from
Feb 21, 2024

Conversation

fivetran-catfritz
Copy link
Contributor

@fivetran-catfritz fivetran-catfritz commented Jan 25, 2024

PR Overview

This PR will address the following Issue/Feature:

This PR will result in the following new package version:

  • v0.9.0
  • Made this breaking since we are deleting the stg_mixpanel__event_tmp table and changing stg_mixpanel__event to an incremental model.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

🚨 Breaking Changes 🚨

Note: This update was made breaking since it will alter the materialization of existing models. While these changes do not necessitate a --full-refresh, it may be beneficial if you run into issues with this update.

  • Updated models with the following performance improvements:
    • Update the incremental strategy for all models to insert_overwrite for BigQuery and Databricks and delete+insert for all other warehouses.
    • Consolidated stg_mixpanel__event and stg_mixpanel__event_tmp into one incremental model. While this will increase storage, we opted to make this change to improve compute.

Feature Updates

  • Added cluster_by columns to the configs for incremental models. This will benefit Snowflake and BigQuery users.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run && dbt test (if incremental models are present)

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned.
  • All necessary documentation and version upgrades have been applied.
  • docs were regenerated (unless this PR does not include any code or yml updates).
  • BuildKite integration tests are passing.
  • Detailed validation steps have been provided below.

Detailed Validation

Please share any and all of your validation steps:

  • Ran full refreshes and and incremental runs in all the major warehouses.
  • See internal ticket for detailed validation.

If you had to summarize this PR in an emoji, which would it be?

⏲️

@fivetran-catfritz fivetran-catfritz self-assigned this Jan 25, 2024
Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz thanks for working through this and being diligent in ensuring the optimizations being applied to the incremental strategies are the gold standard!

As we just synced on this review, I found there were some inconsistencies in the aggregate end models (daily_events to be specific) that didn't seem to tie out. The staging model end end mixpanel__event model looked good, but we need to do a deeper dive into the other incremental models to ensure the incremental logic is working as we would expect.

Would you be able to apply the updates suggested in the comments below as well as update the incremental logic where needed to ensure the addition of new records in subsequent runs for the aggregate models work as we would expect. Thanks!

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-joemarkiewicz I have updated the docs as suggested. I have also updated the strategies a bit, so let me know if you want to chat about it!

README.md Show resolved Hide resolved
Copy link

@jasongroob jasongroob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for making these updates! left a few comments / suggestions for you.

CHANGELOG.md Outdated
>Note: This update was made breaking since it will alter the materialization of existing models. While these changes do not necessitate a `--full-refresh`, it may be beneficial if you run into issues with this update.
- Updated models with the following performance improvements:
- Update the incremental strategy for all models to `insert_overwrite` for BigQuery and Databricks and `delete+insert` for all other warehouses.
- Removed `stg_mixpanel__event_tmp` in favor of `stg_mixpanel__event_tmp`, which is now an incremental model. While this will increase storage, this change was made to improve compute.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the second table name is wrong. did you mean:

Removed stg_mixpanel__event_tmp in favor of stg_mixpanel__event, which is now ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes. 😄

CHANGELOG.md Outdated

## Feature Updates
- Added `cluster_by` columns to the configs for incremental models. This will benefit Snowflake and BigQuery users.
- Added column `dbt_run_date` to incremental models to improve accuracy and optimize downstream models. This date captures the date a record was added or updated by this package.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you! it's always helpful to have these timestamp columns.

README.md Outdated

We recognize there are some limitations with these strategies, particularly around updated records in the past which cause duplicates, and are assessing using a different strategy in the future.
`insert_overwrite` is our preferred incremental strategy because it will be able to properly handle updates to records that exist outside the immediate incremental window. That is, because it leverages partitions, `insert_overwrite` will appropriately update existing rows that have been changed upstream instead of inserting duplicates of them--all without requiring a full table scan.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you define immediate incremental window? if the queries only pull in data from the last X days, do you mean changes that occurred to records >X days ago?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc comments. I am going to rework this tomorrow!

README.md Outdated

> For either of these strategies, we highly recommend that users periodically run a `--full-refresh` to ensure a high level of data quality.
`delete+insert` is our second-choice as it resembles `insert_overwrite` but lacks partitions. This strategy works most of the time and appropriately handles incremental loads that do not contain changes to past records. However, if a past record has been updated and is outside of the incremental window, `delete+insert` will insert a duplicate record. 😱

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be worth qualifying preferred and second-choice as it depends on the data platform. As an example, Snowflake doesn't support insert_overwrite. I don't think you mean to imply that Snowflake is using a lesser methodology.

https://docs.getdbt.com/docs/build/incremental-models#about-incremental_strategy

-- only return the most recent day of data
where date_day >= coalesce( (select max(date_day) from {{ this }} ), '2010-01-01')

where date_day >= {{ mixpanel.lookback(from_date="max(dbt_run_date)") }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this potentially exclude some late arriving data that occurred prior to the max(dbt_run_date)?

Copy link
Contributor Author

@fivetran-catfritz fivetran-catfritz Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you for catching. I ended up scrapping using dbt_run_date in the incremental strategy and updated to use the 7-day lookback to be where date_day >= {{ mixpanel.mixpanel_lookback(from_date="max(date_day)", interval=var('lookback_window', 7), datepart='day') }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome! nice improvement.

@@ -0,0 +1,9 @@
{% macro lookback(from_date, datepart='day', interval=7, default_start_date='2010-01-01') %}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having '2010-01-01' be a project variable could help if someone wants to have their own custom start date. 14 years is a LOT of data to include by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize calling this value default_start_date is a bit misleading. The 2010-01-01 is only meant as a super cautious failsafe in case the of null values in an incremental run, which I imagine would rarely occur. I'll rename this. The actual start date of events being brought in can be set by variable date_range_start. Thanks for the callout!

-- events are only eligible for de-duping if they occurred on the same calendar day
occurred_at >= coalesce((select cast( max(date_day) as {{ dbt.type_timestamp() }} ) from {{ this }} ), '2010-01-01')
{% if is_incremental() %}
dbt_run_date >= {{ mixpanel.lookback(from_date="max(dbt_run_date)", interval=1) }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move the default interval period to a project variable? i assume one day will be enough but maybe there's scenarios where more is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jasongroob that is a great idea. I have have created a variable lookback_windows with a default of value 7 days. In your experience with mixpanel, would that be a more appropriate length?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i think that should be fine. i don't have a good sense of how late new data can arrive.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know the impacts on compute / storage of moving this table from ephemeral to incremental? if the mixpanel.event table is large (which it probably is), you are effectively writing the data three times. Once at the source, once in this STG model, and once in mixpanel__event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jasongroob this is a good point. Another user mentioned wanting to avoid full table scans, but at the end of the day it doesn't make much sense to have a large table 3 times. I have tried to balance the storage vs compute a little more by putting the stg model back to ephemeral since the tmp view has been removed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. i think the current implementation of having the staging be ephemeral is a nice balance between compute and storage concerns.

Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz great job on this PR! I know there has been a lot of back and forth here. This is nearly across the line, I just have a few final requests to update before approving. Please don't hesitate to reach out if you have any questions regarding my comments. Thanks!

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
- `mixpanel__monthly_events`
- `mixpanel__sessions`
- Removed `stg_mixpanel__event_tmp` in favor of ephemeral model `stg_mixpanel__event`. This is to reduce redundancy of models created and reduce the number of full scans.
- Updated the materialization of `stg_mixpanel__user_first_event` to a view.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the reasoning you gave to the previous bullet. Can we also add a blurb about why converting the stg_mixpanel__user_first_event to a view is more performant.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated

## Feature Updates
- Added a default 7-day look-back to incremental models to accommodate late arriving events. The numbers of days can be changed by setting the var `lookback_window` in your dbt_project.yml. See the [Lookback Window section of the README](https://github.com/fivetran/dbt_mixpanel/blob/main/README.md#lookback-window) for more details.
- Note: this replaces the variable `sessionization_trailing_window`, which was previously used in the `mixpanel__sessions` model. This variable was replaced due to the change in the incremental and lookback strategy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of this variable should be treated as a breaking change in case users are leveraging this in their current workflow. Would you be able to move this to the breaking change section.

models/mixpanel__monthly_events.sql Outdated Show resolved Hide resolved
models/staging/stg_mixpanel__user_event_date_spine.sql Outdated Show resolved Hide resolved
models/mixpanel__sessions.sql Outdated Show resolved Hide resolved
models/mixpanel__sessions.sql Outdated Show resolved Hide resolved
models/mixpanel__daily_events.sql Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@fivetran-catfritz
Copy link
Contributor Author

@fivetran-joemarkiewicz Thanks for all the review comments! I have updated, and this is ready for re-review.

Copy link
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fivetran-catfritz Great job!! This is looks good to go to me!

@fivetran-avinash fivetran-avinash self-requested a review February 21, 2024 20:15
@fivetran-catfritz fivetran-catfritz merged commit 19e968c into main Feb 21, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants