-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/performance enhancement #41
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-catfritz thanks for working through this and being diligent in ensuring the optimizations being applied to the incremental strategies are the gold standard!
As we just synced on this review, I found there were some inconsistencies in the aggregate end models (daily_events to be specific) that didn't seem to tie out. The staging model end end mixpanel__event model looked good, but we need to do a deeper dive into the other incremental models to ensure the incremental logic is working as we would expect.
Would you be able to apply the updates suggested in the comments below as well as update the incremental logic where needed to ensure the addition of new records in subsequent runs for the aggregate models work as we would expect. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-joemarkiewicz I have updated the docs as suggested. I have also updated the strategies a bit, so let me know if you want to chat about it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for making these updates! left a few comments / suggestions for you.
CHANGELOG.md
Outdated
>Note: This update was made breaking since it will alter the materialization of existing models. While these changes do not necessitate a `--full-refresh`, it may be beneficial if you run into issues with this update. | ||
- Updated models with the following performance improvements: | ||
- Update the incremental strategy for all models to `insert_overwrite` for BigQuery and Databricks and `delete+insert` for all other warehouses. | ||
- Removed `stg_mixpanel__event_tmp` in favor of `stg_mixpanel__event_tmp`, which is now an incremental model. While this will increase storage, this change was made to improve compute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like the second table name is wrong. did you mean:
Removed
stg_mixpanel__event_tmp
in favor ofstg_mixpanel__event
, which is now ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes. 😄
CHANGELOG.md
Outdated
|
||
## Feature Updates | ||
- Added `cluster_by` columns to the configs for incremental models. This will benefit Snowflake and BigQuery users. | ||
- Added column `dbt_run_date` to incremental models to improve accuracy and optimize downstream models. This date captures the date a record was added or updated by this package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you! it's always helpful to have these timestamp columns.
README.md
Outdated
|
||
We recognize there are some limitations with these strategies, particularly around updated records in the past which cause duplicates, and are assessing using a different strategy in the future. | ||
`insert_overwrite` is our preferred incremental strategy because it will be able to properly handle updates to records that exist outside the immediate incremental window. That is, because it leverages partitions, `insert_overwrite` will appropriately update existing rows that have been changed upstream instead of inserting duplicates of them--all without requiring a full table scan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you define immediate incremental window
? if the queries only pull in data from the last X days, do you mean changes that occurred to records >X days ago?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc comments. I am going to rework this tomorrow!
README.md
Outdated
|
||
> For either of these strategies, we highly recommend that users periodically run a `--full-refresh` to ensure a high level of data quality. | ||
`delete+insert` is our second-choice as it resembles `insert_overwrite` but lacks partitions. This strategy works most of the time and appropriately handles incremental loads that do not contain changes to past records. However, if a past record has been updated and is outside of the incremental window, `delete+insert` will insert a duplicate record. 😱 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be worth qualifying preferred
and second-choice
as it depends on the data platform. As an example, Snowflake doesn't support insert_overwrite
. I don't think you mean to imply that Snowflake is using a lesser methodology.
https://docs.getdbt.com/docs/build/incremental-models#about-incremental_strategy
models/mixpanel__daily_events.sql
Outdated
-- only return the most recent day of data | ||
where date_day >= coalesce( (select max(date_day) from {{ this }} ), '2010-01-01') | ||
|
||
where date_day >= {{ mixpanel.lookback(from_date="max(dbt_run_date)") }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this potentially exclude some late arriving data that occurred prior to the max(dbt_run_date)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you for catching. I ended up scrapping using dbt_run_date
in the incremental strategy and updated to use the 7-day lookback to be where date_day >= {{ mixpanel.mixpanel_lookback(from_date="max(date_day)", interval=var('lookback_window', 7), datepart='day') }}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome! nice improvement.
macros/lookback.sql
Outdated
@@ -0,0 +1,9 @@ | |||
{% macro lookback(from_date, datepart='day', interval=7, default_start_date='2010-01-01') %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having '2010-01-01' be a project variable could help if someone wants to have their own custom start date. 14 years is a LOT of data to include by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize calling this value default_start_date
is a bit misleading. The 2010-01-01
is only meant as a super cautious failsafe in case the of null values in an incremental run, which I imagine would rarely occur. I'll rename this. The actual start date of events being brought in can be set by variable date_range_start
. Thanks for the callout!
models/mixpanel__event.sql
Outdated
-- events are only eligible for de-duping if they occurred on the same calendar day | ||
occurred_at >= coalesce((select cast( max(date_day) as {{ dbt.type_timestamp() }} ) from {{ this }} ), '2010-01-01') | ||
{% if is_incremental() %} | ||
dbt_run_date >= {{ mixpanel.lookback(from_date="max(dbt_run_date)", interval=1) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move the default interval period to a project variable? i assume one day will be enough but maybe there's scenarios where more is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jasongroob that is a great idea. I have have created a variable lookback_windows
with a default of value 7 days. In your experience with mixpanel, would that be a more appropriate length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i think that should be fine. i don't have a good sense of how late new data can arrive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know the impacts on compute / storage of moving this table from ephemeral to incremental? if the mixpanel.event table is large (which it probably is), you are effectively writing the data three times. Once at the source, once in this STG model, and once in mixpanel__event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @jasongroob this is a good point. Another user mentioned wanting to avoid full table scans, but at the end of the day it doesn't make much sense to have a large table 3 times. I have tried to balance the storage vs compute a little more by putting the stg
model back to ephemeral since the tmp
view has been removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense. i think the current implementation of having the staging be ephemeral is a nice balance between compute and storage concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-catfritz great job on this PR! I know there has been a lot of back and forth here. This is nearly across the line, I just have a few final requests to update before approving. Please don't hesitate to reach out if you have any questions regarding my comments. Thanks!
CHANGELOG.md
Outdated
- `mixpanel__monthly_events` | ||
- `mixpanel__sessions` | ||
- Removed `stg_mixpanel__event_tmp` in favor of ephemeral model `stg_mixpanel__event`. This is to reduce redundancy of models created and reduce the number of full scans. | ||
- Updated the materialization of `stg_mixpanel__user_first_event` to a view. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the reasoning you gave to the previous bullet. Can we also add a blurb about why converting the stg_mixpanel__user_first_event
to a view
is more performant.
CHANGELOG.md
Outdated
|
||
## Feature Updates | ||
- Added a default 7-day look-back to incremental models to accommodate late arriving events. The numbers of days can be changed by setting the var `lookback_window` in your dbt_project.yml. See the [Lookback Window section of the README](https://github.com/fivetran/dbt_mixpanel/blob/main/README.md#lookback-window) for more details. | ||
- Note: this replaces the variable `sessionization_trailing_window`, which was previously used in the `mixpanel__sessions` model. This variable was replaced due to the change in the incremental and lookback strategy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of this variable should be treated as a breaking change in case users are leveraging this in their current workflow. Would you be able to move this to the breaking change section.
Co-authored-by: Joe Markiewicz <[email protected]>
Co-authored-by: Joe Markiewicz <[email protected]>
Co-authored-by: Joe Markiewicz <[email protected]>
@fivetran-joemarkiewicz Thanks for all the review comments! I have updated, and this is ready for re-review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fivetran-catfritz Great job!! This is looks good to go to me!
PR Overview
This PR will address the following Issue/Feature:
This PR will result in the following new package version:
stg_mixpanel__event_tmp
table and changingstg_mixpanel__event
to an incremental model.Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:
🚨 Breaking Changes 🚨
insert_overwrite
for BigQuery and Databricks anddelete+insert
for all other warehouses.stg_mixpanel__event
andstg_mixpanel__event_tmp
into one incremental model. While this will increase storage, we opted to make this change to improve compute.Feature Updates
cluster_by
columns to the configs for incremental models. This will benefit Snowflake and BigQuery users.PR Checklist
Basic Validation
Please acknowledge that you have successfully performed the following commands locally:
dbt run --full-refresh && dbt test
dbt run && dbt test
(if incremental models are present)Before marking this PR as "ready for review" the following have been applied:
Detailed Validation
Please share any and all of your validation steps:
If you had to summarize this PR in an emoji, which would it be?
⏲️