Feature/performance enhancement #41
Changes from 35 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,11 +56,12 @@ dispatch: | |
``` | ||
|
||
### Database Incremental Strategies | ||
Some end models in this package are materialized incrementally. We currently use the `merge` strategy as the default strategy for BigQuery, Snowflake, and Databricks databases. For Redshift and Postgres databases, we use `delete+insert` as the default strategy. | ||
Some of the end models in this package are materialized incrementally. We have chosen `insert_overwrite` as the default strategy for **BigQuery** and **Databricks** databases, as it is only available for these dbt adapters. For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy. | ||
|
||
We recognize there are some limitations with these strategies, particularly around updated records in the past which cause duplicates, and are assessing using a different strategy in the future. | ||
`insert_overwrite` is our preferred incremental strategy because it will be able to properly handle updates to records that exist outside the immediate incremental window. That is, because it leverages partitions, `insert_overwrite` will appropriately update existing rows that have been changed upstream instead of inserting duplicates of them--all without requiring a full table scan. | ||
Review comment: can you define
Review comment: Thanks for the doc comments. I am going to rework this tomorrow!
||
|
||
> For either of these strategies, we highly recommend that users periodically run a `--full-refresh` to ensure a high level of data quality. | ||
`delete+insert` is our second choice as it resembles `insert_overwrite` but lacks partitions. This strategy works most of the time and appropriately handles incremental loads that do not contain changes to past records. However, if a past record has been updated and is outside of the incremental window, `delete+insert` will insert a duplicate record. 😱 | ||
Review comment: it might be worth qualifying https://docs.getdbt.com/docs/build/incremental-models#about-incremental_strategy
Review comment: I agree with @jasongroob and this can be a bit misleading. @fivetran-catfritz can you reword this to be more direct as we discussed earlier.
||
> Because of this, we highly recommend that **Snowflake**, **Redshift**, and **Postgres** users periodically run a `--full-refresh` to ensure a high level of data quality and remove any possible duplicates. | ||
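
To make the partition-replacement idea behind `insert_overwrite` concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SQL that dbt or this package generates: the `event_date` partition key, the row shape, and the `insert_overwrite` helper are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def insert_overwrite(target, new_rows):
    """Replace every partition of `target` that appears in `new_rows`.

    `target` maps a partition key (here a date) to the rows stored in that
    partition. Partitions absent from `new_rows` are left untouched.
    The `event_date` field is a hypothetical partition column.
    """
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row["event_date"]].append(row)

    result = dict(target)
    for partition, rows in incoming.items():
        result[partition] = rows  # the whole partition is swapped out
    return result


target = {
    date(2023, 1, 1): [{"id": "a", "event_date": date(2023, 1, 1), "value": 1}],
    date(2023, 1, 2): [{"id": "b", "event_date": date(2023, 1, 2), "value": 2}],
}

# An upstream change to record "a" arrives alongside a new record "c".
batch = [
    {"id": "a", "event_date": date(2023, 1, 1), "value": 10},
    {"id": "c", "event_date": date(2023, 1, 3), "value": 3},
]

updated = insert_overwrite(target, batch)
```

Because each affected partition is replaced wholesale, the changed record ends up stored once rather than twice. The same property means the incoming batch needs to contain every row that belongs in a partition it touches.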
|
||
## Step 2: Install the package | ||
Include the following mixpanel package version in your `packages.yml` file: | ||
|
@@ -69,7 +70,7 @@ Include the following mixpanel package version in your `packages.yml` file: | |
```yaml | ||
packages: | ||
- package: fivetran/mixpanel | ||
version: [">=0.8.0", "<0.9.0"] # we recommend using ranges to capture non-breaking changes automatically | ||
version: [">=0.9.0", "<0.10.0"] # we recommend using ranges to capture non-breaking changes automatically | ||
|
||
``` | ||
|
||
## Step 3: Define database and schema variables | ||
|
@@ -82,7 +83,6 @@ vars: | |
``` | ||
|
||
## (Optional) Step 4: Additional configurations | ||
<details><summary>Expand for configurations</summary> | ||
|
||
## Macros | ||
### analyze_funnel [(source)](https://github.com/fivetran/dbt_mixpanel/blob/master/macros/analyze_funnel.sql) | ||
|
@@ -98,7 +98,7 @@ The macro takes the following as arguments: | |
- `event_funnel`: List of event types (not case sensitive). | ||
- Example: `['play_song', 'stop_song', 'exit']` | ||
- `group_by_column`: (Optional) A column by which you want to segment the funnel (this macro pulls data from the `mixpanel__event` model). The default value is `None`. | ||
- Examaple: `group_by_column = 'country_code'`. | ||
- Example: `group_by_column = 'country_code'`. | ||
|
||
- `conversion_criteria`: (Optional) A `WHERE` clause that will be applied when selecting from `mixpanel__event`. | ||
- Example: To limit all events in the funnel to the United States, you'd provide `conversion_criteria = 'country_code = "US"'`. To limit only the song play events to the US, you'd input `conversion_criteria = 'country_code = "US" OR event_type != "play_song"'`. | ||
|
||
|
@@ -199,15 +199,13 @@ vars: | |
session_event_criteria: 'event_type in ("play_song", "stop_song", "create_playlist")' | ||
``` | ||
|
||
#### Session Trailing Window | ||
Events can sometimes come late. For example, events triggered on a mobile device that is offline will be sent to Mixpanel once the device reconnects to wifi or a cell network. This makes sessionizing a bit trickier/costlier, as the sessions model (and all final models in this package) is materialized as an incremental table. | ||
|
||
Therefore, to avoid requiring a full refresh to incorporate these delayed events into sessions, the package by default re-sessionizes the most recent 3 hours of events on each run. To change this, add the following variable to your `dbt_project.yml` file: | ||
#### Lookback Window | ||
Events can sometimes arrive late. For example, events triggered on a mobile device that is offline will be sent to Mixpanel once the device reconnects to wifi or a cell network. Since many of the models in this package are incremental, by default we look back 7 days to ensure late arrivals are captured while avoiding requiring a full refresh. To change the default lookback window, add the following variable to your `dbt_project.yml` file: | ||
|
||
```yml | ||
vars: | ||
mixpanel: | ||
sessionization_trailing_window: number_of_hours # ex: 12 | ||
lookback_window: number_of_days # default is 7 | ||
``` | ||
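
As a rough illustration of what a lookback window does on an incremental run, the sketch below selects the rows that would be reprocessed. It is written under assumptions: the function names, the `event_date` field, and the inclusive lower bound are hypothetical, and the package's actual SQL may differ.

```python
from datetime import date, timedelta


def incremental_cutoff(max_loaded_date, lookback_days=7):
    """Earliest event date to reprocess on an incremental run."""
    return max_loaded_date - timedelta(days=lookback_days)


def rows_to_reprocess(source_rows, max_loaded_date, lookback_days=7):
    """Return source rows dated on or after the lookback cutoff."""
    cutoff = incremental_cutoff(max_loaded_date, lookback_days)
    return [row for row in source_rows if row["event_date"] >= cutoff]


source = [
    {"id": 1, "event_date": date(2023, 3, 1)},   # older than the window
    {"id": 2, "event_date": date(2023, 3, 9)},   # late arrival inside the window
    {"id": 3, "event_date": date(2023, 3, 15)},  # new since the last run
]

selected = rows_to_reprocess(source, max_loaded_date=date(2023, 3, 14))
```

A wider window catches events that arrive later, at the cost of reprocessing more rows on every run.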
|
||
### Changing the Build Schema | ||
|
@@ -224,7 +222,7 @@ models: | |
### Change the source table references | ||
If an individual source table has a different name than the package expects, add the table name as it appears in your destination to the respective variable: | ||
|
||
> IMPORTANT: See this project's [`dbt_project.yml`](https://github.com/fivetran/dbt_mixpanel_source/blob/main/dbt_project.yml) variable declarations to see the expected names. | ||
> IMPORTANT: See this project's [`dbt_project.yml`](https://github.com/fivetran/dbt_mixpanel/blob/main/dbt_project.yml) variable declarations to see the expected names. | ||
|
||
```yml | ||
vars: | ||
|
@@ -241,8 +239,6 @@ Events are considered duplicates and consolidated by the package if they contain | |
|
||
This is performed in line with Mixpanel's internal de-duplication process, in which events are de-duped at the end of each day. This means that if an event was triggered during an offline session at 11:59 PM and _resent_ when the user came online at 12:01 AM, these records would _not_ be de-duplicated. This is the case in both Mixpanel and the Mixpanel dbt package. | ||
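
The midnight edge case can be sketched in a few lines of Python. This is a simplified model, not the package's SQL: `event_key` is a hypothetical stand-in for whichever identifying fields are compared, and `occurred_at` is an assumed timestamp field.

```python
from datetime import datetime


def dedupe_by_day(events):
    """Keep the first event seen for each (event_key, calendar date) pair."""
    seen = set()
    kept = []
    for event in events:
        key = (event["event_key"], event["occurred_at"].date())
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


events = [
    {"event_key": "x1", "occurred_at": datetime(2023, 5, 1, 10, 0)},
    {"event_key": "x1", "occurred_at": datetime(2023, 5, 1, 10, 5)},   # same day: dropped
    {"event_key": "x2", "occurred_at": datetime(2023, 5, 1, 23, 59)},
    {"event_key": "x2", "occurred_at": datetime(2023, 5, 2, 0, 1)},    # next day: kept
]

deduped = dedupe_by_day(events)
```

Since the calendar date is part of the comparison, the two `x2` records fall on different days and both survive, which mirrors the 11:59 PM / 12:01 AM case described above.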
|
||
</details> | ||
|
||
## (Optional) Step 5: Orchestrate your models with Fivetran Transformations for dbt Core™ | ||
<details><summary>Expand for details</summary> | ||
<br> | ||
|
Large diffs are not rendered by default.
Review comment: The removal of this variable should be treated as a breaking change in case users are leveraging it in their current workflow. Would you be able to move this to the breaking change section?