Feature/performance enhancement #41
Changes from 24 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
database_key: mixpanel_database | ||
schema_key: mixpanel_schema | ||
|
||
dbt_versions: ">=1.3.0 <2.0.0" | ||
|
||
destination_configurations: | ||
databricks: | ||
dispatch: | ||
- macro_namespace: dbt_utils | ||
search_order: [ 'spark_utils', 'dbt_utils' ] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,17 @@ | ||
# dbt_mixpanel v0.9.0 | ||
[PR #41](https://github.com/fivetran/dbt_mixpanel/pull/41) includes the following updates: | ||
|
||
## 🚨 Breaking Changes 🚨 | ||
>Note: This update is marked as breaking because it alters the materialization of existing models. While these changes do not require a `--full-refresh`, one may be beneficial if you run into issues with this update. | ||
- Updated models with the following performance improvements: | ||
- Updated the incremental strategy for all models to `insert_overwrite` for BigQuery and Databricks and `delete+insert` for all other warehouses. | ||
- Removed `stg_mixpanel__event_tmp` in favor of `stg_mixpanel__event`, which is now an incremental model. While this will increase storage, this change was made to improve compute. | ||
|
||
## Feature Updates | ||
- Added `cluster_by` columns to the configs for incremental models. This will benefit Snowflake and BigQuery users. | ||
- Added column `dbt_run_date` to incremental models to improve accuracy and optimize downstream models. This date captures the date a record was added or updated by this package. | ||
Review comment: thank you! it's always helpful to have these timestamp columns.
||
- Added a 7-day look-back to incremental models to accommodate late arriving events. | ||
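For illustration, here is a condensed sketch of how the new `dbt_run_date` column and 7-day look-back appear inside an incremental model body; the exact configs, including the new `cluster_by` keys, are in the model diffs further down in this PR, and `some_upstream_model` is only a placeholder:

```sql
select
    date_day,
    event_type,
    number_of_events,
    -- audit column: the date this record was added or updated by the package
    {{ mixpanel.date_today('dbt_run_date') }}
from {{ ref('some_upstream_model') }}  -- placeholder ref

{% if is_incremental() %}
-- look back 7 days from the latest loaded date to pick up late-arriving events
where date_day >= {{ mixpanel.lookback(from_date="max(date_day)", interval=7) }}
{% endif %}
```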
|
||
fivetran-joemarkiewicz marked this conversation as resolved.
|
||
# dbt_mixpanel v0.8.0 | ||
>Note: If you run into issues with this update, we suggest trying a **full refresh**. | ||
## 🎉 Feature Updates 🎉 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -56,11 +56,12 @@ dispatch: | |
``` | ||
|
||
### Database Incremental Strategies | ||
Some end models in this package are materialized incrementally. We currently use the `merge` strategy as the default strategy for BigQuery, Snowflake, and Databricks databases. For Redshift and Postgres databases, we use `delete+insert` as the default strategy. | ||
Some of the end models in this package are materialized incrementally. We have chosen `insert_overwrite` as the default strategy for **BigQuery** and **Databricks** databases, as it is only available for these dbt adapters. For **Snowflake**, **Redshift**, and **Postgres** databases, we have chosen `delete+insert` as the default strategy. | ||
|
||
We recognize there are some limitations with these strategies, particularly around updated records in the past which cause duplicates, and are assessing using a different strategy in the future. | ||
`insert_overwrite` is our preferred incremental strategy because it properly handles updates to records that fall outside the immediate incremental window. That is, because it leverages partitions, `insert_overwrite` will update existing rows that have changed upstream instead of inserting duplicates of them, all without requiring a full table scan. | ||
Review comment: can you define
Reply: Thanks for the doc comments. I am going to rework this tomorrow!
||
|
||
> For either of these strategies, we highly recommend that users periodically run a `--full-refresh` to ensure a high level of data quality. | ||
`delete+insert` is our second choice, as it resembles `insert_overwrite` but lacks partitions. This strategy works most of the time and appropriately handles incremental loads that do not contain changes to past records. However, if a past record has been updated and is outside of the incremental window, `delete+insert` will insert a duplicate record. 😱 | ||
Review comment: it might be worth qualifying https://docs.getdbt.com/docs/build/incremental-models#about-incremental_strategy
Review comment: I agree with @jasongroob and this can be a bit misleading. @fivetran-catfritz can you reword this to be more direct as we discussed earlier.
||
> Because of this, we highly recommend that **Snowflake**, **Redshift**, and **Postgres** users periodically run a `--full-refresh` to ensure a high level of data quality and remove any possible duplicates. | ||
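As implemented in this PR's model configs, the strategy selection described above boils down to a `target.type` check (condensed here; see the model diffs in this PR for the full configs):

```sql
{{
    config(
        materialized='incremental',
        unique_key='unique_key',
        -- insert_overwrite where available (BigQuery, Databricks/Spark); delete+insert elsewhere
        incremental_strategy='insert_overwrite' if target.type in ('bigquery', 'spark', 'databricks') else 'delete+insert',
        -- BigQuery expects a partition_by dict; Databricks/Spark expects a list of columns
        partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark', 'databricks') else ['date_day']
    )
}}
```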
|
||
## Step 2: Install the package | ||
Include the following mixpanel package version in your `packages.yml` file: | ||
|
@@ -69,7 +70,7 @@ Include the following mixpanel package version in your `packages.yml` file: | |
```yaml | ||
packages: | ||
- package: fivetran/mixpanel | ||
version: [">=0.8.0", "<0.9.0"] # we recommend using ranges to capture non-breaking changes automatically | ||
version: [">=0.9.0", "<0.10.0"] # we recommend using ranges to capture non-breaking changes automatically | ||
fivetran-joemarkiewicz marked this conversation as resolved.
|
||
``` | ||
|
||
## Step 3: Define database and schema variables | ||
|
@@ -82,7 +83,6 @@ vars: | |
``` | ||
|
||
## (Optional) Step 4: Additional configurations | ||
<details><summary>Expand for configurations</summary> | ||
|
||
## Macros | ||
### analyze_funnel [(source)](https://github.com/fivetran/dbt_mixpanel/blob/master/macros/analyze_funnel.sql) | ||
|
@@ -98,7 +98,7 @@ The macro takes the following as arguments: | |
- `event_funnel`: List of event types (not case sensitive). | ||
- Example: `['play_song', 'stop_song', 'exit']` | ||
- `group_by_column`: (Optional) A column by which you want to segment the funnel (this macro pulls data from the `mixpanel__event` model). The default value is `None`. | ||
- Examaple: `group_by_column = 'country_code'`. | ||
- Example: `group_by_column = 'country_code'`. | ||
fivetran-joemarkiewicz marked this conversation as resolved.
|
||
- `conversion_criteria`: (Optional) A `WHERE` clause that will be applied when selecting from `mixpanel__event`. | ||
- Example: To limit all events in the funnel to the United States, you'd provide `conversion_criteria = 'country_code = "US"'`. To limit only the song play events to the US, you'd input `conversion_criteria = 'country_code = "US" OR event_type != "play_song"'`. | ||
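For example, a call with the arguments above might look like the following (a sketch only; adjust the events and criteria to your own data, and place the call in whatever model or analysis file you use for funnel reporting):

```sql
{{ mixpanel.analyze_funnel(
    event_funnel=['play_song', 'stop_song', 'exit'],
    group_by_column='country_code',
    conversion_criteria='country_code = "US"'
) }}
```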
|
||
|
@@ -224,7 +224,7 @@ models: | |
### Change the source table references | ||
If an individual source table has a different name than the package expects, add the table name as it appears in your destination to the respective variable: | ||
|
||
> IMPORTANT: See this project's [`dbt_project.yml`](https://github.com/fivetran/dbt_mixpanel_source/blob/main/dbt_project.yml) variable declarations to see the expected names. | ||
> IMPORTANT: See this project's [`dbt_project.yml`](https://github.com/fivetran/dbt_mixpanel/blob/main/dbt_project.yml) variable declarations to see the expected names. | ||
|
||
```yml | ||
vars: | ||
|
@@ -241,8 +241,6 @@ Events are considered duplicates and consolidated by the package if they contain | |
|
||
This is performed in line with Mixpanel's internal de-duplication process, in which events are de-duped at the end of each day. This means that if an event was triggered during an offline session at 11:59 PM and _resent_ when the user came online at 12:01 AM, these records would _not_ be de-duplicated. This is the case in both Mixpanel and the Mixpanel dbt package. | ||
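As a rough sketch of that day-bounded de-duplication (the partition columns and the staging ref below are illustrative, not the package's exact list), duplicates can be collapsed with a window function:

```sql
with ranked as (

    select
        *,
        row_number() over (
            -- illustrative duplicate-identifying columns; the package defines its own list
            partition by people_id, event_type, occurred_at
            order by occurred_at
        ) as duplicate_rank
    from {{ ref('some_event_staging_model') }}  -- placeholder ref

)

-- keep one row per set of duplicate-identifying values;
-- de-dup eligibility is bounded to a calendar day by the incremental window
select *
from ranked
where duplicate_rank = 1
```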
|
||
</details> | ||
|
||
## (Optional) Step 5: Orchestrate your models with Fivetran Transformations for dbt Core™ | ||
<details><summary>Expand for details</summary> | ||
<br> | ||
|
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
{% macro date_today(col_name) %} | ||
|
||
cast( {{ dbt.date_trunc('day', dbt.current_timestamp_backcompat()) }} as date) as {{ col_name }} | ||
{# cast( '2024-02-06' as date) as {{ col_name }} -- for testing #} | ||
|
||
{% endmacro %} |
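As used in the model diffs below, this macro is called in a model's select list to stamp each record with the run date; it compiles to roughly `cast(date_trunc('day', current_timestamp) as date) as <col_name>`:

```sql
select
    date_day,
    event_type,
    {{ mixpanel.date_today('dbt_run_date') }}
from {{ ref('mixpanel__event') }}
```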
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
{% macro lookback(from_date, datepart='day', interval=7, default_start_date='2010-01-01') %} | ||
Review comment: having '2010-01-01' be a project variable could help if someone wants to have their own custom start date. 14 years is a LOT of data to include by default.
Reply: I realize calling this value
||
|
||
coalesce( | ||
(select {{ dbt.dateadd(datepart=datepart, interval=-interval, from_date_or_timestamp=from_date) }} | ||
from {{ this }}), | ||
{{ "'" ~ default_start_date ~ "'" }} | ||
) | ||
|
||
{% endmacro %} |
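As used in the model diffs below, this macro produces the lower bound for an incremental window, e.g.:

```sql
{% if is_incremental() %}
-- look back 27 days from the latest loaded date so window functions compute correctly
where date_day >= {{ mixpanel.lookback(from_date="max(date_day)", interval=27) }}
{% endif %}
```

Regarding the review suggestion above, the hard-coded `'2010-01-01'` fallback could be made configurable by passing a project variable as `default_start_date` (the variable name below is hypothetical):

```sql
where date_day >= {{ mixpanel.lookback(
    from_date="max(date_day)",
    default_start_date=var('mixpanel_lookback_start_date', '2010-01-01')
) }}
```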
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,9 +2,15 @@ | |
config( | ||
materialized='incremental', | ||
unique_key='unique_key', | ||
partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks') else ['date_day'], | ||
incremental_strategy = 'merge' if target.type not in ('postgres', 'redshift') else 'delete+insert', | ||
file_format = 'delta' | ||
incremental_strategy='insert_overwrite' if target.type in ('bigquery', 'spark', 'databricks') else 'delete+insert', | ||
partition_by={ | ||
"field": "date_day", | ||
"data_type": "date" | ||
} if target.type not in ('spark','databricks') | ||
else ['date_day'], | ||
cluster_by=['date_day', 'event_type'], | ||
file_format='parquet', | ||
on_schema_change='append_new_columns' | ||
) | ||
}} | ||
|
||
|
@@ -20,13 +26,8 @@ with events as ( | |
from {{ ref('mixpanel__event') }} | ||
|
||
{% if is_incremental() %} | ||
|
||
-- we look at the most recent 28 days for this model's window functions to compute properly | ||
where date_day >= coalesce( ( select {{ dbt.dateadd(datepart='day', interval=-27, from_date_or_timestamp="max(date_day)") }} | ||
from {{ this }} ), '2010-01-01') | ||
|
||
where date_day >= {{ mixpanel.lookback(from_date="max(date_day)", interval=27) }} | ||
{% endif %} | ||
|
||
), | ||
|
||
|
||
|
@@ -36,13 +37,8 @@ date_spine as ( | |
from {{ ref('stg_mixpanel__user_event_date_spine') }} | ||
|
||
{% if is_incremental() %} | ||
|
||
-- look backward for the last 28 days | ||
where date_day >= coalesce((select {{ dbt.dateadd(datepart='day', interval=-27, from_date_or_timestamp="max(date_day)") }} | ||
from {{ this }} ), '2010-01-01') | ||
|
||
where date_day >= {{ mixpanel.lookback(from_date="max(date_day)", interval=27) }} | ||
{% endif %} | ||
|
||
), | ||
|
||
agg_user_events as ( | ||
|
@@ -55,7 +51,6 @@ agg_user_events as ( | |
|
||
from events | ||
group by 1,2,3 | ||
|
||
), | ||
|
||
-- join the spine with event metrics | ||
|
@@ -74,7 +69,6 @@ spine_joined as ( | |
on agg_user_events.date_day = date_spine.date_day | ||
and agg_user_events.people_id = date_spine.people_id | ||
and agg_user_events.event_type = date_spine.event_type | ||
|
||
), | ||
|
||
trailing_events as ( | ||
|
@@ -89,7 +83,6 @@ trailing_events as ( | |
and number_of_events > 0 as is_repeat_user | ||
|
||
from spine_joined | ||
|
||
), | ||
|
||
agg_event_days as ( | ||
|
@@ -109,7 +102,6 @@ agg_event_days as ( | |
|
||
from trailing_events | ||
group by 1,2 | ||
|
||
), | ||
|
||
final as ( | ||
|
@@ -127,18 +119,14 @@ final as ( | |
number_of_users - number_of_new_users - number_of_repeat_users as number_of_return_users, | ||
trailing_users_28d, | ||
trailing_users_7d, | ||
event_type || '-' || date_day as unique_key | ||
{{ dbt_utils.generate_surrogate_key(['event_type', 'date_day']) }} as unique_key, | ||
Review comment: Do we want to generate a surrogate key here? The surrogate key will create a hash, whereas the previous value was a concatenation of the two columns, so we lose some decipherable information if we leverage the surrogate key, although I am not sure if this change was made to work better with the incremental updates. If we do end up changing this field we will need to update the docs and also call this out as part of a breaking change, as it will drastically change the previous results. What are your thoughts?
fivetran-catfritz marked this conversation as resolved.
||
{{ mixpanel.date_today('dbt_run_date') }} | ||
|
||
from agg_event_days | ||
|
||
{% if is_incremental() %} | ||
|
||
-- only return the most recent day of data | ||
where date_day >= coalesce( (select max(date_day) from {{ this }} ), '2010-01-01') | ||
|
||
where date_day >= {{ mixpanel.lookback(from_date="max(dbt_run_date)") }} | ||
Review comment: won't this potentially exclude some late arriving data that occurred prior to the max(dbt_run_date)?
Reply: Yes, thank you for catching. I ended up scrapping using
Review comment: awesome! nice improvement.
||
{% endif %} | ||
|
||
order by date_day desc, event_type | ||
) | ||
|
||
select * | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,29 +2,31 @@ | |
config( | ||
materialized='incremental', | ||
unique_key='unique_event_id', | ||
partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks') else ['date_day'], | ||
incremental_strategy = 'merge' if target.type not in ('postgres', 'redshift') else 'delete+insert', | ||
file_format = 'delta' | ||
incremental_strategy='insert_overwrite' if target.type in ('bigquery', 'spark', 'databricks') else 'delete+insert', | ||
partition_by={ | ||
"field": "date_day", | ||
"data_type": "date" | ||
} if target.type not in ('spark','databricks') | ||
else ['date_day'], | ||
cluster_by=['date_day', 'event_type', 'people_id'], | ||
file_format='parquet', | ||
on_schema_change='append_new_columns' | ||
) | ||
}} | ||
|
||
with stg_event as ( | ||
|
||
select * | ||
|
||
from {{ ref('stg_mixpanel__event') }} | ||
|
||
where | ||
{% if is_incremental() %} | ||
|
||
-- events are only eligible for de-duping if they occurred on the same calendar day | ||
occurred_at >= coalesce((select cast( max(date_day) as {{ dbt.type_timestamp() }} ) from {{ this }} ), '2010-01-01') | ||
{% if is_incremental() %} | ||
dbt_run_date >= {{ mixpanel.lookback(from_date="max(dbt_run_date)", interval=1) }} | ||
Review comment: maybe move the default interval period to a project variable? i assume one day will be enough but maybe there's scenarios where more is needed.
Reply: Thanks @jasongroob that is a great idea. I have created a variable
Review comment: yeah, i think that should be fine. i don't have a good sense of how late new data can arrive.
||
|
||
{% else %} | ||
|
||
-- limit date range on the first run / refresh | ||
occurred_at >= {{ "'" ~ var('date_range_start', '2010-01-01') ~ "'" }} | ||
|
||
{% endif %} | ||
), | ||
|
||
|
@@ -51,8 +53,8 @@ pivot_properties as ( | |
|
||
select | ||
* | ||
{% if var('event_properties_to_pivot') %}, | ||
{{ fivetran_utils.pivot_json_extract(string = 'event_properties', list_of_properties = var('event_properties_to_pivot')) }} | ||
{% if var('event_properties_to_pivot') %} | ||
, {{ fivetran_utils.pivot_json_extract(string = 'event_properties', list_of_properties = var('event_properties_to_pivot')) }} | ||
{% endif %} | ||
|
||
from dedupe | ||
|
Review comment: looks like the second table name is wrong. did you mean:
Reply: Oops, yes. 😄