Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] MicrobatchBuilder generates invalid table names with spaces for hourly batches #11165

Open
2 tasks done
pei0804 opened this issue Dec 18, 2024 · 3 comments
Open
2 tasks done
Labels
bug Something isn't working microbatch Issues related to the microbatch incremental strategy

Comments

@pei0804
Copy link

pei0804 commented Dec 18, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When using incremental_strategy="microbatch" with batch_size="hour", MicrobatchBuilder generates a batch ID that contains spaces and special characters. This leads to invalid temporary table names and SQL syntax errors in database adapters.

Example error from Snowflake adapter:

SQL compilation error: syntax error line 1 at position 150 unexpected '00'.

The error occurs because the generated temporary table name contains spaces and timezone information:

create or replace temporary table [...].model_name__dbt_tmp_20241218 00:00:00+00:00

CleanShot 2024-12-18 at 11 39 54@2x

Expected Behavior

MicrobatchBuilder should generate a valid batch ID without spaces or special characters for hourly batches, similar to how it handles daily batches.

For example:

create or replace temporary table [...].model_name__dbt_tmp_20241218T000000Z

Steps To Reproduce

  1. Create a model with microbatch incremental strategy:
{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        begin="2022-01-01T00:00:00",
        batch_size="hour",
        event_time="_partition_hourly",
        unique_key="_partition_hourly",
        lookback=1,
    )
}}
  1. Run the model with dbt run --select hoge --event-time-start "2024-12-18T00:00:00" --event-time-end "2024-12-18T01:00:00"
  2. Observe the SQL syntax error due to invalid temporary table name

Relevant log output

02:13:16  Using snowflake connection "model.fuga.hoge"
02:13:16  On model.fuga.hoge: create or replace temporary table dbname.schemaname.modelname__dbt_tmp_20241218 00:00:00+00:00
         as
        (

with
    source as (

        select *
        from dbname.schemaname.modelname

    ),

    deduplicated as (
        select *
    from source
    qualify
        row_number() over (
            partition by id
            order by _partition_hourly asc, bid_at asc
        ) = 1
    )

select *
from deduplicated
        )
02:13:17  Snowflake adapter: Snowflake query id: hoge
02:13:17  Snowflake adapter: Snowflake error: 001003 (42000): SQL compilation error:
syntax error line 1 at position 150 unexpected '00'.
02:13:17  Unhandled error while executing
Exception on worker thread. Database Error
  001003 (42000): SQL compilation error:
  syntax error line 1 at position 150 unexpected '00'.
02:13:17  Batch 1 of 1 ERROR creating batch 2024-12-18 00:00:00+00:00 of dbname.schemaname.modelname [ERROR in 0.39s]

Environment

- OS: macOS Sonoma 14.6.1
- Python: 3.12.6
- dbt-core: 1.9.1
- dbt-snowflake: 1.9.0

Which database adapter are you using with dbt?

snowflake

Additional Context

The issue occurs in several parts of the codebase:

  1. First, the batch ID is set in core/dbt/task/run.py:
    https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/task/run.py#L352-L356
self.node.batch = BatchContext(
    id=MicrobatchBuilder.batch_id(batch[0], self.node.config.batch_size),
    event_time_start=batch[0],
    event_time_end=batch[1],
)
  1. This batch ID is then used in the global make_temp_relation macro to create unique temporary table names:
    https://github.com/dbt-labs/dbt-adapters/blob/5407391c5cef22a5c0431daa469d6a8295c026d8/dbt/include/global_project/macros/adapters/relation.sql#L9-L16
{% macro make_temp_relation(base_relation, suffix='__dbt_tmp') %}
  {#-- This ensures microbatch batches get unique temp relations to avoid clobbering --#}
  {% if suffix == '__dbt_tmp' and model.batch %}
    {% set suffix = suffix ~ '_' ~ model.batch.id %}
  {% endif %}
  {{ return(adapter.dispatch('make_temp_relation', 'dbt')(base_relation, suffix)) }}
{% endmacro %}
  1. The problematic batch ID generation occurs in the MicrobatchBuilder class:
    @staticmethod
    def batch_id(start_time: datetime, batch_size: BatchSize) -> str:
    return MicrobatchBuilder.format_batch_start(start_time, batch_size).replace("-", "")
    @staticmethod
    def format_batch_start(batch_start: datetime, batch_size: BatchSize) -> str:
    return str(
    batch_start.date() if (batch_start and batch_size != BatchSize.hour) else batch_start
    )
@staticmethod
def batch_id(start_time: datetime, batch_size: BatchSize) -> str:
    return MicrobatchBuilder.format_batch_start(start_time, batch_size).replace("-", "")

@staticmethod
def format_batch_start(batch_start: datetime, batch_size: BatchSize) -> str:
    return str(
        batch_start.date() if (batch_start and batch_size != BatchSize.hour) else batch_start
    )

For hourly batches (batch_size="hour"), format_batch_start returns str(batch_start) which generates a datetime string like "2024-12-18 00:00:00+00:00". While batch_id removes hyphens, it does not handle spaces and timezone information, resulting in an invalid table name.

For non-hourly batches (day/month/year), it correctly uses batch_start.date() which produces a clean format like "2024-12-18", and after removing hyphens becomes "20241218".

The issue stems from the fact that the batch ID flows from run.py through the Jinja templating system and into SQL table names without proper sanitization for hourly batches. The fix would likely involve modifying the format_batch_start method to ensure hourly timestamps use a database-friendly format (e.g., ISO format "20241218T000000Z") similar to how it handles daily batches.

@pei0804 pei0804 added bug Something isn't working triage labels Dec 18, 2024
@yanithx
Copy link

yanithx commented Dec 20, 2024

I am experiencing the same issue!

SQL compilation error: syntax error line 1 at position 87 unexpected '12'.

create or replace  temporary view RAW.xxx.campaign_email_events__dbt_tmp_20241220 12:00:00+00:00
  
   as (
...    

@syakesaba
Copy link

+1

@dbeatty10 dbeatty10 added the microbatch Issues related to the microbatch incremental strategy label Jan 10, 2025
@dbeatty10
Copy link
Contributor

Thanks for reporting this @pei0804 (and to @yanithx and @syakesaba for confirming)!

I was able to reproduce this issue -- see "reprex" below for details.

Potential fix

@pei0804 very nice research on the relevant areas of code 🤩.

Making the following modifications to this code worked when I tested it out -- although our final fix might look a bit different:

    @staticmethod
    def batch_id(start_time: datetime, batch_size: BatchSize) -> str:
        return MicrobatchBuilder.format_batch_start(start_time, batch_size)

    @staticmethod
    def format_batch_start(batch_start: datetime, batch_size: BatchSize) -> str:        
        # If we want a date only
        if batch_size != BatchSize.hour:
            return batch_start.strftime('%Y%m%d')  # e.g. "20241218"

        # If we want date + time
        return batch_start.strftime('%Y%m%dT%H%M%SZ')  # e.g. "20241218T000000Z"

This changed the SQL from this:

create or replace temporary table analytics_dev.dbt_dbeatty.my_microbatch_model__dbt_tmp_20250110 14:00:00+00:00

to this instead:

create or replace temporary table analytics_dev.dbt_dbeatty.my_microbatch_model__dbt_tmp_20250110T140000Z

Note:

  • I didn't test out what would happen if the batch_size is something else besides hour.
  • also didn't test with positive or negative UTC offsets or naive timestamps

Repex

Create this file:

models/my_microbatch_model.sql

{{
    config(
        materialized="incremental",
        incremental_strategy="microbatch",
        begin="2025-01-10T00:00:00",
        batch_size="hour",
        event_time="_partition_hourly",
        unique_key="_partition_hourly",
        lookback=1,
    )
}}

select 1 as id, {{ dbt.current_timestamp() }} as _partition_hourly

Run this command:

dbt run -s my_microbatch_model

Get this error:

15:50:51  1 of 1 START sql microbatch model dbt_dbeatty.my_microbatch_model .............. [RUN]
15:50:51  Batch 1 of 2 START batch 2025-01-10 14:00:00+00:00 of dbt_dbeatty.my_microbatch_model  [RUN]
15:50:52  Unhandled error while executing 
Exception on worker thread. Database Error
  001003 (42000): SQL compilation error:
  syntax error line 1 at position 98 unexpected '14'.

@dbeatty10 dbeatty10 removed the triage label Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working microbatch Issues related to the microbatch incremental strategy
Projects
None yet
Development

No branches or pull requests

4 participants