Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Unable to Find Source Schemas on Database that is Different From the Target Database (Multi-Database Setup) #126

Open
2 of 4 tasks
Kryt87 opened this issue Dec 2, 2024 · 18 comments
Labels
status:blocked Need additional information or requirements before proceeding type:wontfix This will not be worked on

Comments

@Kryt87
Copy link

Kryt87 commented Dec 2, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

Selecting the source database using relevant <connector>_database (reddit_ads_database etc.) variables in the dbt_project.yml file is not working as expected when the target database is not the same as the database where the data is stored. In this instance many "Please be aware" messages about being unable to find source tables in the database.

After reviewing the debug level logs and running a few tests, I observed that when running the staging models, a query against the Redshift information_schema.tables table is run but is pointing to the information_schema schema on the target database and not the data source database. Which is why the tables cannot be found.

Relevant error log or model output

CLI Output:

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s). The Fivetran dbt package will create a completely empty CAMPAIGN_REPORT staging model as to not break downstream transformations. To turn off these warnings, set the `fivetran__remove_empty_table_warnings` variable to TRUE (see https://github.com/fivetran/dbt_fivetran_utils/tree/releases/v0.4.latest#union_data-source for details).

Debugging Logs:

select
        table_catalog as database,
        table_name as name,
        table_schema as schema,
        'table' as type
    from information_schema.tables
    where table_schema ilike 'fivetran_reddit_ads'
    and table_type = 'BASE TABLE'
    union all
    select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case
        when view_definition ilike '%create materialized view%'
          then 'materialized_view'
        else 'view'
      end as type
    from information_schema.views
    where table_schema ilike 'fivetran_reddit_ads'
�[0m01:29:19.613576 [debug] [Thread-3 (]: SQL status: SUCCESS in 0.133 seconds
�[0m01:29:19.614802 [debug] [Thread-3 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: 

Expected behavior

Expected that the required source tables could be identified in the ingestion database. There is no issue if the target database is the same as the target database.

Possible solution

The redshift__list_relations_without_caching appears to be what is used to list the tables in a schema. Not sure what needs to happen to get it pointing at the correct database.

dbt Project configurations

vars:
  linkedin_ads_database: ingestion
  linkedin_ads_schema: fivetran_linkedin_ads
  pinterest_database: ingestion
  pinterest_schema: fivetran_pinterest_ads
  reddit_ads_database: ingestion
  reddit_ads_schema: fivetran_reddit_ads

Package versions

  • package: dbt-labs/redshift
    version: 0.9.0
  • package: fivetran/ad_reporting
    version: [">=1.10.0", "<1.11.0"]

What database are you using dbt with?

redshift

How are you running this dbt package?

dbt Core™

dbt Version

Core:

  • installed: 1.8.9
  • latest: 1.8.9 - Up to date!

Plugins:

  • redshift: 1.8.1 - Up to date!
  • postgres: 1.8.2 - Up to date!

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance.
  • No.
@Kryt87 Kryt87 added the type:bug Something is broken or incorrect label Dec 2, 2024
@fivetran-joemarkiewicz
Copy link
Contributor

Hi @Kryt87 thanks for opening this issue, but sorry to hear you're running into this problem.

Quick question before diving into what could be happening here. What is the name of your source database you would expect the package to be reading from as opposed to the target? In your reddit_ads_database variable I see you have defined ingestion, and then I can see in the log that ingestion was used for the query where no campaign_report table was found.

While listing relations in database=ingestion, schema=fivetran_reddit_ads

Is the ingestion database the source or is there a different database you would like the package to read from?

@Kryt87
Copy link
Author

Kryt87 commented Dec 2, 2024 via email

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks for clarifying that! It does seem like from the logs that the package is reading from the ingestion database 🤔

Would you be able to confirm that in this case you do see the campaign_report table in the ingestion.fivetran_reddit_ads location in your Redshift destination?

@Kryt87
Copy link
Author

Kryt87 commented Dec 2, 2024

I can confirm that the campaign_report table exists in ingestion.fivetran_reddit_ads.

However, when I ran a few tests I was able to confirm that the packages was not reading from the ingestion database. In a dev target database I created the dummy table dbt_kurt_drew.fivetran_reddit_ads.test_reddit_ads. I then ran it again, where the debugging logs showed:

�[0m03:13:30.546025 [debug] [Thread-1 (]: While listing relations in database=ingestion, schema=fivetran_reddit_ads, found: test_reddit_ads
�[0m03:13:30.549303 [warn ] [Thread-1 (]: 

Please be aware: The CAMPAIGN_REPORT table was not found in your REDDIT_ADS schema(s).

This dummy table does not exist in the ingestion database.

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks for sharing. One last question before diving deeper - do you have any other configs (outside of what you already shared in the issue description) in your dbt_project.yml vars or models sections that could be causing the package to read from the incorrect location?

@Kryt87
Copy link
Author

Kryt87 commented Dec 2, 2024

Nothing that stands out to me 😞

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks for checking @Kryt87, I was able to dig into this deeper and believe I may have found the culprit here.

I was able to recreate this error on Redshift. I did notice that this worked as expected on all other destinations which was a bit odd to me. However, for Redshift I saw the exact same behavior as you mentioned. I then went under the hood and was inspecting these lines which are run during the first query step of the Ad Reporting models and found that the proper source database.schema.table was being used. However, the relation.value would always be None and therefore result in the empty table warning you're seeing.

So it seems the package is interpreting the variables properly, but the actual relation.value is not working as expected. I then decided to hardcode a select * from database.schema.value in one of my models and I immediately saw the below error.

Database Error in model stg_reddit_ads__ad_tmp (models/tmp/stg_reddit_ads__ad_tmp.sql)
  Datasharing is not enabled on this cluster

This would make sense why the variables are working as expected, but then the actual relation is failing since datasharing isn't enabled on my end. Do you by chance have datasharing enabled on your end to facilitate this? I imagine if you're able to enable this then you'll be able see success when reading from one database and write to another.

Let me know if this helps address your issue!

@Kryt87
Copy link
Author

Kryt87 commented Dec 3, 2024

I'm glad you are seeing the same error, I find Redshift can be very particular about this kind of thing. Regarding data sharing, it is enabled for the cluster. However, there are no data shares set up between these two databases and there shouldn't need to be as they are on the same cluster. Data sharing is primarily designed to facilitate sharing data across different Redshift clusters, which can be in different AWS accounts or regions.

When testing, I created both a staging model (view) that sourced the campaign_report and referenced it in an intermediate model. This all worked as expected, even though the source was on the ingestion database and the staging and intermediate models were created on a seperate dev databases.

@fivetran-joemarkiewicz
Copy link
Contributor

I also was confused why the data sharing was not enabled for the same cluster, but that's the error I was seeing.

It's also strange that you're able to query directly from the source in that staging model yet you don't see the error I saw. Could you share your test staging model view and the log result? I'm wondering if something may be causing issues at the adapter.get_relations level within dbt-redshift itself 🤔

@Kryt87
Copy link
Author

Kryt87 commented Dec 3, 2024

I think you are on the right track with dbt-redshift.

The view is rather simple:

with source as (

    select * from {{ source('fivetran_reddit_ads', 'campaign_report') }}

),

renamed as (

    select * from source

)

select * from renamed

With the source being:

sources:
  - name: fivetran_reddit_ads
    description: ""
    database: ingestion

    tables:
      - name: campaign_report
        description: ""

debug.log - slimmed down debugging logs

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks so much for sharing @Kryt87. I've been looking into this more, but struggling on my end since I continue to run into the data sharing error. I want to narrow this down even further to the dbt-redshift adapter and see if the adapter.get_relation is the culprit. Would you be able to try and use the below version in your test model. I would be curious if this works as expected, or if it returns the relation "none" does not exist error I end up seeing.

{% set relation = namespace(value="") %}
{%- set relation.value=adapter.get_relation(
    database=source('fivetran_reddit_ads', 'campaign_report').database,
    schema=source('fivetran_reddit_ads', 'campaign_report').schema,
    identifier='campaign_report'
) -%}

with source as (

    select * from {{ relation.value }}

),

renamed as (

    select * from source

)

select * from renamed

Let me know if this ends up working for you. If it doesn't then we may need to open an issue on the dbt-redshift adapter to have those maintainers help address this issue. If not, then we can narrow this issue down to some component in the package not working as expected.

@Kryt87
Copy link
Author

Kryt87 commented Dec 4, 2024

Maybe I have something set up wrong, but I am getting the following error:

21:37:24    Database Error in model stg_reddit_ads_campaign_report (models/staging/reddit_ads/campaign_report/stg_reddit_ads_campaign_report.sql)
  All the relation names inside should be qualified when creating VIEW WITH NO SCHEMA BINDING.
  compiled code at target/run/data_products/models/staging/reddit_ads/campaign_report/stg_reddit_ads_campaign_report.sql

@Kryt87
Copy link
Author

Kryt87 commented Dec 4, 2024

The compiled code is:

  create view "dbt_kurt_drew"."public"."stg_reddit_ads_campaign_report__dbt_tmp" as (
    with source as (

    select * from None

),

renamed as (

    select * from source

)

select * from renamed
  ) with no schema binding;

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks for sharing @Kryt87, this is promising in the sense that we see in your compiled code that relation.value is returning None as I'm seeing on my end as well. This makes me consider the adapter.get_relations is not working as expected.

However, it is interesting to see the with no schema binding component of the compiled code. Would you be able to share the reason for this addition and how you've added it to be included when materializing models? I'm not entirely sure this is an issue, but wanted to ask since the addition of this does produce a different error from what I'm seeing.

@fivetran-joemarkiewicz
Copy link
Contributor

FYI I have a draft bug report for the dbt-redshift repo that I can send over if we don't find anything obvious with the no schema binding config. I'll hold off submitting that issue until we confirm this is not possibly contributing to this issue.

@Kryt87
Copy link
Author

Kryt87 commented Dec 5, 2024

We have had some issues with table bindings in the past, so in dbt_project.yml we put bind: false under models:

@fivetran-joemarkiewicz
Copy link
Contributor

Thanks for sharing @Kryt87. I don't imagine that config is the cause of these issues. I've gone ahead and opened the above issue on the dbt-redshift repo. Let's move our conversation there for the meantime to see if the maintainers of that repo are able to help shed light on what may be going on. Thanks!

@fivetran-joemarkiewicz
Copy link
Contributor

@Kryt87, I don't know if you've been following the linked dbt-redshift issue above but dbt Labs has confirmed that Redshift does not support cross-db querying 😞. The issue has been consolidated within this feature request on the dbt-redshift repo.

As dbt-redshift does not support cross-db querying, we are blocked from allowing this to work as we would expect within the dbt package. Unfortunately we are limited now to the prioritization of the above mentioned feature request within dbt-redshift.

Since we're blocked, I've opted to create PR #128 to clarify this limitation in the README for users wanting to use this package across databases and that this operation is limited to the dbt adapter's capabilities. Additionally, I'll need to mark this ticket as won't fix since we unfortunately won't be able to address this issue at the dbt package level. I sincerely apologize this doesn't address your original issue, but I encourage you to post in the dbt-redshift feature request linked above to let the folks at dbt Labs know this is an important feature to you.

@fivetran-joemarkiewicz fivetran-joemarkiewicz added type:wontfix This will not be worked on status:blocked Need additional information or requirements before proceeding and removed type:bug Something is broken or incorrect status:scoping Currently being scoped labels Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:blocked Need additional information or requirements before proceeding type:wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants