Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enables Schemas in DuckDB #470

Closed
wants to merge 1 commit into from

Conversation

rlancer
Copy link

@rlancer rlancer commented Mar 6, 2024

Fixes issue where using a DuckDB database with multiple schemas breaks

Current result when running with a DB that has multiple schemas:

image

@jonmmease
Copy link
Collaborator

Thanks for the fix @rlancer! I'd like to work out how to test this for the single and multi schema case (and I'm happy to work on adding tests). Do you know of an easy way to construct a DuckDB connection with multiple schemas?

@jonmmease
Copy link
Collaborator

Looks like the tests that use a single schema are failing: https://github.com/hex-inc/vegafusion/actions/runs/8179198596/job/22368156069?pr=470

@rlancer rlancer marked this pull request as draft March 7, 2024 03:44
@rlancer
Copy link
Author

rlancer commented Mar 7, 2024

@jonmmease thank you! Will look at getting the tests worked out also noticing the same issue the tests are highlighting

@jonmmease
Copy link
Collaborator

Hi @rlancer, anything I can do to help get this over the finish line? Do you know of a self-contained way to construct a DuckDB connection with multiple schemas that triggers this error?

@rlancer
Copy link
Author

rlancer commented Mar 21, 2024

Hi @rlancer, anything I can do to help get this over the finish line? Do you know of a self-contained way to construct a DuckDB connection with multiple schemas that triggers this error?

Any connection to MotherDuck should trigger the error

As far as fixing it I think we would need to change how table paths are resolved ie table://movies to support schemas

@jonmmease
Copy link
Collaborator

Any connection to MotherDuck should trigger the error

I don't have a MotherDuck account, and I wouldn't want to introduce this as a dependency for testing. I was hoping there would be a way to create this situation with a local DuckDB connection.

@rlancer
Copy link
Author

rlancer commented Mar 21, 2024

Any connection to MotherDuck should trigger the error

I don't have a MotherDuck account, and I wouldn't want to introduce this as a dependency for testing. I was hoping there would be a way to create this situation with a local DuckDB connection.

Sure I'll do a repo with just DuckDB

@rlancer
Copy link
Author

rlancer commented Mar 21, 2024

@jonmmease this will repo the issue using only DuckDB:

import vegafusion as vf
import pandas as pd
import altair as alt
import duckdb

conn = duckdb.connect()
conn.sql("create or replace schema schema1; create or replace table schema1.table1 (a int, b int);")

vf.runtime.set_connection(conn)

@jonmmease
Copy link
Collaborator

this will repo the issue using only DuckDB:

perfect!

@jonmmease
Copy link
Collaborator

Looking at this a bit, and I think the approach of referencing tables by URL with the tables:// is going to be a bit messy in the multi-schema case.

The newer alternative, that I haven't documented yet, is to wrap an individual DuckDBPyRelation object in a VegaFusion DuckDbDataset wrapper, then pass this to Altair as if it were a pandas DataFrame. And also enable the "vegafusion" data transformer in Altair.

It looks like this approach works fine with the schema case:

For example:

import vegafusion as vf
from vegafusion.dataset.duckdb import DuckDbDataset
import pandas as pd
import altair as alt
import duckdb

conn = duckdb.connect()
conn.sql("""
create or replace schema schema1;
create or replace table schema1.table1 (a int, b int);
INSERT INTO schema1.table1 VALUES (1, 5);
INSERT INTO schema1.table1 VALUES (2, 8);
""")

rel = conn.sql("select * from schema1.table1")

# Create VegaFusion duckdb dataset with verbose=True so the query gets printed out
ds = DuckDbDataset(rel, verbose=True)

alt.data_transformers.enable("vegafusion")

alt.Chart(DuckDbDataset(rel)).mark_bar().encode(x="a", y="b")
DuckDB Query:
WITH tbl_0 AS (SELECT "a", "b" FROM "tbl"), tbl_1 AS (SELECT row_number() OVER () AS "_vf_order", * FROM tbl_0), tbl_2 AS (SELECT "_vf_order" AS "_vf_order", "a" AS "a", "b" AS "b" FROM tbl_1), tbl_3 AS (SELECT count(0) AS "__count", min("_vf_order") AS "_vf_order", "a" FROM tbl_2 GROUP BY "a"), tbl_4 AS (SELECT "_vf_order", "a", "__count" FROM tbl_3), tbl_5 AS (SELECT * FROM tbl_4 WHERE coalesce("a" IS NOT NULL, false)), tbl_6 AS (SELECT "_vf_order", "__count", "a" FROM tbl_5), tbl_7 AS (SELECT * FROM tbl_6 ORDER BY "_vf_order" ASC NULLS LAST) SELECT "__count", "a" FROM tbl_7

visualization (1)

See if this approach works for you

@rlancer
Copy link
Author

rlancer commented Mar 22, 2024

Thank you, that does solve it!

@jonmmease
Copy link
Collaborator

Thank you, that does solve it!

Awesome. I'm thinking long term we may want to deprecate the special "table://" syntax for referencing a table in a connection since this doesn't scale to multiple schema.

@rlancer rlancer closed this Mar 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants