Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support duplicate column names in Joins in Substrait consumer #11049

Merged
merged 5 commits into from
Jun 24, 2024

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jun 21, 2024

Which issue does this PR close?

Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.

Rationale for this change

Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.

When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.

This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places (there’s probably more but these are the ones we’ve hit so far):

  1. when joining schemas, in DFSchema::check_names. This throws when there are two duplicate unqualified names, or if a unqualified name matches the name of a qualified name. This was problematic for joining VirtualTables since they would be unqualified.
  2. Interestingly, the (1) does NOT throw when there are two duplicate qualified names. That's why the existing tests were working. However, this results in a schema that contains duplicate fields, which then fail if those fields are actually trying to be used - that's Projects require unique expressions names error in substrait producer/consumer #10815

What changes are included in this PR?

I tried first a slightly different solution in #11048, where I'd create a unique reference for each table at the beginning. However that solution has some gaps (it might not work if there's a project between join and a read, e.g.) and is a bit nasty overall.

While thinking about it, I concluded there might be a simpler solution: just (re)qualify the tables before joining them. This is what this PR does. We check, before a join, if there are columns that appear both on the left and right sides, and if there are, we requalify both sides (to left and right). As Substrait only cares about the position of the column, the (re)qualifying doesn't affect the rest of the plan (other than the qualifiers of the output columns).

This ended up looking much better to me. The table qualifiers do now change within the plan created from Substrait, but only if necessary to avoid conflicts.

This PR also separately fixes a case in the final schema aliasing where we'd unnecessarily add aliases to columns to drop table qualifiers (ie. if our plan had a column table.a but the Substrait names defines the name to be a, we'd alias that: table.a AS a. But that's unnecessary since we should only care about the name, not the qualifier.

Are these changes tested?

Tested by some stricter roundtrip unit tests, + adapting existing tests.

Are there any user-facing changes?

Plans produced by Substrait consumer may get different qualifiers for output columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.

Otherwise just improved support for Substrait plans.

let new_name = new_field.name();
match &new_expr {
Expr::Column(c) => {
// If expr is a column reference, alias_if_changed would cause an aliasing if the old expr has a qualifier
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always calling alias_if_changed was causing things like .. table.a AS a in the final project, which is just unnecessary and slightly annoying in some testing cases (but shouldn't matter much in reality since the name is the same).

@@ -628,32 +660,6 @@ async fn qualified_catalog_schema_table_reference() -> Result<()> {
roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await
}

#[tokio::test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these tests are fully replaced by the stricter roundtrip_self_join tests I added above.

These tests worked before since they hit the case (1) - the tables have overlapping names but they're both qualified, so the DFSchema::check_names() doesn't mind, and they don't read those overlapping columns in the final result.

They would pass with the changes here as well, they'd just have additional SubqueryAliases.

@@ -707,20 +713,17 @@ async fn roundtrip_literal_struct() -> Result<()> {
#[tokio::test]
async fn roundtrip_values() -> Result<()> {
// TODO: would be nice to have a struct inside the LargeList, but arrow_cast doesn't support that currently
let values = "(\
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are unrelated, but I was annoyed about the two assert inside this test while developing and one of them was failing - so I split them out.

There was no real benefit to using a large list of types for the empty relation.

// Test LogicalPlan::EmptyRelation
roundtrip(format!("SELECT * FROM (VALUES {values}) LIMIT 0").as_str()).await
#[tokio::test]
async fn roundtrip_values_duplicate_column_join() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test for the case (2)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Blizzara -- looks great to me

cc @waynexia

@@ -317,16 +317,27 @@ fn rename_expressions(
.into_iter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As rename_expressions is getting more feature-full perhaps we can add some doc comments explaining what it is doing (I think it is basically renaming expressions from substrait to match DataFusion's expectations?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added some docs in 87829ec

@@ -846,6 +861,24 @@ pub async fn from_substrait_rel(
}
}

fn requalify_sides_if_needed(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could add some of the excellent context from this PR's description as docstrings here to help future readers understand what is going on

This is not strictly necessary as I think the comments in the function explain it, but docstrings would still be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a short version in 87829ec

@Blizzara Blizzara force-pushed the avo/substrait-join-qualify-sides branch from f830339 to 87829ec Compare June 24, 2024 07:19
@Blizzara
Copy link
Contributor Author

Blizzara commented Jun 24, 2024

Thanks for the review @alamb - I added some docs in 87829ec, and also cleaned/improved stuff a bit.

However I also realized this approach falls short form providing full guarantee of uniqueness either, since the left or right side itself may contain duplicate names with different qualifiers, which then would turn into full duplicates once the side gets aliased. Unfortunately I don't see a way around that currently other than adding a lot more complexity into the renaming to basically qualify each column separately or something. (Or switching DataFusion to use something else than column name to refer to columns, e.g. position or some unique expr ID, but that's probably quite a big change 😅 ) Luckily I don't expect that to be very common case, so maybe we can go with just this for now and then improve on it later if needed.

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

Unfortunately I don't see a way around that currently other than adding a lot more complexity into the renaming to basically qualify each column separately or something. (Or switching DataFusion to use something else than column name to refer to columns, e.g. position or some unique expr ID, but that's probably quite a big change 😅 )

I agree this is a fundamental design difference (identifying columns by string or by offset). I think what you have in this PR is good enough for now, and we can figure out how to improve the situation if it turns out handling duplicated names within a relation is a common case.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks agian @Blizzara

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

I pushed a commit up to this PR to get CI passing

@Blizzara
Copy link
Contributor Author

Blizzara commented Jun 24, 2024

I agree this is a fundamental design difference (identifying columns by string or by offset). I think what you have in this PR is good enough for now, and we can figure out how to improve the situation if it turns out handling duplicated names within a relation is a common case.

Yup

I pushed a commit up to this PR to get CI passing

Ah thanks, I hadn't noticed that failure, looks good.

Thanks agian @Blizzara

My pleasure :)

@alamb alamb merged commit 3ff0bfe into apache:main Jun 24, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

🚀

@Blizzara Blizzara deleted the avo/substrait-join-qualify-sides branch June 28, 2024 07:14
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…pache#11049)

* add tests for joining tables with same name

* alias tables just before joins instead, for a simpler solution

* test cleanup

* docstrings and some cleanup

* Clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Projects require unique expressions names error in substrait producer/consumer
2 participants