Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom planning behavior for selecting wildcard expression #11639

Closed
goldmedal opened this issue Jul 24, 2024 · 11 comments · Fixed by #11681
Closed

Allow custom planning behavior for selecting wildcard expression #11639

goldmedal opened this issue Jul 24, 2024 · 11 comments · Fixed by #11681
Assignees
Labels
enhancement New feature or request

Comments

@goldmedal
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently, DataFusion always expands the wildcard expression to the list of columns when planning a SELECT statement. This causes problems for those who want to check the wildcard expression in the logical plan layer.

For example, given a table "A" with three columns: c1, c2, and c3, it is difficult to distinguish between SELECT * FROM A and SELECT c1, c2, c3 FROM A in the logical plan layer.

If we provide a way for users to customize the behavior of wildcard expression planning, it will be easier to distinguish between these two cases.

Describe the solution you'd like

I plan to implement this feature based on ExprPlanner. I guess we can add a method called plan_select_wildcard and putting the default expression expansion there. Then, invoke this method in SqlToRel::sql_select_to_rex.
https://github.com/apache/datafusion/blob/main/datafusion/sql/src/select.rs#L594

Describe alternatives you've considered

No response

Additional context

My current project, Wren engine, is a semantic engine based on the DataFusion Logical Plan. We have a type of column called a "calculated field" that can be selected specifically but won't be included when querying using the wildcard expression. Since it could trigger some joins for its parent table, we want to ensure that it is only used when the user specifically selects it.
This feature helps us align with the native behavior of DataFusion.

@goldmedal goldmedal added the enhancement New feature or request label Jul 24, 2024
@goldmedal
Copy link
Contributor Author

take

@goldmedal
Copy link
Contributor Author

Hi @alamb, do you know the purpose of this comment?

// TODO: move it into analyzer
let input_schema = plan.schema();
let mut projected_expr = vec![];
for e in expr {
let e = e.into();
match e {
Expr::Wildcard { qualifier: None } => {
projected_expr.extend(expand_wildcard(input_schema, &plan, None)?)
}
Expr::Wildcard {
qualifier: Some(qualifier),
} => projected_expr.extend(expand_qualified_wildcard(
&qualifier,
input_schema,
None,
)?),

I believe this expansion for wildcard duplicates what the planner does.

let expanded_exprs =
expand_wildcard(plan.schema().as_ref(), plan, Some(&options))?;

I found the wildcard expression is still expanded even when I disable this behavior in the planner. I think the reason is that the planner invokes the LogicalPlanner to build the projection, which will expand the wildcard expression.

fn project(&self, input: LogicalPlan, expr: Vec<Expr>) -> Result<LogicalPlan> {

Could we remove the expansion behavior in the builder? (I guess that's what the comment purposes.)

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Could we remove the expansion behavior in the builder? (I guess that's what the comment purposes.)

I think that is my understanding as well

I believe @jayzhan211 and @tshauck have had some recent discussion about this as well, so maybe they remember more than do I

@tshauck
Copy link
Contributor

tshauck commented Jul 26, 2024

We were working on the related COUNT(*). At least in that case the * is just intercepted as the argument and replaced with 1, so I don't think there'd be a conflict between that work and changing the expansion behavior in the projection itself.

@goldmedal
Copy link
Contributor Author

We were working on the related COUNT(*). At least in that case the * is just intercepted as the argument and replaced with 1, so I don't think there'd be a conflict between that work and changing the expansion behavior in the projection itself.

Thanks, @tshauck. I agree there is no conflict between them.

I'm curious, which PR did you do for this? There are duplicate behaviors for the expansion (in the select all case, we will expand the expression twice). I want to check if this behavior can be removed. If the expansion in the builder is required for another case, I will also need to modify it. If not, I think we can just remove it and address the TODO comment.

@tshauck
Copy link
Contributor

tshauck commented Jul 26, 2024

It was part of the work to support COUNT() (note no *): #11229

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 27, 2024

@goldmedal If we move the expression expansion to analyzer as the comment said, I guess you can still tell the difference from SELECT * FROM A and SELECT c1, c2, c3 FROM A before optimizer. I think expand expression in builder is quite early, if there are 1000 columns, we need to carry those columns around in many places. It is better to delay expansion lazily as long as possible.

@goldmedal
Copy link
Contributor Author

@goldmedal If we move the expression expansion to analyzer as the comment said, I guess you can still tell the difference from SELECT * FROM A and SELECT c1, c2, c3 FROM A before optimizer. I think expand expression in builder is quite early, if there are 1000 columns, we need to carry those columns around in many places. It is better to delay expansion lazily as long as possible.

I think I got it. So, ideally, we don't do any expansions in the planner or builder. We need to have an analyze rule or optimize rule to expand the expression. If we move this to analyze or optimizer, I can get the wildcard expression before optimizing, even if we don't add a customized behavior in ExprPlanner, right?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 27, 2024 via email

@goldmedal
Copy link
Contributor Author

I have drafted a version #11681 for moving wildcard expansions to the analyzer. I think it's better than #11673. I might change the purpose of this PR. WDYT @alamb, @jayzhan211 ?

By the way, I'm thinking about handling replace syntax. We can discuss this issue in the PR if you have some ideas.

@jayzhan211
Copy link
Contributor

I think handling replace item in options is a good idea. We store the replace item in Expr::Wildcard in planning stage and convert it to expanded columns with expected replace items in analyzer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants