Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

planner: Prevent OR Expressions in IN Subquery Conversion to INNER JOIN to Avoid Data Duplication #57584

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dash12653
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #57390

Problem Summary:

What changed and how does it work?

This is the plan:

tidb> explain 
SELECT c0, c1 
FROM t1 
WHERE (c1 OR c1) IN (SELECT c1 FROM t1 WHERE (c1 <= (0x991D3FA2F9C))) AND ((c0 AND 8.98447659672538e+29));
+--------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                                                                                                                                                       |
+--------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| HashJoin_11                    | 2.66    | root      |               | semi join, equal:[eq(Column#7, Column#8)]                                                                                                                                           |
| ├─Projection_16(Build)         | 1.66    | root      |               | cast(fix_57390.t1.c1, double BINARY)->Column#8                                                                                                                                      |
| │ └─TableReader_19             | 1.66    | root      |               | data:Selection_18                                                                                                                                                                   |
| │   └─Selection_18             | 1.66    | cop[tikv] |               | le(fix_57390.t1.c1, "0x0991d3fa2f9c")                                                                                                                                               |
| │     └─TableFullScan_17       | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                                                                                                      |
| └─Projection_12(Probe)         | 3.33    | root      |               | fix_57390.t1.c0, fix_57390.t1.c1, cast(or(istrue_with_null(cast(fix_57390.t1.c1, double BINARY)), istrue_with_null(cast(fix_57390.t1.c1, double BINARY))), double BINARY)->Column#7 |
|   └─TableReader_15             | 3.33    | root      |               | data:Selection_14                                                                                                                                                                   |
|     └─Selection_14             | 3.33    | cop[tikv] |               | fix_57390.t1.c0                                                                                                                                                                     |
|       └─TableFullScan_13       | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                                                                                                                      |
+--------------------------------+---------+-----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

When tidb encounters an IN subquery, it tries to rewrite the outer table and subquery as INNER JOIN.

// If it's not the form of `not in (SUBQUERY)`,
// and has no correlated column from the current level plan(if the correlated column is from upper level,
// we can treat it as constant, because the upper LogicalApply cannot be eliminated since current node is a join node),
// and don't need to append a scalar value, we can rewrite it to inner join.
if planCtx.builder.ctx.GetSessionVars().GetAllowInSubqToJoinAndAgg() && !v.Not && !asScalar && len(corCols) == 0 && collFlag {
// We need to try to eliminate the agg and the projection produced by this operation.
planCtx.builder.optFlag |= rule.FlagEliminateAgg
planCtx.builder.optFlag |= rule.FlagEliminateProjection
planCtx.builder.optFlag |= rule.FlagJoinReOrder
// Build distinct for the inner query.
agg, err := planCtx.builder.buildDistinct(np, np.Schema().Len())
if err != nil {
er.err = err
return v, true
}
// Build inner join above the aggregation.
join := logicalop.LogicalJoin{JoinType: logicalop.InnerJoin}.Init(planCtx.builder.ctx, planCtx.builder.getSelectOffset())

When rewriting OR expressions (e.g., (c1 OR c1)), the kernel will convert the type of c1 to DOUBLE.

// WrapWithCastAsReal wraps `expr` with `cast` if the return type of expr is not
// type real, otherwise, returns `expr` directly.
func WrapWithCastAsReal(ctx BuildContext, expr Expression) Expression {
if expr.GetType(ctx.GetEvalCtx()).EvalType() == types.ETReal {
return expr
}
tp := types.NewFieldType(mysql.TypeDouble)
tp.SetFlen(mysql.MaxRealWidth)
tp.SetDecimal(types.UnspecifiedLength)
types.SetBinChsClnFlag(tp)
tp.AddFlag(expr.GetType(ctx.GetEvalCtx()).GetFlag() & (mysql.UnsignedFlag | mysql.NotNullFlag))
return BuildCastFunction(ctx, expr, tp)
}

The join's condition is (c1 OR c1) = c1.

We can check the output of the join condition for the left and right tables after converting INNER JOIN.

left output:

tidb> select (c1 or c1) from t1;
+------------+
| (c1 or c1) |
+------------+
|          0 |
|          0 |
|          0 |
|          0 |
|          0 |
+------------+
5 rows in set, 10 warnings (0.00 sec)

right output:

tidb> SELECT cast(c1 as double) FROM t1 WHERE (c1 <= (0x991D3FA2F9C)) group by c1;
+--------------------+
| cast(c1 as double) |
+--------------------+
|                  0 |
|                  0 |
|                  0 |
+--------------------+
3 rows in set, 3 warnings (0.01 sec)

tidb> 

After the DISTINCT operation, CAST(c1 AS DOUBLE) actually causes data duplication, which will lead to unexpected results for in operation.

To address this, I add a restriction who will check if the left expression is or function in the conditions for converting to an INNER JOIN.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-triage-completed release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner labels Nov 21, 2024
Copy link

ti-chi-bot bot commented Nov 21, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign winoros for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 21, 2024
Copy link

ti-chi-bot bot commented Nov 21, 2024

Hi @dash12653. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

tiprow bot commented Nov 21, 2024

Hi @dash12653. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hawkingrei
Copy link
Member

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Nov 21, 2024
Copy link

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.3545%. Comparing base (91c14a4) to head (35e8289).
Report is 3 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #57584        +/-   ##
================================================
+ Coverage   72.8590%   73.3545%   +0.4955%     
================================================
  Files          1676       1676                
  Lines        463653     463865       +212     
================================================
+ Hits         337813     340266      +2453     
+ Misses       104979     102861      -2118     
+ Partials      20861      20738       -123     
Flag Coverage Δ
integration 43.4425% <50.0000%> (?)
unit 72.1824% <100.0000%> (-0.0740%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.7673% <ø> (ø)
parser ∅ <ø> (∅)
br 45.4413% <ø> (-0.0091%) ⬇️
---- 🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/needs-triage-completed ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicate Rows in Query with Binary Condition
2 participants