ddl: dynamically adjusting the max write speed of reorganization job #57611

Open
wants to merge 40 commits into
base: master

fzzf678
Contributor

@fzzf678 fzzf678 commented Nov 21, 2024

What problem does this PR solve?

Issue Number: ref #57526 #57229

Problem Summary: See #57229

What changed and how does it work?

Supply later

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. sig/planner SIG: Planner size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 21, 2024

tiprow bot commented Nov 21, 2024

Hi @fzzf678. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fzzf678
Contributor Author

fzzf678 commented Nov 21, 2024

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Nov 21, 2024
@fzzf678
Contributor Author

fzzf678 commented Nov 21, 2024

/test all


codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 66.32653% with 33 lines in your changes missing coverage. Please review.

Project coverage is 74.1107%. Comparing base (6e22b8c) to head (3173317).
Report is 6 commits behind head on master.

Current head 3173317 differs from pull request most recent head a1cc392

Please upload reports for the commit a1cc392 to get more accurate results.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #57611        +/-   ##
================================================
+ Coverage   72.7907%   74.1107%   +1.3199%     
================================================
  Files          1676       1721        +45     
  Lines        463750     471832      +8082     
================================================
+ Hits         337567     349678     +12111     
+ Misses       105323      99849      -5474     
- Partials      20860      22305      +1445     
Flag Coverage Δ
integration 46.8841% <48.8888%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.7673% <ø> (ø)
parser ∅ <ø> (∅)
br 60.9516% <ø> (+15.5198%) ⬆️

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 22, 2024
@@ -75,8 +75,9 @@ type DDLReorgMeta struct {
// These two variables are used to control the concurrency and batch size of the reorganization process.
// They can be adjusted dynamically through `admin alter ddl jobs` command.
// Note: Don't get or set these two variables directly, use the functions instead.
Member

Please make these 3 var private

Contributor Author

OK, I will try to make these vars private in the next PR.

Member

@CbcWestwolf CbcWestwolf left a comment

Rest LGTM

pkg/meta/model/reorg.go (outdated, resolved)

ti-chi-bot bot commented Nov 22, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CbcWestwolf, winoros
Once this PR has been reviewed and has the lgtm label, please assign gmhdbjd for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 22, 2024

ti-chi-bot bot commented Nov 22, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-11-22 06:03:08.187779729 +0000 UTC m=+184375.807434246: ☑️ agreed by winoros.
  • 2024-11-22 06:33:40.782162927 +0000 UTC m=+186208.401817441: ☑️ agreed by CbcWestwolf.

@fzzf678
Contributor Author

fzzf678 commented Nov 22, 2024

/cc @lance6716, please help take a look

pkg/ddl/backfilling.go (outdated, resolved)
pkg/lightning/backend/local/localhelper.go (outdated, resolved)
pkg/lightning/backend/local/localhelper.go (outdated, resolved)
pkg/lightning/backend/local/region_job.go (outdated, resolved)
pkg/lightning/backend/local/region_job.go (outdated, resolved)
// GetMaxWriteSpeedOrDefault gets the max write speed from DDLReorgMeta.
// 0 means no limit.
func (dm *DDLReorgMeta) GetMaxWriteSpeedOrDefault(defaultVal int) int {
if dm == nil {
Contributor

This should be handled in a unified way at the place where it is unmarshalled, rather than checked in every method.

if dm == nil {
return defaultVal
}
return int(atomic.LoadInt64(&dm.MaxWriteSpeed))
Contributor

use atomic.XXX type

Contributor

I guess the reason is that atomic.XXX does not implement the (un)marshal interfaces, so we must store the plain type in the struct and use helper functions.

Contributor

Go's standard sync/atomic types don't, but Uber's (go.uber.org/atomic) do.

Contributor Author

@fzzf678 fzzf678 Nov 25, 2024

If we use the atomic.XXX type, would these three fields hit an unmarshal error when the job meta comes from an old version (upgrade)? It seems not, because the wrapper stores a plain int64 underneath:

// MarshalJSON encodes the wrapped int64 into JSON.
func (i *Int64) MarshalJSON() ([]byte, error) {
	return json.Marshal(i.Load())
}

// UnmarshalJSON decodes JSON into the wrapped int64.
func (i *Int64) UnmarshalJSON(b []byte) error {
	var v int64
	if err := json.Unmarshal(b, &v); err != nil {
		return err
	}
	i.Store(v)
	return nil
}
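
To illustrate the upgrade question above, here is a minimal sketch, not the PR's code. It assumes a hypothetical local wrapper `Int64` built on the standard `sync/atomic.Int64` that mirrors the two quoted methods, plus hypothetical `oldMeta` and `newMeta` structs and a `max_write_speed` JSON key chosen only for illustration. It encodes a plain `int64` field the way an older version would and decodes it into the wrapper type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"
)

// Int64 is a hypothetical wrapper (an assumption for this sketch) that
// mirrors the JSON methods quoted above on top of sync/atomic.Int64.
type Int64 struct {
	v atomic.Int64
}

// Load atomically reads the wrapped value.
func (i *Int64) Load() int64 { return i.v.Load() }

// Store atomically writes the wrapped value.
func (i *Int64) Store(n int64) { i.v.Store(n) }

// MarshalJSON encodes the wrapped int64 as a plain JSON number.
func (i *Int64) MarshalJSON() ([]byte, error) {
	return json.Marshal(i.Load())
}

// UnmarshalJSON decodes a plain JSON number into the wrapped int64.
func (i *Int64) UnmarshalJSON(b []byte) error {
	var n int64
	if err := json.Unmarshal(b, &n); err != nil {
		return err
	}
	i.Store(n)
	return nil
}

// oldMeta models a job meta written by an older version (plain int64).
type oldMeta struct {
	MaxWriteSpeed int64 `json:"max_write_speed"`
}

// newMeta models the same meta after switching to the atomic wrapper.
type newMeta struct {
	MaxWriteSpeed Int64 `json:"max_write_speed"`
}

// roundTrip encodes an old-style meta and decodes it as a new-style one.
func roundTrip(speed int64) (int64, error) {
	b, err := json.Marshal(oldMeta{MaxWriteSpeed: speed})
	if err != nil {
		return 0, err
	}
	var m newMeta
	if err := json.Unmarshal(b, &m); err != nil {
		return 0, err
	}
	return m.MaxWriteSpeed.Load(), nil
}

func main() {
	got, err := roundTrip(1 << 20)
	fmt.Println(got, err) // prints "1048576 <nil>"
}
```

One caveat with this shape: because `MarshalJSON` has a pointer receiver, encoding should go through a pointer to the enclosing struct (for example `json.Marshal(&m)`), otherwise `encoding/json` cannot reach the method on a non-addressable value.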

pkg/executor/operate_ddl_jobs.go (outdated, resolved)
pkg/lightning/backend/local/localhelper_test.go (outdated, resolved)
pkg/lightning/backend/local/localhelper.go (outdated, resolved)
@@ -546,6 +546,9 @@ func (job *Job) Encode(updateRawArgs bool) ([]byte, error) {
// decode special args for this job.
func (job *Job) Decode(b []byte) error {
err := json.Unmarshal(b, job)
if job.MayNeedReorg() && job.ReorgMeta == nil {
Contributor

MayNeedReorg depends on a context variable that is set in the DDL executor; when the job worker runs it, that variable is not there.

Maybe move this check to where job.ReorgMeta is used.

Contributor Author

update in a128426 PTAL again

@D3Hunter
Contributor

rest lgtm

pkg/planner/core/planbuilder_test.go (outdated, resolved)
pkg/planner/core/planbuilder_test.go (outdated, resolved)
pkg/planner/core/planbuilder_test.go (outdated, resolved)
pkg/planner/core/planbuilder.go (outdated, resolved)
speed int64
err error
)
v := opt.Value.(*expression.Constant)
Contributor

@lance6716 lance6716 Nov 25, 2024

Add unit tests to make sure it will always be a constant. For example, ALTER ... max_write_speed = RAND() should fail with a friendly error message.

case types.ETInt:
speed = v.Value.GetInt64()
default:
return 0, fmt.Errorf("the value %s for %s is invalid", opt.Name, opt.Value)
Contributor

add a unit test to check the error message is readable in this case.

speedStr := v.Value.GetString()
speed, err = units.RAMInBytes(speedStr)
if err != nil {
return 0, errors.Trace(err)
Contributor

Also check that the error message is readable when parsing fails.

gotTokens += n
}
elapsed := time.Since(start)
maxTokens := 120 + int(float64(elapsed.Seconds())*float64(maxT))
Contributor

elapsed.Seconds() is already float64. Your IDE should notice this for you 🤔

elapsed := time.Since(start)
maxTokens := 120 + int(float64(elapsed.Seconds())*float64(maxT))
// In theory, gotTokens should be less than or equal to maxT.
// But we allow a little of error to avoid the test being flaky.
Contributor

Did you find that this test is flaky? I think the sleep-related functions guarantee that they block for at least the requested duration, so gotTokens should always be less than or equal to maxT * elapsed.

Contributor

Oh, I see: it's because our limiter will "Allow burst of at most 20% of the writeLimit." We can change the hardcoded 120 to 1.2*maxT then.

Contributor Author

update in a1cc392
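
As a rough illustration of the bound discussed in this thread, the sketch below computes the largest token count the test could accept: the burst allowance (taken as 1.2*maxT, following the suggestion above) plus whatever the limiter refills during the elapsed time. The function name `maxTokensAllowed` and the constant `burstPercent` are hypothetical, and integer arithmetic is used for the burst term only to keep the result exact.

```go
package main

import (
	"fmt"
	"time"
)

// burstPercent is a hypothetical constant expressing the 1.2*maxT burst
// allowance from the review comment as a percentage.
const burstPercent = 120

// maxTokensAllowed returns an upper bound on the tokens a rate limiter
// refilling at maxT tokens per second could grant within elapsed,
// including the initial burst allowance.
func maxTokensAllowed(maxT int, elapsed time.Duration) int {
	burst := maxT * burstPercent / 100
	// elapsed.Seconds() already returns float64, so no extra conversion is needed.
	refilled := int(elapsed.Seconds() * float64(maxT))
	return burst + refilled
}

func main() {
	fmt.Println(maxTokensAllowed(100, 0))             // prints 120
	fmt.Println(maxTokensAllowed(100, 2*time.Second)) // prints 320
}
```

With maxT = 100 this reproduces the previously hardcoded 120, and the bound scales automatically if the test later uses a different limit.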

@@ -1195,6 +1195,17 @@ func TestAdminAlterDDLJobUnsupportedCases(t *testing.T) {
tk.MustGetErrMsg("admin alter ddl jobs 1 thread = 257;", "the value 257 for thread is out of range [1, 256]")
tk.MustGetErrMsg("admin alter ddl jobs 1 batch_size = 31;", "the value 31 for batch_size is out of range [32, 10240]")
tk.MustGetErrMsg("admin alter ddl jobs 1 batch_size = 10241;", "the value 10241 for batch_size is out of range [32, 10240]")
tk.MustGetErrMsg("admin alter ddl jobs 1 max_write_speed = '2PiB';", "the value 2PiB for max_write_speed is out of range [0, 1125899906842624]")
Contributor Author

Some error messages, cc @lance6716

Contributor

@lance6716 lance6716 Nov 25, 2024

The negative test cases are not enough. Please check at least max_write_speed = 1.23, max_write_speed = RAND(), max_write_speed = 30+40, max_write_speed = 'asdasd', and maybe ask other reviewers to come up with more cases.

Contributor Author

@fzzf678 fzzf678 Nov 25, 2024

Added some cases and fixed the comment in d71a826, PTAL again.
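
For readers following the validation discussion, here is a self-contained sketch of the kind of string validation being tested. It is not the PR's implementation: `parseMaxWriteSpeed` and `unitMultipliers` are hypothetical names, the parser is a simplified stand-in that does not reproduce the behaviour of `units.RAMInBytes`, and the SQL-expression cases (RAND(), 30+40) are out of scope because they would be rejected before a string ever reaches this point. Only the error wording and the range [0, 1125899906842624] are taken from the test lines quoted above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxWriteSpeedLimit is 1 PiB, the upper bound shown in the quoted test.
const maxWriteSpeedLimit = int64(1) << 50

// unitMultipliers is a hypothetical table of the binary suffixes this
// sketch accepts; the real parser may accept more forms.
var unitMultipliers = []struct {
	suffix string
	mult   int64
}{
	{"PiB", 1 << 50}, {"TiB", 1 << 40}, {"GiB", 1 << 30},
	{"MiB", 1 << 20}, {"KiB", 1 << 10},
}

// parseMaxWriteSpeed converts strings such as "64MiB" or "1024" into a
// byte count and rejects malformed or out-of-range values.
func parseMaxWriteSpeed(s string) (int64, error) {
	num, mult := strings.TrimSpace(s), int64(1)
	for _, u := range unitMultipliers {
		if strings.HasSuffix(num, u.suffix) {
			num, mult = strings.TrimSuffix(num, u.suffix), u.mult
			break
		}
	}
	n, err := strconv.ParseInt(strings.TrimSpace(num), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("the value %s for max_write_speed is invalid", s)
	}
	// Dividing the limit by the multiplier avoids overflow in n * mult.
	if n < 0 || n > maxWriteSpeedLimit/mult {
		return 0, fmt.Errorf("the value %s for max_write_speed is out of range [0, %d]",
			s, maxWriteSpeedLimit)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"64MiB", "2PiB", "1.23", "asdasd"} {
		v, err := parseMaxWriteSpeed(in)
		fmt.Println(in, v, err)
	}
}
```

A table-driven test over inputs like these makes it easy for other reviewers to append further negative cases.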

@pingcap pingcap deleted a comment from ti-chi-bot bot Nov 25, 2024
@pingcap pingcap deleted a comment from tiprow bot Nov 25, 2024

tiprow bot commented Nov 25, 2024

@fzzf678: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name         Commit   Details  Required  Rerun command
fast_test_tiprow  a1cc392  link     true      /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants