Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logging: log time taken for tablet initialization only once #14597

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Nov 24, 2023

Description

Time taken for tablet init should be logged exactly once, and not every time the tablet goes through a state transition.
The way I have chosen to make this happen is by using sync.Once. However it is open to debate whether the log line should actually be moved to some other location in the code to ensure that it is called exactly once.

With this change, even after running a PRS, no new log line appears of the form Tablet Init took 17899ms.

Related Issue(s)

Fixes #14596

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Nov 24, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Nov 24, 2023
@deepthi deepthi requested review from GuptaManan100 and removed request for systay, shlomi-noach and harshit-gangal November 24, 2023 02:19
@github-actions github-actions bot added this to the v19.0.0 milestone Nov 24, 2023
@deepthi deepthi added Component: TabletManager Type: Internal Cleanup and removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Nov 24, 2023
@deepthi
Copy link
Member Author

deepthi commented Nov 24, 2023

@rohit-nayak-ps WDYT?

However it is open to debate whether the log line should actually be moved to some other location in the code to ensure that it is called exactly once.

@rohit-nayak-ps
Copy link
Contributor

So we don't want it to log for every transition, because the next few times are not actually "Init"s? Are these extra logs confusing and/or redundant information?

However it is open to debate whether the log line should actually be moved to some other location in the code to ensure that it is called exactly once.

I think this location is fine.

One thing that this location doesn't capture is the time taken forOpen()ing of all the components that happens in tmState.UpdateLocked(). iirc I couldn't find a better location which reliably reports it.

@deepthi
Copy link
Member Author

deepthi commented Nov 27, 2023

So we don't want it to log for every transition, because the next few times are not actually "Init"s? Are these extra logs confusing and/or redundant information?

Yes. While debugging an issue in PlanetScale, this log made it seem like Initialization had been hanging for a long time, when in fact it was triggered by a change in primary.

One thing that this location doesn't capture is the time taken for Open()ing of all the components that happens in tmState.UpdateLocked(). iirc I couldn't find a better location which reliably reports it.

This is the part that I was also not able to fully understand. In the Start function, we have:

	defer func() {
		log.Infof("TabletManager Start took ~%d ms", time.Since(servenv.GetInitStartTime()).Milliseconds())
	}()

I would have expected this to print the total amount of time taken until Start finishes, which includes the time taken to execute tm.tmState.Open() at the end of that function. But somehow it does not seem to include that time.

@deepthi
Copy link
Member Author

deepthi commented Nov 27, 2023

I would have expected this to print the total amount of time taken until Start finishes, which includes the time taken to execute tm.tmState.Open() at the end of that function. But somehow it does not seem to include that time.

This is because we run the example with --restore_from_backup (which defaults to true), and in that path, we do everything in a separate goroutine. This means that Start returns early and does not include time taken for the rest of initialization (like opening all the engines).
How will it be if we instead record the time taken by tmState.Open??

@deepthi
Copy link
Member Author

deepthi commented Nov 27, 2023

I tried to move the logging to tmState.Open. What happens in the example is that it takes at least 4 attempts before the "transition" is successful. Each tablet logs an error like this 3 times.

E1127 12:49:49.141542   88106 state_manager.go:286] Error transitioning to the desired state: REPLICA, Serving, will keep retrying: Unknown database 'vt_commerce' (errno 1049) (sqlstate 42000)

So what we have right now is the best we can do. When we eventually call setState we record the time taken until then.
However, we should probably revisit this as part of fixing #14562.
For now, I'm going to merge this PR in its current form.

@deepthi deepthi added Component: Observability Pull requests that touch tracing/metrics/monitoring and removed Component: Observability Pull requests that touch tracing/metrics/monitoring labels Nov 27, 2023
@deepthi deepthi merged commit 2464ab5 into vitessio:main Nov 27, 2023
120 of 121 checks passed
@deepthi deepthi deleted the ds-fix-14596 branch November 27, 2023 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: Tablet Init logs misleading time durations
4 participants