Potential error rate regression during TiKV store leader transfer #59239

Open
HaoW30 opened this issue Feb 3, 2025 · 0 comments
Labels
component/pd type/bug The issue is confirmed as a bug.

HaoW30 commented Feb 3, 2025

Bug Report

Because of how TiDB implements max_execution_time, some queries can still complete after exceeding the specified limit. However, since the change introduced in #56923, we have observed an increased query error rate during TiKV store leader transfers when a strict, subsecond max_execution_time (for example, 500ms) is used.

1. Minimal reproduce step (Required)

  1. Simulate a scenario where a TiKV node experiences EBS latency issues, triggering leader transfers away from this node.
  2. When TiDB attempts to read from regions undergoing leader transfer, it encounters notLeader errors without receiving new leader information.
  3. Before #56923:
  • max_execution_time was not propagated to the backoff context.
  • The TiKV client could retry multiple times (ref) until it located the new leader (or tried a follower) and completed the query, even if that exceeded max_execution_time.
  4. After #56923:
  • max_execution_time is now propagated to the backoff context.
  • The request is canceled once the backoff (or another function, such as s.client.SendRequest) detects that the context timeout has been reached (ref), resulting in more query failures during leader transfers.
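
The difference between the two behaviors can be illustrated with a minimal, self-contained Go sketch. It is not the TiDB or client-go implementation: fakeRegion, retryWithBackoff, and the timing values are hypothetical stand-ins, chosen only to show how a retry loop behaves when its context carries no deadline compared with when it carries a subsecond one.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotLeader is a hypothetical stand-in for a notLeader region error.
var errNotLeader = errors.New("notLeader")

// fakeRegion simulates a region whose new leader only becomes
// reachable after readyAfter failed read attempts.
type fakeRegion struct {
	attempts   int
	readyAfter int
}

func (r *fakeRegion) read() error {
	r.attempts++
	if r.attempts <= r.readyAfter {
		return errNotLeader
	}
	return nil
}

// retryWithBackoff retries the read and sleeps between attempts.
// The sleep is abandoned as soon as ctx is done, which mirrors a
// backoff that honors the context deadline.
func retryWithBackoff(ctx context.Context, r *fakeRegion, sleep time.Duration, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		if err := r.read(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(sleep):
		}
	}
	return errNotLeader
}

// runWithoutDeadline models the earlier behavior: the backoff context
// has no deadline, so retries continue until the leader is found.
func runWithoutDeadline() error {
	r := &fakeRegion{readyAfter: 5}
	return retryWithBackoff(context.Background(), r, 20*time.Millisecond, 10)
}

// runWithDeadline models the later behavior: the execution limit
// (50ms here, an assumed value) is attached to the backoff context.
func runWithDeadline() error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	r := &fakeRegion{readyAfter: 5}
	return retryWithBackoff(ctx, r, 20*time.Millisecond, 10)
}

func isDeadlineExceeded(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("before (no deadline on backoff):", runWithoutDeadline())
	fmt.Println("after (deadline on backoff):", runWithDeadline())
}
```

In this sketch the first run takes roughly 100ms and succeeds even though that is longer than the 50ms limit, while the second run is cut off during backoff and returns a deadline error before the new leader is reached.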

2. What did you expect to see? (Required)

See below

3. What did you see instead (Required)

While it's hard to say this is a real "bug," the stricter enforcement of max_execution_time has led to a noticeable increase in query errors during TiKV leader transfers. This behavioral change is significant and worth attention, as it affects query reliability under certain failure scenarios.

4. What is your TiDB version? (Required)

v6.5.4, but this issue very likely applies to later versions as well.

@HaoW30 HaoW30 added the type/bug The issue is confirmed as a bug. label Feb 3, 2025