K8SPXC-1482: Add waiting period after peer list update #1854

Open
wants to merge 1 commit into main

Conversation

@s10 (Contributor) commented Oct 24, 2024

K8SPXC-1482

This helps to avoid acting too frequently on stale DNS resolves

CHANGE DESCRIPTION

Problem:
peer-list updates might be issued several times after database pod recreation.

Cause:
Currently, peer-list runs an infinite loop with a fixed 1-second period and watches for SRV record updates
using the Go net.LookupSRV function.
Unfortunately, this doesn't account for a possible TTL in the Kubernetes DNS.

Solution:
Add a 30-second waiting period after a peer-list update (see the sketch below).
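
The gist of the change, as a minimal Go sketch (illustrative only; the helper names, loop structure, and the example service name are assumptions, not the actual peer-list code): poll the SRV record every second, but after reacting to a change, back off for 30 seconds so answers served from a stale DNS cache within the TTL window don't re-trigger the update.

```go
// Minimal sketch, not the actual peer-list implementation.
package main

import (
	"log"
	"net"
	"sort"
	"strings"
	"time"
)

// lookupPeers resolves the SRV record for the given name and returns the
// targets as a sorted, comparable list.
func lookupPeers(name string) ([]string, error) {
	_, addrs, err := net.LookupSRV("", "", name)
	if err != nil {
		return nil, err
	}
	peers := make([]string, 0, len(addrs))
	for _, a := range addrs {
		peers = append(peers, a.Target)
	}
	sort.Strings(peers)
	return peers, nil
}

// watch polls the SRV record every second and calls onChange when the peer
// set differs from the last observed one. After a change it waits 30 seconds
// before polling again, so stale answers served within the DNS TTL window
// don't trigger another (spurious) update.
func watch(name string, onChange func([]string)) {
	var last []string
	for {
		wait := time.Second
		peers, err := lookupPeers(name)
		if err == nil && strings.Join(peers, ",") != strings.Join(last, ",") {
			last = peers
			onChange(peers)
			wait = 30 * time.Second
		}
		time.Sleep(wait)
	}
}

func main() {
	// Hypothetical headless-service name, for illustration only.
	watch("cluster1-pxc", func(peers []string) {
		log.Printf("peers changed: %v", peers)
	})
}
```

In this sketch the first change is still detected within about a second; only reactions to subsequent changes are delayed, which is the trade-off discussed in the conversation below.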

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

This helps to avoid acting too frequently on stale DNS resolves
pull-request-size bot added the size/S (10-29 lines) label Oct 24, 2024
@s10 changed the title from "K8SPXC-1482 Add waiting period after peer list update" to "K8SPXC-1482: Add waiting period after peer list update" Oct 24, 2024
@egegunes self-assigned this Oct 28, 2024
@egegunes (Contributor) commented

@s10 I am having a hard time understanding the reasoning behind the 30-second interval. For example, if some pod goes down in this period, we'll need to wait for this to be reflected. Can you explain more about these changes?

@s10 (Contributor, Author) commented Nov 11, 2024

@s10 I am having a hard time understanding the reasoning behind the 30-second interval. For example, if some pod goes down in this period, we'll need to wait for this to be reflected. Can you explain more about these changes?
The failure mode of this is the following:

If one pod goes down and then a second pod goes down, then yes, the update for that second pod would wait 30 seconds. I consider such behaviour a better option than giving an application 10-20 MySQL connection resets within 30 seconds.

Here is an excerpt from the linked issue, describing the problem this PR tries to ease.

  1. If the CoreDNS TTL is configured to be 30s and the number of CoreDNS pods is >1, then, after removal of a pxc pod, CoreDNS might respond with either a new SRV entry (with the deleted pod absent) or an old one (with the deleted pod still present).

  2. The check is performed every 1 second, so stale results might be present in the response and flap until the TTL expires.

  3. Every time peer-list receives a stale entry after previously being updated, it removes or adds the pxc pod to the list of cluster members.

  4. The haproxy_add_pxc_nodes.sh script, which executes on every peer-list change, updates the HAProxy configuration and sends a reload signal.

  5. The HAProxy reload causes existing connections to be closed after 10 seconds.

This connection reset cycle might happen several times (I sometimes got 10-20 resets after pxc-2 pod removal),
even for a non-primary pod removal, degrading the experience of the database users.
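
As a rough illustration of the arithmetic (my estimate, not stated in the ticket): with a 1-second polling period and a 30-second TTL, a lookup that flaps between stale and fresh answers can register a "change" on a large fraction of the roughly 30 polls inside the TTL window; each detected change reloads HAProxy, and each reload closes existing connections after 10 seconds, which lines up with the 10-20 resets observed above. A 30-second back-off after the first reaction collapses that window into a single update.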

@egegunes (Contributor) commented

@s10 Now I understand better, thank you. I agree 30 seconds sounds more reasonable. But I also think we should see if we can adjust the behavior of add_pxc_nodes.sh so the side effects are less disruptive, since people can configure TTLs of up to 1 hour.

[Sorry for the late reply, I missed this notification.]

@egegunes added this to the v1.16.0 milestone Nov 18, 2024
@s10 (Contributor, Author) commented Nov 18, 2024

@s10 Now I understand better, thank you. I agree 30 seconds sounds more reasonable. But I also think we should see if we can adjust the behavior of add_pxc_nodes.sh so the side effects are less disruptive, since people can configure TTLs of up to 1 hour.

[Sorry for the late reply, I missed this notification.]

@egegunes

It would be nice to soften this side effect. The problem is that the disruption is caused by percona/percona-docker#893, and it can't easily be changed back to a larger timeout without bringing back K8SPXC-1335.

@JNKPercona (Collaborator) commented

Test name Status
affinity-8-0 passed
auto-tuning-8-0 passed
cross-site-8-0 passed
demand-backup-cloud-8-0 failure
demand-backup-encrypted-with-tls-8-0 passed
demand-backup-8-0 passed
haproxy-5-7 passed
haproxy-8-0 passed
init-deploy-5-7 passed
init-deploy-8-0 passed
limits-8-0 passed
monitoring-2-0-8-0 passed
one-pod-5-7 passed
one-pod-8-0 passed
pitr-8-0 passed
pitr-gap-errors-8-0 passed
proxy-protocol-8-0 passed
proxysql-sidecar-res-limits-8-0 passed
pvc-resize-5-7 passed
pvc-resize-8-0 passed
recreate-8-0 passed
restore-to-encrypted-cluster-8-0 passed
scaling-proxysql-8-0 passed
scaling-8-0 passed
scheduled-backup-5-7 passed
scheduled-backup-8-0 failure
security-context-8-0 passed
smart-update1-8-0 passed
smart-update2-8-0 passed
storage-8-0 passed
tls-issue-cert-manager-ref-8-0 passed
tls-issue-cert-manager-8-0 passed
tls-issue-self-8-0 passed
upgrade-consistency-8-0 passed
upgrade-haproxy-5-7 passed
upgrade-haproxy-8-0 passed
upgrade-proxysql-5-7 passed
upgrade-proxysql-8-0 passed
users-5-7 failure
users-8-0 passed
validation-hook-8-0 passed
We ran 41 out of 41

commit: 78570cd
image: perconalab/percona-xtradb-cluster-operator:PR-1854-78570cd4

@hors modified the milestones: v1.16.0, v1.17.0 Dec 3, 2024