K8SPSMDB-1249: Fix smart update for pods that are not member of replset #1781

Merged
hors merged 13 commits into main from K8SPSMDB-1249
Jan 14, 2025

Conversation

egegunes
Contributor

@egegunes egegunes commented Jan 10, 2025

CHANGE DESCRIPTION

Problem:
The operator cannot finish a smart update if the update is triggered before the replset is initialized.

Steps to reproduce:

  1. Create a new cluster.
  2. Immediately patch the cluster to trigger a smart update.
  3. The operator detects the change but waits until all pods are up before starting the smart update.
  4. Wait until all pods are up and running.
  5. The operator initializes the replset in cluster1-cfg-0 and creates users there. At this point the replset has only one member: cluster1-cfg-0.
  6. Once the replset is initialized, the operator starts the smart update.
  7. The smart update starts from the last pod, which is cluster1-cfg-2.
  8. The operator tries to connect to the pod to check whether it is primary, but the connection fails because cluster1-cfg-2 has not been added to the replset yet.

Solution:
Maintain a map of replset members in the cluster status and use it to check whether a pod is a member of the replset. If the pod is not a member, the operator doesn't need to check whether it is primary and can just update it.
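
A minimal sketch of the membership check described in the solution. The names `MemberStatus` and `needsPrimaryCheck` and the host name are hypothetical and only illustrate the idea; the operator's actual types and functions may differ.

```go
package main

import "fmt"

// MemberStatus is a hypothetical stand-in for the per-member entry
// that the operator keeps in the cluster status.
type MemberStatus struct {
	Name     string
	State    int
	StateStr string
}

// needsPrimaryCheck reports whether the operator has to connect to the pod
// and ask if it is primary before updating it. A pod that is absent from the
// members map has not joined the replset yet, so it can be updated directly.
func needsPrimaryCheck(members map[string]MemberStatus, podName string) bool {
	_, isMember := members[podName]
	return isMember
}

func main() {
	// Right after initialization only cluster1-cfg-0 is in the replset.
	members := map[string]MemberStatus{
		"cluster1-cfg-0": {Name: "cluster1-cfg-0.example:27017", State: 1, StateStr: "PRIMARY"},
	}

	// Smart update walks the pods starting from the last one.
	for _, pod := range []string{"cluster1-cfg-2", "cluster1-cfg-1", "cluster1-cfg-0"} {
		if needsPrimaryCheck(members, pod) {
			fmt.Printf("%s: replset member, check primary before update\n", pod)
		} else {
			fmt.Printf("%s: not a member yet, update directly\n", pod)
		}
	}
}
```

With this check the update of cluster1-cfg-2 in step 8 no longer depends on a connection that cannot succeed.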

Warning

Arbiter members won't show up in the members map in status. The map is keyed by pod name, which we read from the member tags in the replset configuration, and arbiters are not allowed to have tags.
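
To illustrate why arbiters are missing, here is a small sketch that keys members by their `podName` tag. `ConfigMember` and `membersByPod` are hypothetical, simplified names and the hosts are made up; only the `podName` tag key comes from this PR.

```go
package main

import "fmt"

// ConfigMember is a hypothetical, simplified view of one entry
// in the replset configuration.
type ConfigMember struct {
	Host        string
	ArbiterOnly bool
	Tags        map[string]string
}

// membersByPod keys replset members by the value of their "podName" tag.
// Members without that tag, arbiters in particular, are left out.
func membersByPod(cfg []ConfigMember) map[string]ConfigMember {
	out := make(map[string]ConfigMember, len(cfg))
	for _, m := range cfg {
		// Reading from a nil Tags map is safe in Go and yields ok == false.
		if podName, ok := m.Tags["podName"]; ok {
			out[podName] = m
		}
	}
	return out
}

func main() {
	cfg := []ConfigMember{
		{Host: "rs0-0.example:27017", Tags: map[string]string{"podName": "cluster1-rs0-0"}},
		{Host: "rs0-1.example:27017", Tags: map[string]string{"podName": "cluster1-rs0-1"}},
		{Host: "rs0-arbiter-0.example:27017", ArbiterOnly: true}, // no tags allowed
	}
	for pod, m := range membersByPod(cfg) {
		fmt.Println(pod, "->", m.Host)
	}
}
```

Any code that consults the map therefore has to treat a missing arbiter entry as expected rather than as an error.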

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Jan 10, 2025
@egegunes egegunes marked this pull request as ready for review January 10, 2025 15:52
@egegunes egegunes requested review from gkech and removed request for inelpandzic January 10, 2025 15:52
membersLive := 0
for _, member := range rsMembers {
    switch member.State {
    case mongo.MemberStatePrimary, mongo.MemberStateSecondary, mongo.MemberStateArbiter:
Contributor

minor: Here we can also include the default case with a continue action.

Contributor Author

Is there any practical difference?

Contributor

Only readability and clarity: the default case makes it explicit that all other states are deliberately skipped. The behaviour remains the same!
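
For reference, a sketch of the loop with the explicit default case suggested in this thread. The types here are hypothetical stand-ins for the operator's `mongo` package, and incrementing the counter in the matching case is an assumption, since the diff excerpt does not show the case body. The numeric values follow MongoDB's replica set member states (PRIMARY = 1, SECONDARY = 2, ARBITER = 7).

```go
package main

import "fmt"

// MemberState is a hypothetical stand-in for the operator's member state type.
type MemberState int

const (
	MemberStatePrimary   MemberState = 1
	MemberStateSecondary MemberState = 2
	MemberStateArbiter   MemberState = 7
)

// Member is a hypothetical, reduced replset member status entry.
type Member struct {
	Name  string
	State MemberState
}

// countLive counts members that are primary, secondary, or arbiter.
func countLive(members []Member) int {
	membersLive := 0
	for _, member := range members {
		switch member.State {
		case MemberStatePrimary, MemberStateSecondary, MemberStateArbiter:
			membersLive++ // assumed body; not shown in the review excerpt
		default:
			// Every other state is skipped on purpose.
			continue
		}
	}
	return membersLive
}

func main() {
	members := []Member{
		{Name: "rs0-0", State: MemberStatePrimary},
		{Name: "rs0-1", State: MemberStateSecondary},
		{Name: "rs0-2", State: 5}, // STARTUP2, not counted
	}
	fmt.Println("live members:", countLive(members))
}
```

As noted in the thread, the default branch changes nothing at runtime; it only documents that the omission is deliberate.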

Comment on lines 525 to 528
podName, ok := tags["podName"]
if !ok {
    continue
}
Contributor

@gkech gkech Jan 10, 2025

We can write this more compactly if we want, given that podName is not used outside this block:

if podName, ok := tags["podName"]; ok {
    rsMembers[podName] = api.ReplsetMemberStatus{
        Name:     member.Name,
        State:    member.State,
        StateStr: member.StateStr,
    }
}

It is also more consistent with the style we are using in the same function.

@egegunes egegunes requested a review from gkech January 11, 2025 13:10
gkech
gkech previously approved these changes Jan 13, 2025
pooknull
pooknull previously approved these changes Jan 13, 2025
@egegunes egegunes dismissed stale reviews from pooknull and gkech via f2f512f January 13, 2025 12:35
gkech
gkech previously approved these changes Jan 13, 2025
hors
hors previously approved these changes Jan 13, 2025
@JNKPercona
Collaborator

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-fs passed
demand-backup-eks-credentials-irsa passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 55 out of 55

commit: 19b115d
image: perconalab/percona-server-mongodb-operator:PR-1781-19b115db

@hors hors self-requested a review January 14, 2025 12:33
@hors hors requested review from gkech and pooknull January 14, 2025 15:18
@hors hors merged commit afc19ed into main Jan 14, 2025
13 checks passed
@hors hors deleted the K8SPSMDB-1249 branch January 14, 2025 15:19