K8SPG-619: restart backup jobs on failure #969

pooknull · 2024-12-03T00:03:24Z

https://perconadev.atlassian.net/browse/K8SPG-619

DESCRIPTION

Problem:
When the backup pod fails on its first attempt, the Job creates a new pod rather than retrying in the existing one. This behavior may not be reliable in all Kubernetes environments, because establishing communication with the Kubernetes API can be delayed.

Cause:
The backup job’s restartPolicy is set to Never, preventing the existing pod from retrying after a failure.

Solution:
Update the backup job’s restartPolicy to OnFailure and configure backoffLimit to 2. This will allow the existing pod to restart up to two times, with incremental pauses of 10 seconds and 20 seconds between retries.
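
The change can be illustrated with a minimal Go sketch. It does not reproduce the operator's actual code: backupJobSpec, allowInPlaceRetries and retryDelays are hypothetical names, and the struct is a simplified stand-in for the two Job fields mentioned above. The delay calculation assumes the pause starts at 10 seconds and doubles on each retry, which matches the 10 and 20 second figures in the description.

```go
package main

import (
	"fmt"
	"time"
)

// RestartPolicy mirrors the string values of the pod restartPolicy field.
type RestartPolicy string

const (
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
)

// backupJobSpec is a hypothetical, simplified stand-in for the two Job
// settings this PR changes; it is not the operator's real type.
type backupJobSpec struct {
	RestartPolicy RestartPolicy
	BackoffLimit  *int32
}

// allowInPlaceRetries applies the settings described in the Solution:
// restartPolicy OnFailure and backoffLimit 2.
func allowInPlaceRetries(spec *backupJobSpec) {
	limit := int32(2)
	spec.RestartPolicy = RestartPolicyOnFailure
	spec.BackoffLimit = &limit
}

// retryDelays lists the pause before each retry, assuming the delay starts
// at 10 seconds and doubles every time (an assumption of this sketch).
func retryDelays(backoffLimit int32) []time.Duration {
	var delays []time.Duration
	delay := 10 * time.Second
	for i := int32(0); i < backoffLimit; i++ {
		delays = append(delays, delay)
		delay *= 2
	}
	return delays
}

func main() {
	spec := backupJobSpec{RestartPolicy: RestartPolicyNever}
	allowInPlaceRetries(&spec)
	// prints: OnFailure 2 [10s 20s]
	fmt.Println(spec.RestartPolicy, *spec.BackoffLimit, retryDelays(*spec.BackoffLimit))
}
```

With backoffLimit set to 2, the sketch yields two pauses, so the backup gets at most three attempts in total before the Job is marked as failed.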

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are all needed new/changed options added to the Helm Chart?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PG version?
Does the change support oldest and newest supported Kubernetes version?

JNKPercona · 2024-12-10T21:58:35Z

Test name	Status
custom-extensions	passed
custom-tls	passed
demand-backup	passed
finalizers	passed
init-deploy	passed
monitoring	passed
one-pod	passed
operator-self-healing	passed
pitr	passed
scaling	passed
scheduled-backup	passed
self-healing	passed
start-from-backup	passed
tablespaces	passed
telemetry-transfer	passed
upgrade-consistency	passed
upgrade-minor	passed
users	passed
We run 18 out of 18

commit: a50d1b0
image: perconalab/percona-postgresql-operator:PR-969-a50d1b089

pooknull added 2 commits December 3, 2024 02:02

K8SPG-619: restart backup jobs on failure

09a5bff

https://perconadev.atlassian.net/browse/K8SPG-619

fix unit-test

adb83ce

pooknull marked this pull request as ready for review December 3, 2024 12:26

pooknull requested review from hors, egegunes and inelpandzic as code owners December 3, 2024 12:26

pooknull and others added 2 commits December 3, 2024 14:26

Merge branch 'main' into K8SPG-619

d109270

Merge branch 'main' into K8SPG-619

a50d1b0
