diff --git a/runbooks/source/disaster-recovery-scenarios.html.md.erb b/runbooks/source/disaster-recovery-scenarios.html.md.erb
index 7db9741e..2f2e791b 100644
--- a/runbooks/source/disaster-recovery-scenarios.html.md.erb
+++ b/runbooks/source/disaster-recovery-scenarios.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Cloud Platform Disaster Recovery Scenarios
 weight: 91
-last_reviewed_on: 2023-11-20
-review_in: 3 months
+last_reviewed_on: 2024-05-20
+review_in: 6 months
 ---
 
 # Cloud Platform Disaster Recovery Scenarios
@@ -266,3 +266,25 @@ terraform plan -target=module.starter_pack
 
 No changes. Infrastructure is up-to-date.
 ```
+
+### Resolving a PartiallyFailed backup alert
+
+A backup may fail and trigger an alert in `lower-priority-alerts`. Inspect the backup job:
+
+```
+kubectl get backup -n velero | grep -C 30 YYYYMMDD
+```
+
+You can identify the failed backup by `phase: PartiallyFailed`; there will also be an `errors` field with a count.
+
+To understand the cause of the alert, pull the error messages for the Velero pod from Kibana:
+
+```
+kubernetes.pod_name: velero- and log: "level=error"
+```
+
+Sometimes the cause of the alert can be genuine, for instance a volume may have been removed (pod restart during a backup):
+
+```
+level=error msg="Error backing up item" backup=velero/velero-allnamespacebackup-20231120090023 error="error getting volume info: rpc error: code = Unknown desc = InvalidVolume.NotFound: The volume 'vol-08d317558ab5bd46b' does not exist.\n\tstatus code: 400
+```
diff --git a/runbooks/source/tips-and-tricks.html.md.erb b/runbooks/source/tips-and-tricks.html.md.erb
index bd4aaf2f..2a2c474a 100644
--- a/runbooks/source/tips-and-tricks.html.md.erb
+++ b/runbooks/source/tips-and-tricks.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Tips and Tricks
 weight: 9200
-last_reviewed_on: 2023-10-03
-review_in: 3 months
+last_reviewed_on: 2023-05-06
+review_in: 6 months
 ---
 
 # Tips and Tricks