
feat: pre-check the BackupRepo by running a real job #5714

Merged
merged 5 commits into main from feature/precheck-backuprepo on Nov 3, 2023

Conversation

@zjx20 zjx20 (Contributor) commented Nov 1, 2023

Currently, if the user accidentally enters an incorrect configuration, such as an invalid access key, KubeBlocks can only discover the error when it runs a backup task.

This PR adds a pre-check for BackupRepo, which surfaces such configuration problems in advance. A BackupRepo enters the PreChecking state as soon as it is created, and a job is started that tries to access the storage backing the BackupRepo. If the access succeeds, the status of the BackupRepo is set to Ready; otherwise it is set to Failed, and the logs and events of the job are collected and saved for debugging.

If the remote storage is accessed via a CSI driver, a configuration error causes PVC provisioning to fail, leaving the job running indefinitely. To address this, a timeout mechanism is introduced: if the job does not finish within 15 minutes, the pre-check is considered failed (and the logs and events are collected at that point as well).
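
In controller terms, the flow is roughly the following (an illustrative Go sketch, not the actual implementation; the function and outcome strings are made up for this example, while the job condition types come from k8s.io/api/batch/v1):

package precheck

import (
    "time"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// Mirrors the 15-minute limit described above.
const defaultPreCheckTimeout = 15 * time.Minute

// evaluatePreCheckJob maps the state of the pre-check job to a
// BackupRepo phase ("PreChecking", "Ready" or "Failed").
func evaluatePreCheckJob(job *batchv1.Job, now time.Time) string {
    for _, cond := range job.Status.Conditions {
        if cond.Status != corev1.ConditionTrue {
            continue
        }
        switch cond.Type {
        case batchv1.JobComplete:
            return "Ready"
        case batchv1.JobFailed:
            // The job's logs and events are collected here and saved
            // into the BackupRepo status for debugging.
            return "Failed"
        }
    }
    // Still running: enforce the manual timeout so that a job whose
    // PVC never provisions (e.g. a CSI misconfiguration) cannot stay
    // in PreChecking forever.
    if now.Sub(job.CreationTimestamp.Time) > defaultPreCheckTimeout {
        return "Failed"
    }
    return "PreChecking"
}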

Here is an example:

# Create an S3-based BackupRepo with wrong credentials
$ kbcli backuprepo create --provider s3 --region us-west-1 --bucket my-bucket --access-key-id BADDDD --secret-access-key WHATEVERRRRR --access-method Tool


# Monitor the status; it becomes 'Failed' after a few minutes
$ kubectl get backuprepo -w
NAME               STATUS        STORAGEPROVIDER   ACCESSMETHOD   DEFAULT   AGE
backuprepo-xhbwr   PreChecking   s3                Tool                    4s
backuprepo-xhbwr   PreChecking   s3                Tool                    3m25s
backuprepo-xhbwr   Failed        s3                Tool                    3m25s


# Check the collected information; it points to a configuration error
$ kubectl get backuprepo backuprepo-xhbwr -o jsonpath='{.status.conditions[?(@.type == "PreCheckPassed")].message}'
Pre-check job failed, information collected for diagnosis.

Job failure message: BackoffLimitExceeded:Job has reached the specified backoff limit

Logs from the pre-check job:
  + export 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/datasafed'
  + echo pre-check
  + datasafed push - /precheck.txt
  2023/11/01 06:09:00 ERROR : precheck.txt: Failed to copy: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
  	status code: 403, request id: G30M0QPAKZY483W3, host id: nL8pYwHck3ab7+53+9OrEIXReHAYTHbKUc1gSb5UZSsF6kU2I2BDO3XGQsasXoV933O64k9/fXA=
  Error: push to "/precheck.txt": SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
  	status code: 403, request id: G30M0QPAKZY483W3, host id: nL8pYwHck3ab7+53+9OrEIXReHAYTHbKUc1gSb5UZSsF6kU2I2BDO3XGQsasXoV933O64k9/fXA=

Events from Pod/default/pre-check-23989c99-backuprepo-xhbwr-4c2dd:
  Type	Reason	Age	From	Message
  ----	------	----	----	-------
  Normal	Pulling	29s	kubelet	Pulling image "apecloud/datasafed:latest"
  Normal	Scheduled	29s	default-scheduler	Successfully assigned default/pre-check-23989c99-backuprepo-xhbwr-4c2dd to minikube
  Normal	Pulled	26s	kubelet	Successfully pulled image "apecloud/datasafed:latest" in 2.735136877s (2.735152209s including waiting)
  Normal	Created	26s	kubelet	Created container dp-copy-datasafed
  Normal	Started	26s	kubelet	Started container dp-copy-datasafed
  Normal	Pulling	25s	kubelet	Pulling image "apecloud/kubeblocks-tools:latest"
  Normal	Pulled	23s	kubelet	Successfully pulled image "apecloud/kubeblocks-tools:latest" in 2.263616251s (2.263645084s including waiting)
  Normal	Created	23s	kubelet	Created container pre-check
  Normal	Started	23s	kubelet	Started container pre-check

Events from Job/default/pre-check-23989c99-backuprepo-xhbwr:
  Type	Reason	Age	From	Message
  ----	------	----	----	-------
  Normal	SuccessfulCreate	60s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-rvvqw
  Normal	SuccessfulCreate	50s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-jkspj
  Normal	SuccessfulCreate	29s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-4c2dd
  Warning	BackoffLimitExceeded	0s	job-controller	Job has reached the specified backoff limit

Closes #5571.


codecov bot commented Nov 1, 2023

Codecov Report

Attention: 160 lines in your changes are missing coverage. Please review.

Comparing base (8f6e37f, 70.54%) to head (6bc75d6, 70.47%).
Report is 21 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5714      +/-   ##
==========================================
- Coverage   70.54%   70.47%   -0.07%     
==========================================
  Files         273      274       +1     
  Lines       31171    31551     +380     
==========================================
+ Hits        21990    22237     +247     
- Misses       7395     7501     +106     
- Partials     1786     1813      +27     
Flag        Coverage Δ
unittests   70.47% <69.92%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
controllers/dataprotection/types.go                      100.00% <ø> (ø)
pkg/dataprotection/utils/utils.go                        0.00% <0.00%> (ø)
pkg/dataprotection/utils/events.go                       0.00% <0.00%> (ø)
...ontrollers/dataprotection/backuprepo_controller.go    78.96% <71.95%> (-5.83%) ⬇️

... and 3 files with indirect coverage changes


@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines. label Nov 1, 2023
@zjx20 zjx20 linked an issue Nov 1, 2023 that may be closed by this pull request
@apecloud-bot apecloud-bot requested a review from realzyy November 1, 2023 07:17
finished, jobStatus, failureReason := utils.IsJobFinished(job)
if !finished {
    duration := wallClock.Since(job.CreationTimestamp.Time)
    if duration > defaultPreCheckTimeout {
        // ...

Collaborator:

Can we use the job's activeDeadlineSeconds field?

Contributor (Author):
Job and Pod both have activeDeadlineSeconds, but neither suits our scenario. If job.spec.activeDeadlineSeconds is set and the run times out, the job controller deletes the running pods directly to stop them; since the pods are deleted, we may not have a chance to collect the error logs.

Meanwhile, pod.spec.activeDeadlineSeconds can fail to trigger in some cases. As mentioned above, when the configuration is wrong, PVC provisioning fails and the pod gets stuck in the Pending state; activeDeadlineSeconds appears to start counting only from the Running state, so the pod will never fail due to the timeout.
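
For contrast, the rejected job-level alternative would look like this (a hypothetical snippet; ActiveDeadlineSeconds is a real field of batch/v1 JobSpec, but setting it is exactly what this PR avoids):

package precheck

import (
    "time"

    batchv1 "k8s.io/api/batch/v1"
)

// Rejected alternative: let the job controller enforce the timeout.
// On expiry it deletes the job's running pods to stop them, so the
// error logs needed for diagnosis would be gone before we could
// collect them.
func setJobDeadline(job *batchv1.Job) {
    deadline := int64((15 * time.Minute).Seconds())
    job.Spec.ActiveDeadlineSeconds = &deadline
}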

Collaborator:

OK. It is recommended to add this explanation to the code comments.

Contributor (Author):
Done.

@zjx20 zjx20 merged commit fc16ed9 into main Nov 3, 2023
58 checks passed
@zjx20 zjx20 deleted the feature/precheck-backuprepo branch November 3, 2023 08:01
@github-actions github-actions bot added this to the Release 0.7.0 milestone Nov 3, 2023
@zjx20 zjx20 removed this from the Release 0.7.0 milestone Nov 3, 2023
Labels
area/user-interaction feature size/XXL Denotes a PR that changes 1000+ lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Features] pre-check object storage if it is accessible in backupRepo
4 participants