
feat: pre-check the BackupRepo by running a real job #5714

Merged
merged 5 commits into main from feature/precheck-backuprepo on Nov 3, 2023

Conversation

@zjx20 zjx20 (Contributor) commented Nov 1, 2023

Currently, if the user accidentally enters an incorrect configuration, such as an invalid access key, KubeBlocks can only discover the error when it runs a backup task.

This PR adds a pre-check for BackupRepo, which surfaces such configuration problems in advance. A BackupRepo enters the PreChecking state as soon as it is created, and a job is started that tries to access the storage backing the BackupRepo. If the access succeeds, the status of the BackupRepo is set to Ready; otherwise it is set to Failed, and the logs and events of the job are collected and saved for debugging.

If the remote storage is accessed via a CSI driver, a configuration error causes PVC provisioning to fail, leaving the job running indefinitely. To address this, a timeout mechanism is introduced: if the job does not finish within 15 minutes, the pre-check is considered failed (and the logs and events are collected at that point as well).
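
In controller terms, the flow is roughly the following (an illustrative Go sketch, not the actual implementation; the function and outcome strings are made up for this example, while the job condition types come from k8s.io/api/batch/v1):

package precheck

import (
    "time"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
)

// Mirrors the 15-minute limit described above.
const defaultPreCheckTimeout = 15 * time.Minute

// evaluatePreCheckJob maps the state of the pre-check job to a
// BackupRepo phase ("PreChecking", "Ready" or "Failed").
func evaluatePreCheckJob(job *batchv1.Job, now time.Time) string {
    for _, cond := range job.Status.Conditions {
        if cond.Status != corev1.ConditionTrue {
            continue
        }
        switch cond.Type {
        case batchv1.JobComplete:
            return "Ready"
        case batchv1.JobFailed:
            // The job's logs and events are collected here and saved
            // into the BackupRepo status for debugging.
            return "Failed"
        }
    }
    // Still running: enforce the manual timeout so that a job whose
    // PVC never provisions (e.g. a CSI misconfiguration) cannot stay
    // in PreChecking forever.
    if now.Sub(job.CreationTimestamp.Time) > defaultPreCheckTimeout {
        return "Failed"
    }
    return "PreChecking"
}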

Here is an example:

# Create an S3-based BackupRepo with wrong credentials
$ kbcli backuprepo create --provider s3 --region us-west-1 --bucket my-bucket --access-key-id BADDDD --secret-access-key WHATEVERRRRR --access-method Tool


# Monitor the status; it becomes 'Failed' after a few minutes
$ kubectl get backuprepo -w
NAME               STATUS        STORAGEPROVIDER   ACCESSMETHOD   DEFAULT   AGE
backuprepo-xhbwr   PreChecking   s3                Tool                    4s
backuprepo-xhbwr   PreChecking   s3                Tool                    3m25s
backuprepo-xhbwr   Failed        s3                Tool                    3m25s


# Check the collected information; it points to a configuration error
$ kubectl get backuprepo backuprepo-xhbwr -o jsonpath='{.status.conditions[?(@.type == "PreCheckPassed")].message}'
Pre-check job failed, information collected for diagnosis.

Job failure message: BackoffLimitExceeded:Job has reached the specified backoff limit

Logs from the pre-check job:
  + export 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/datasafed'
  + echo pre-check
  + datasafed push - /precheck.txt
  2023/11/01 06:09:00 ERROR : precheck.txt: Failed to copy: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
  	status code: 403, request id: G30M0QPAKZY483W3, host id: nL8pYwHck3ab7+53+9OrEIXReHAYTHbKUc1gSb5UZSsF6kU2I2BDO3XGQsasXoV933O64k9/fXA=
  Error: push to "/precheck.txt": SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
  	status code: 403, request id: G30M0QPAKZY483W3, host id: nL8pYwHck3ab7+53+9OrEIXReHAYTHbKUc1gSb5UZSsF6kU2I2BDO3XGQsasXoV933O64k9/fXA=

Events from Pod/default/pre-check-23989c99-backuprepo-xhbwr-4c2dd:
  Type	Reason	Age	From	Message
  ----	------	----	----	-------
  Normal	Pulling	29s	kubelet	Pulling image "apecloud/datasafed:latest"
  Normal	Scheduled	29s	default-scheduler	Successfully assigned default/pre-check-23989c99-backuprepo-xhbwr-4c2dd to minikube
  Normal	Pulled	26s	kubelet	Successfully pulled image "apecloud/datasafed:latest" in 2.735136877s (2.735152209s including waiting)
  Normal	Created	26s	kubelet	Created container dp-copy-datasafed
  Normal	Started	26s	kubelet	Started container dp-copy-datasafed
  Normal	Pulling	25s	kubelet	Pulling image "apecloud/kubeblocks-tools:latest"
  Normal	Pulled	23s	kubelet	Successfully pulled image "apecloud/kubeblocks-tools:latest" in 2.263616251s (2.263645084s including waiting)
  Normal	Created	23s	kubelet	Created container pre-check
  Normal	Started	23s	kubelet	Started container pre-check

Events from Job/default/pre-check-23989c99-backuprepo-xhbwr:
  Type	Reason	Age	From	Message
  ----	------	----	----	-------
  Normal	SuccessfulCreate	60s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-rvvqw
  Normal	SuccessfulCreate	50s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-jkspj
  Normal	SuccessfulCreate	29s	job-controller	Created pod: pre-check-23989c99-backuprepo-xhbwr-4c2dd
  Warning	BackoffLimitExceeded	0s	job-controller	Job has reached the specified backoff limit

Closes #5571.


codecov bot commented Nov 1, 2023

Codecov Report

Attention: 160 lines in your changes are missing coverage. Please review.

Comparing base (8f6e37f, 70.54%) to head (6bc75d6, 70.47%).
Report is 21 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5714      +/-   ##
==========================================
- Coverage   70.54%   70.47%   -0.07%     
==========================================
  Files         273      274       +1     
  Lines       31171    31551     +380     
==========================================
+ Hits        21990    22237     +247     
- Misses       7395     7501     +106     
- Partials     1786     1813      +27     
Flag        Coverage Δ
unittests   70.47% <69.92%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
controllers/dataprotection/types.go                      100.00% <ø> (ø)
pkg/dataprotection/utils/utils.go                        0.00% <0.00%> (ø)
pkg/dataprotection/utils/events.go                       0.00% <0.00%> (ø)
...ontrollers/dataprotection/backuprepo_controller.go    78.96% <71.95%> (-5.83%) ⬇️

... and 3 files with indirect coverage changes


@github-actions github-actions bot added the size/XXL Denotes a PR that changes 1000+ lines. label Nov 1, 2023
@zjx20 zjx20 linked an issue Nov 1, 2023 that may be closed by this pull request
@apecloud-bot apecloud-bot requested a review from realzyy November 1, 2023 07:17
finished, jobStatus, failureReason := utils.IsJobFinished(job)
if !finished {
    duration := wallClock.Since(job.CreationTimestamp.Time)
    if duration > defaultPreCheckTimeout {
        // ...

Collaborator:

Can we use the job's activeDeadlineSeconds field?

Contributor (Author):
Job and Pod both have activeDeadlineSeconds, but neither suits our scenario. If job.spec.activeDeadlineSeconds is set and the run times out, the job controller deletes the running pods directly to stop them; since the pods are deleted, we may not have a chance to collect the error logs.

Meanwhile, pod.spec.activeDeadlineSeconds can fail to trigger in some cases. As mentioned above, when the configuration is wrong, PVC provisioning fails and the pod gets stuck in the Pending state; activeDeadlineSeconds appears to start counting only from the Running state, so the pod will never fail due to the timeout.
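
For contrast, the rejected job-level alternative would look like this (a hypothetical snippet; ActiveDeadlineSeconds is a real field of batch/v1 JobSpec, but setting it is exactly what this PR avoids):

package precheck

import (
    "time"

    batchv1 "k8s.io/api/batch/v1"
)

// Rejected alternative: let the job controller enforce the timeout.
// On expiry it deletes the job's running pods to stop them, so the
// error logs needed for diagnosis would be gone before we could
// collect them.
func setJobDeadline(job *batchv1.Job) {
    deadline := int64((15 * time.Minute).Seconds())
    job.Spec.ActiveDeadlineSeconds = &deadline
}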

Collaborator:

OK. It is recommended to add this explanation to the code comments.

Contributor (Author):
Done.

@zjx20 zjx20 merged commit fc16ed9 into main Nov 3, 2023
58 checks passed
@zjx20 zjx20 deleted the feature/precheck-backuprepo branch November 3, 2023 08:01
@github-actions github-actions bot added this to the Release 0.7.0 milestone Nov 3, 2023
@zjx20 zjx20 removed this from the Release 0.7.0 milestone Nov 3, 2023
Labels
area/user-interaction feature size/XXL Denotes a PR that changes 1000+ lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Features] pre-check object storage if it is accessible in backupRepo
4 participants