feat: pre-check the BackupRepo by running a real job #5714
Conversation
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #5714 +/- ##
==========================================
- Coverage 70.54% 70.47% -0.07%
==========================================
Files 273 274 +1
Lines 31171 31551 +380
==========================================
+ Hits 21990 22237 +247
- Misses 7395 7501 +106
- Partials 1786 1813 +27
finished, jobStatus, failureReason := utils.IsJobFinished(job)
if !finished {
	duration := wallClock.Since(job.CreationTimestamp.Time)
	if duration > defaultPreCheckTimeout {
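For context, here is a minimal, self-contained sketch of the completion-plus-timeout pattern used above, assuming only the upstream k8s.io/api types; isJobFinished below is a simplified stand-in for the project's utils.IsJobFinished helper, not the actual implementation.

package backuprepo

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

const defaultPreCheckTimeout = 15 * time.Minute

// isJobFinished reports whether the Job has reached a terminal condition
// (Complete or Failed) and returns that condition type and failure message.
func isJobFinished(job *batchv1.Job) (bool, batchv1.JobConditionType, string) {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true, c.Type, c.Message
		}
	}
	return false, "", ""
}

// preCheckTimedOut reports whether an unfinished job has exceeded the
// pre-check timeout, measured from the job's creation timestamp so that a
// pod stuck in Pending still counts toward the deadline.
func preCheckTimedOut(job *batchv1.Job, now time.Time) bool {
	if finished, _, _ := isJobFinished(job); finished {
		return false
	}
	return now.Sub(job.CreationTimestamp.Time) > defaultPreCheckTimeout
}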
Can we use the job's activeDeadlineSeconds field?
Job and Pod both have activeDeadlineSeconds, but neither of them is suitable for our scenario. If job.spec.activeDeadlineSeconds is set and the run times out, the job controller deletes the running pods directly to stop them; since the pods are deleted, we may not have time to collect the error logs.
Meanwhile, pod.spec.activeDeadlineSeconds may fail to trigger in some cases. As mentioned before, when the configuration is wrong, the PVC provisioning will fail, which leaves the pod stuck in the Pending state, but activeDeadlineSeconds seems to start counting only from the Running state, so the pod will not fail due to timeout.
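For reference, a short sketch of where the two fields discussed above live in the API objects; the job spec and container below are illustrative placeholders, not the actual pre-check job.

package backuprepo

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// preCheckJobWithDeadlines shows the two activeDeadlineSeconds fields
// discussed above; the pod template content is a placeholder.
func preCheckJobWithDeadlines() *batchv1.Job {
	deadline := int64(900) // 15 minutes
	return &batchv1.Job{
		Spec: batchv1.JobSpec{
			// Job-level deadline: when it expires, the job controller deletes
			// the running pods, so their logs may be lost before collection.
			ActiveDeadlineSeconds: &deadline,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// Pod-level deadline: per the discussion above, it only
					// counts once the pod is Running, so a pod stuck Pending
					// on PVC provisioning never hits this timeout.
					ActiveDeadlineSeconds: &deadline,
					RestartPolicy:         corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "pre-check", Image: "busybox"},
					},
				},
			},
		},
	}
}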
OK, it is recommended to add this description to the code comments.
Done.
Currently, if the user accidentally fills in a wrong configuration, such as an invalid access key, KubeBlocks can only discover this error when executing a backup task.
This PR adds a pre-check for BackupRepo, which can expose such configuration problems in advance. A BackupRepo enters the PreChecking state as soon as it is created, and a job is started to try to access the storage backing the BackupRepo. If the access is successful, the status of the BackupRepo is set to Ready; otherwise, the status is set to Failed, and the logs and events of the job are collected and saved for debugging.
If the remote storage is accessed through a CSI driver, a configuration error will cause the PVC to fail to provision, leaving the job running indefinitely. To address this problem, a timeout mechanism is introduced: if the job does not finish within 15 minutes, the pre-check is considered to have failed (and the logs and events are also collected in this case).
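As a rough sketch of the state transitions described above (the phase names and helper types here are illustrative, not the actual KubeBlocks API types):

package backuprepo

import "time"

// Illustrative phase names; the real BackupRepo API defines its own constants.
type repoPhase string

const (
	phasePreChecking repoPhase = "PreChecking"
	phaseReady       repoPhase = "Ready"
	phaseFailed      repoPhase = "Failed"

	preCheckTimeout = 15 * time.Minute
)

// preCheckState is a hypothetical summary of the pre-check job's state.
type preCheckState struct {
	finished  bool      // the job reached a terminal condition
	succeeded bool      // the job completed successfully
	createdAt time.Time // when the pre-check job was created
}

// nextPhase maps the pre-check job's state to the BackupRepo phase:
// Ready on success, Failed on failure or timeout (logs and events would be
// collected before that transition), and PreChecking otherwise.
func nextPhase(s preCheckState, now time.Time) repoPhase {
	switch {
	case s.finished && s.succeeded:
		return phaseReady
	case s.finished || now.Sub(s.createdAt) > preCheckTimeout:
		return phaseFailed
	default:
		return phasePreChecking
	}
}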
Here is an example:
Close #5571.