-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PB-8410 Incorporate the logic not delete the restore job pods when mo… #402
Conversation
License Evaluation Results:
Total License Issues: 17 |
98b146f
to
21aeaf5
Compare
License Evaluation Results:
Total License Issues: 17 |
21aeaf5
to
153f6b8
Compare
License Evaluation Results:
Total License Issues: 17 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
153f6b8
to
a9eb631
Compare
License Evaluation Results:
Total License Issues: 17 |
pkg/drivers/utils/utils.go
Outdated
timeOut = PxbDefaultJobFailureRetryTimeout | ||
} | ||
} | ||
// skipping error cos we know we will never fail here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit :: We can remove this comment as this is moot
pkg/drivers/utils/utils.go
Outdated
@@ -79,6 +79,10 @@ const ( | |||
kopiaBackupString = "kopiaexecutor backup" | |||
// if providerType in node spec has this string then it is GCP hosted cluster | |||
GCPBasedClusterString = "gce://" | |||
//PxbJobFailureRetryTimeoutKey defines timeout key name to be set after job failure due to mount failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Space after //
pkg/drivers/utils/utils.go
Outdated
// returns error if configMap cannot be read from k8s | ||
// returns error if key is not found in configMap | ||
// returns value of the key if key is found in configMap | ||
func GetDataFromConfigMap(name, namespace, key string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we re-use existing API GetConfigValue()?
pkg/drivers/utils/utils.go
Outdated
// we could fail here if the value set is invalid or has some junk charectors. | ||
duration, err := time.ParseDuration(timeOut + "s") | ||
if err != nil || duration <= 0 { | ||
logrus.Errorf("invalid %v value set. Should be numberic value > 0. Setting to default limit", PxbJobFailureRetryTimeoutKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add fn in the above debug msg as well?
pkg/drivers/utils/utils.go
Outdated
logrus.Debugf("%v:failed retry limit not found. Setting to default: %v", fn, err) | ||
timeOut = PxbDefaultJobFailureRetryTimeout | ||
} else { | ||
// we could fail here if the value set is invalid or has some junk charectors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: charectors
@vsundarraj-px Should we add it for nfsDelete and kopiaDelete as well? |
|
…unt failure occurs within 5 mins Signed-off-by: vsundarraj-px <[email protected]>
a9eb631
to
58b8ad2
Compare
License Evaluation Results:
Total License Issues: 17 |
This PR is closed as the PR #404 |
…unt failure occurs within 5 mins
What this PR does / why we need it:
This PR handles Job pod mount failures which are transient. Currently while checking for job status and if we find that mount pvc mount has failed, we immediately kill the jobpod and return error. We noticed that this pvc mount failure are transient errors and mount actually succeeds in next few seconds. Hence, this PR enables a timeout settings through KDMP configMap setting. When set, if mount failure should occur, will not fail immediately but wait till the timout set and if still the mount failure occurs then the job is terminated with error.
Which issue(s) this PR fixes (optional)
Closes # PB-8410
Special notes for your reviewer:
Unit testing done
NFS Restore
NFS Backup
KDMP Backup
KDMP Restore
Since the issue is not easily reproducible, the unit testing is performed with a debug image by forcefully setting mountFailure