fix: stuck hook issue when a Job resource has a ttlSecondsAfterFinished field set #646

Merged
merged 10 commits into from
Feb 7, 2025

Conversation

Contributor

@dejanzele dejanzele commented Dec 4, 2024

This is a proof-of-concept PR which would fix argoproj/argo-cd#21055

More info on the issue can be found in the linked GitHub issue.

The idea is to add a finalizer to Job resources that have ttlSecondsAfterFinished set, and to remove it once Argo CD detects that the hook has completed.

A simpler approach would be to unset ttlSecondsAfterFinished, but that would cause the live state to drift from the defined state.

The proposed solution is to attach a finalizer to all hook tasks and remove it once Argo CD acknowledges, during the sync phase, that the hook task has completed.
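
To make the mechanism concrete, here is a minimal standalone sketch of why a finalizer keeps a finished hook observable. It is not the gitops-engine code: the `object` type, the `gone` helper, and the finalizer name `example.com/hook-finalizer` are hypothetical and only model the Kubernetes rule that an object with a pending deletion is removed once its finalizer list is empty.

```go
package main

import "fmt"

// hookFinalizer is a hypothetical finalizer name used only in this sketch.
const hookFinalizer = "example.com/hook-finalizer"

// object is a simplified stand-in for a Kubernetes resource's metadata.
type object struct {
	Name              string
	Finalizers        []string
	DeletionRequested bool // models metadata.deletionTimestamp being set
}

// addFinalizer appends name if it is absent and reports whether obj changed.
func addFinalizer(obj *object, name string) bool {
	for _, f := range obj.Finalizers {
		if f == name {
			return false
		}
	}
	obj.Finalizers = append(obj.Finalizers, name)
	return true
}

// removeFinalizer drops name if it is present and reports whether obj changed.
func removeFinalizer(obj *object, name string) bool {
	kept := make([]string, 0, len(obj.Finalizers))
	mutated := false
	for _, f := range obj.Finalizers {
		if f == name {
			mutated = true
			continue
		}
		kept = append(kept, f)
	}
	if mutated {
		obj.Finalizers = kept
	}
	return mutated
}

// gone models the deletion rule: an object disappears only when deletion
// has been requested and no finalizers remain.
func gone(obj *object) bool {
	return obj.DeletionRequested && len(obj.Finalizers) == 0
}

func main() {
	job := &object{Name: "pre-sync-job"}
	addFinalizer(job, hookFinalizer)

	// The TTL controller requests deletion of the finished Job.
	job.DeletionRequested = true
	fmt.Println("gone before acknowledgement:", gone(job))

	// The sync records the hook result, then releases the object.
	removeFinalizer(job, hookFinalizer)
	fmt.Println("gone after acknowledgement:", gone(job))
}
```

In this model the Job stays readable between the deletion request and the finalizer removal, which is the window the sync needs to record the hook result.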

The same scenario which is described in the linked GitHub issue passes in this PR.

I welcome any feedback, as I think a lot of people in the community would like this to be fixed, and I'd be more than happy to adapt the PR to whichever direction is best for this issue.

Comment on lines 886 to 888
if job.Spec.TTLSecondsAfterFinished == nil {
	return nil
}
Member

What would be the downside(s) to always using a finalizer instead of only using one when this field is set?

Contributor Author

That is a good question. The logic would be simpler and more generic if all hooks had a finalizer, and cleanup would also be simpler.

Comment on lines 871 to 872
// processJobHookTask processes a hook task where the target object is a Job and has defined ttlSecondsAfterFinished.
// This addresses the issue where a Job with a ttlSecondsAfterFinished set to a low value gets deleted fast and the hook phase gets stuck.
Member

Should the finalizer feature be limited to just Jobs? I understand that Argo Workflows can also exhibit the same behavior. I could imagine any resource having the same issue if some process deletes the resource before Argo CD has a chance to observe the final resource state.

Contributor Author

I thought about that, and yes, that is true.
I focused only on the immediate issue as a PoC, but I am also in favour of a generic solution for the scenario where some external process deletes a resource during the hook phase.

@dejanzele dejanzele force-pushed the fix/job-ttl-stuck-hook branch 4 times, most recently from 376d9d0 to 57d66fd Compare December 9, 2024 16:08
}
}
if mutated {
task.targetObj.SetFinalizers(finalizers)
Contributor Author

Should I set it at all for the targetObj?

Member

At a glance, it's not clear to me... would you mind investigating and documenting what liveObj and targetObj are on the syncTask struct?

@dejanzele
Contributor Author

@crenshaw-dev I have updated the PR to make the solution more generic.

This use case should also be covered by an integration test, but I guess that test would live in the argo-cd repository?

@crenshaw-dev
Member

@dejanzele yep! You can open a PR on the argo-cd repo temporarily replacing gitops-engine in go.mod with your fork and revision.
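
As a rough illustration of that suggestion, a temporary `replace` directive in the argo-cd `go.mod` could look like the fragment below. The fork path and version are placeholders (assumptions, not values from this PR); only the module path `github.com/argoproj/gitops-engine` is real.

```go.mod
// Temporary: point gitops-engine at a fork while testing.
// <your-user> and <pseudo-version> are placeholders to fill in.
replace github.com/argoproj/gitops-engine => github.com/<your-user>/gitops-engine <pseudo-version>
```

The directive is meant to be dropped again once the gitops-engine change is merged and a regular version can be required.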

// In that case, we need to get the latest version of the object and retry the update.
return retry.RetryOnConflict(retry.DefaultRetry, func() error {
	updateErr := sc.updateResource(task)
	if apierr.IsConflict(updateErr) {
Contributor Author

Conflicts happen quite often, and without retries the E2E tests were very flaky.
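
The refetch-and-retry pattern can be sketched without any Kubernetes dependency. The snippet in the diff uses `retry.RetryOnConflict` from client-go; the version below is a simplified stand-in in which the `store`, `resource`, `retryOnConflict`, and `addFinalizerWithRetry` names are all hypothetical and model optimistic concurrency with a plain version counter.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict models an update rejected because the caller's copy is stale.
var errConflict = errors.New("conflict: resource version is stale")

type resource struct {
	Version    int
	Finalizers []string
}

// store is a toy API server that enforces optimistic concurrency.
type store struct{ current resource }

// get returns a copy of the stored resource.
func (s *store) get() resource {
	r := s.current
	r.Finalizers = append([]string(nil), s.current.Finalizers...)
	return r
}

// update accepts r only if it was derived from the current version.
func (s *store) update(r resource) error {
	if r.Version != s.current.Version {
		return errConflict
	}
	r.Version++
	s.current = r
	return nil
}

// retryOnConflict re-runs fn up to attempts times while it returns errConflict.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// addFinalizerWithRetry applies the mutation to obj and, on conflict,
// refreshes obj from the store before the next attempt.
func addFinalizerWithRetry(s *store, obj resource, name string) error {
	return retryOnConflict(3, func() error {
		candidate := obj
		candidate.Finalizers = append(append([]string(nil), obj.Finalizers...), name)
		updateErr := s.update(candidate)
		if errors.Is(updateErr, errConflict) {
			obj = s.get() // pick up the latest version before retrying
		}
		return updateErr
	})
}

func main() {
	s := &store{}
	stale := s.get()
	_ = s.update(s.get()) // a concurrent writer bumps the version

	err := addFinalizerWithRetry(s, stale, "example.com/hook-finalizer")
	fmt.Println("error:", err, "finalizers:", s.get().Finalizers)
}
```

The mutation is reapplied to the refreshed copy on every attempt, so a concurrent write does not get overwritten by stale data.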

@dejanzele dejanzele force-pushed the fix/job-ttl-stuck-hook branch from f02987e to 4e93b3e Compare December 25, 2024 16:02
@dejanzele dejanzele requested a review from a team as a code owner December 25, 2024 16:02

codecov bot commented Dec 25, 2024

Codecov Report

Attention: Patch coverage is 46.75325% with 41 lines in your changes missing coverage. Please review.

Project coverage is 53.35%. Comparing base (8849c3f) to head (4afcb84).
Report is 28 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/sync/sync_context.go | 50.72% | 29 Missing and 5 partials ⚠️ |
| pkg/sync/hook/hook.go | 0.00% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #646      +/-   ##
==========================================
- Coverage   54.26%   53.35%   -0.92%     
==========================================
  Files          64       64              
  Lines        6164     6410     +246     
==========================================
+ Hits         3345     3420      +75     
- Misses       2549     2714     +165     
- Partials      270      276       +6     


Signed-off-by: Alexandre Gaudreault <[email protected]>
Member

@agaudreault agaudreault left a comment

Tested and updated with latest master. LGTM!

@agaudreault agaudreault merged commit 65db274 into argoproj:master Feb 7, 2025
5 checks passed
Aaron-9900 pushed a commit to Aaron-9900/gitops-engine that referenced this pull request Feb 12, 2025
…ed field set (argoproj#646)

Signed-off-by: Dejan Zele Pejchev <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Co-authored-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Aaron Hoffman <[email protected]>
Development

Successfully merging this pull request may close these issues.

Stuck hooks issue when a sync tasks contains a Job resource with a ttlSecondsAfterFinished field set
3 participants