New job for GitRepo is created and terminated every 3rd second #2853

Closed
1 task done
Marza opened this issue Sep 16, 2024 · 5 comments

Comments


Marza commented Sep 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have a Rancher installation (version 2.9.1); this problem likely started after upgrading from 2.8.x.
We have 3 GitRepos, but only one of them is experiencing this problem. All point to the same Git repository in Bitbucket but with different paths. We currently run on EKS 1.28 but plan to upgrade to EKS 1.29 soon.

For one of these GitRepos, a job/pod is created roughly every 3 seconds and then (usually) terminated. Sometimes the pods get stuck, and we run out of IP addresses in the subnet. The other GitRepos only see new jobs occasionally, or when changes are made to the backing Git repository.

The affected GitRepo also shows this warning/error, and we don't understand why:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

Expected Behavior

Pods are not created every 3rd second.

Steps To Reproduce

No response

Environment

- Architecture: amd64
- Fleet Version: v0.10.1
- Cluster:
  - Provider: EKS
  - Options: 4 nodes, master node running fleet-controller is c6i.12xlarge to accommodate the number of clusters. Ingress Nginx, AWS LB.
  - Kubernetes Version: 1.28

Logs

stream logs failed container "fleet" in pod "<gitjob>-f0e21-g5cjr" is waiting to start: PodInitializing for <namespace>/<gitjob>-f0e21-g5cjr (fleet)
gitcloner-initializer time="2024-09-16T14:03:58Z" level=warning msg="signal received: \"terminated\", canceling context..."
Stream closed EOF for <namespace>/<gitjob>-f0e21-g5cjr (gitcloner-initializer)

Anything else?

We see a lot of logs like this even though no changes are made to the backing Git repo in Bitbucket.

{"level":"info","ts":"2024-09-16T14:07:30Z","logger":"clustergroup-cluster-handler","msg":"Cluster changed, enqueue matching cluster groups","namespace":"<namespace>","name":"cluster-8cf77d5971e8"}
@Marza Marza added the kind/bug label Sep 16, 2024
@rancherbot rancherbot added this to Fleet Sep 16, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Sep 16, 2024
@kkaempf kkaempf added this to the v2.9.3 milestone Sep 16, 2024
Author

Marza commented Sep 27, 2024

The problem is that the GitRepo stores the latest commit hash from the backing Git repository, but the stored hash is wrong and isn't updated correctly. Initially it was blank. After re-creating the GitRepo it worked at first, but then got stuck again soon afterwards.

Since the commit hash is wrong, Fleet thinks there are changes all the time and keeps trying to trigger updates.
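
To illustrate why a stale commit hash would produce a steady stream of jobs, here is a minimal Python sketch of the behavior described above. It is not Fleet's actual code: `RepoState`, `poll_once`, and `count_jobs` are hypothetical names, and the model simply assumes that a job is triggered whenever the recorded commit differs from the remote one.

```python
from dataclasses import dataclass


@dataclass
class RepoState:
    """Hypothetical stand-in for the commit hash a GitRepo has recorded."""
    recorded_commit: str
    status_updates_work: bool = True  # assumption: False models the stuck GitRepo


def poll_once(state: RepoState, remote_commit: str) -> bool:
    """Return True if this poll would trigger a new job."""
    if state.recorded_commit == remote_commit:
        return False  # recorded hash matches the remote, nothing to do
    if state.status_updates_work:
        # Healthy case: remember the commit that was just handled.
        state.recorded_commit = remote_commit
    return True


def count_jobs(state: RepoState, remote_commit: str, polls: int) -> int:
    """Count how many jobs a series of polls would trigger."""
    return sum(poll_once(state, remote_commit) for _ in range(polls))


healthy = RepoState(recorded_commit="")
stuck = RepoState(recorded_commit="stale", status_updates_work=False)

print(count_jobs(healthy, "abc123", polls=5))  # 1
print(count_jobs(stuck, "abc123", polls=5))    # 5
```

In this model, an unchanged remote yields a single job when the recorded hash is updated, and one job per poll when it never is, which matches the pattern we see on the affected GitRepo.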

@kkaempf kkaempf modified the milestones: v2.9.3, 2.9.4 Oct 2, 2024
@kkaempf kkaempf moved this from 🆕 New to To Triage in Fleet Oct 2, 2024
Contributor

weyfonk commented Oct 7, 2024

It looks like there may be an issue with the gitjob pod using an older Fleet image, as per:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

While the gitjob pod still exists as part of Fleet controller deployments, the gitjob resource (CRD) itself has been removed in Fleet 0.10, and is no longer needed to create jobs for GitRepos.

That doesn't explain why this issue would happen for only one GitRepo, though... Do the other GitRepos live in the same management cluster as the failing one?
Which fleet container image version is in use in the gitjob pod?
What does helm list -A output on the management cluster(s)?

@weyfonk weyfonk removed their assignment Oct 7, 2024
Author

Marza commented Oct 7, 2024

Yes, all GitRepos are in the same cluster and namespace.
Fleet-controller is using rancher/fleet:v0.10.2, same version for the gitjob pods.

% helm list -A
NAME                        	NAMESPACE                      	REVISION	UPDATED                                 	STATUS  	CHART                                                                                   	APP VERSION
aws-load-balancer-controller	kube-system                    	4       	2024-05-15 15:48:06.58813 +0200 CEST    	deployed	aws-load-balancer-controller-1.7.2                                                      	v2.7.2     
external-dns                	utilities                      	6       	2024-09-05 16:20:05.551489313 +0200 CEST	deployed	external-dns-7.3.2                                                                      	0.14.1     
fleet                       	cattle-fleet-system            	20      	2024-09-27 10:18:06.896019517 +0000 UTC 	deployed	fleet-104.0.2+up0.10.2                                                                  	0.10.2     
fleet-agent-local           	cattle-fleet-local-system      	2423    	2024-09-27 10:22:57.650569942 +0000 UTC 	deployed	fleet-agent-local-v0.0.0+s-766b73b65b86b4bc4c0dffcec2736a376793eda8e9de6434b95f17156588e	           
fleet-crd                   	cattle-fleet-system            	16      	2024-09-27 10:18:00.100705846 +0000 UTC 	deployed	fleet-crd-104.0.2+up0.10.2                                                              	0.10.2     
ingress-nginx               	utilities                      	14      	2024-05-20 07:42:54.078454 +0200 CEST   	deployed	ingress-nginx-4.10.1                                                                    	1.10.1     
prometheus                  	monitoring                     	5       	2024-09-05 16:23:58.746117659 +0200 CEST	deployed	prometheus-25.8.2                                                                       	v2.48.1    
rancher                     	cattle-system                  	10      	2024-09-27 12:16:45.760551105 +0200 CEST	deployed	rancher-2.9.2                                                                           	v2.9.2     
rancher-backup              	cattle-resources-system        	1       	2022-06-08 06:32:32.270630933 +0000 UTC 	deployed	rancher-backup-2.1.2                                                                    	2.1.2      
rancher-backup-crd          	cattle-resources-system        	1       	2022-06-08 06:32:29.534008213 +0000 UTC 	deployed	rancher-backup-crd-2.1.2                                                                	2.1.2      
rancher-provisioning-capi   	cattle-provisioning-capi-system	4       	2024-09-11 10:20:47.400371031 +0000 UTC 	deployed	rancher-provisioning-capi-104.0.0+up0.3.0                                               	1.7.3      
rancher-webhook             	cattle-system                  	13      	2024-09-27 10:18:25.028320785 +0000 UTC 	deployed	rancher-webhook-104.0.2+up0.5.2                                                         	0.5.2      

Installed using Rancher 2.9.2 helm chart.
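
For anyone who wants to check the same thing on their own cluster, here is a small Python sketch that reads `helm list -A` table output and compares the app versions of the `fleet` and `fleet-crd` releases. It assumes the columns are tab-separated, as in the output above; `parse_helm_list` and `app_versions` are hypothetical helper names, and the sample contains only two rows copied from that output.

```python
def parse_helm_list(text: str) -> list[dict[str, str]]:
    """Parse tab-separated `helm list` table output into one dict per release."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("\t")]
    rows = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("\t")]
        rows.append(dict(zip(header, cells)))
    return rows


def app_versions(rows: list[dict[str, str]], names: tuple[str, ...]) -> dict[str, str]:
    """Return the APP VERSION column for the releases named in `names`."""
    return {r["NAME"]: r.get("APP VERSION", "") for r in rows if r["NAME"] in names}


# Two rows taken from the output above, with explicit tab separators.
SAMPLE = (
    "NAME\tNAMESPACE\tREVISION\tUPDATED\tSTATUS\tCHART\tAPP VERSION\n"
    "fleet\tcattle-fleet-system\t20\t2024-09-27 10:18:06.896019517 +0000 UTC\t"
    "deployed\tfleet-104.0.2+up0.10.2\t0.10.2\n"
    "fleet-crd\tcattle-fleet-system\t16\t2024-09-27 10:18:00.100705846 +0000 UTC\t"
    "deployed\tfleet-crd-104.0.2+up0.10.2\t0.10.2\n"
)

versions = app_versions(parse_helm_list(SAMPLE), ("fleet", "fleet-crd"))
print(versions)
print("versions match:", len(set(versions.values())) == 1)
```

For the two rows above, both releases report 0.10.2, so a chart/CRD version mismatch does not look like the cause here.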

Author

Marza commented Oct 7, 2024

We have upgraded to 2.9.2 since raising this issue, but the problem still exists: the main GitRepo still shows the wrong Git commit hash.

Member

manno commented Oct 23, 2024

Cleaning up the backlog, we can't reproduce this.

@manno manno closed this as completed Oct 23, 2024
@github-project-automation github-project-automation bot moved this from To Triage to ✅ Done in Fleet Oct 23, 2024