New job for GitRepo is created and terminated every 3rd second #2853

Marza · 2024-09-16T14:05:21Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

We run Rancher 2.9.1. The problem likely started after upgrading from 2.8.x.
We have 3 GitRepos, but only one of them is experiencing this problem. All point to the same Git repository in Bitbucket but with different paths. We currently run on EKS 1.28 but plan to upgrade to EKS 1.29 soon.

For one of these GitRepos, a job/pod is created roughly every 3 seconds and is usually terminated right away, but sometimes the pods get stuck and we run out of IP addresses in the subnet. The other GitRepos only see new jobs occasionally or when changes are made to the backing Git repository.

The affected GitRepo also shows this warning/error, and we don't understand why:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

Expected Behavior

Pods are not created every 3rd second.

Steps To Reproduce

No response

Environment

- Architecture: amd64
- Fleet Version: v0.10.1
- Cluster:
  - Provider: EKS
  - Options: 4 nodes, master node running fleet-controller is c6i.12xlarge to accommodate the number of clusters. Ingress Nginx, AWS LB.
  - Kubernetes Version: 1.28

Logs

stream logs failed container "fleet" in pod "<gitjob>-f0e21-g5cjr" is waiting to start: PodInitializing for <namespace>/<gitjob>-f0e21-g5cjr (fleet)
gitcloner-initializer time="2024-09-16T14:03:58Z" level=warning msg="signal received: \"terminated\", canceling context..."
Stream closed EOF for <namespace>/<gitjob>-f0e21-g5cjr (gitcloner-initializer)

Anything else?

We see a lot of logs like this even though no changes are made to the backing Git repo in Bitbucket.

{"level":"info","ts":"2024-09-16T14:07:30Z","logger":"clustergroup-cluster-handler","msg":"Cluster changed, enqueue matching cluster groups","namespace":"<namespace>","name":"cluster-8cf77d5971e8"}

Marza · 2024-09-27T08:12:25Z

The problem is that the GitRepo stores the latest commit hash from the backing Git repository, but the stored hash is wrong and is not updated correctly. Initially it was blank; after re-creating the GitRepo it worked at first, but got stuck again soon afterwards.

Since the commit hash is wrong, Fleet thinks there are changes all the time and keeps trying to trigger updates.
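
To illustrate the failure mode described above, here is a minimal sketch of a polling loop that compares a recorded commit hash with the remote one. This is not Fleet's actual implementation: `RepoState`, `poll_once`, and the `status_update_works` flag are hypothetical names invented for the illustration, and the commit values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RepoState:
    """Hypothetical stand-in for the commit hash a GitRepo has recorded."""
    recorded_commit: str = ""
    jobs_created: int = 0


def poll_once(state: RepoState, remote_commit: str, status_update_works: bool = True) -> bool:
    """Create a job when the remote commit differs from the recorded one.

    Returns True if a job was created during this polling cycle.
    """
    if remote_commit == state.recorded_commit:
        return False  # nothing changed, no job needed
    state.jobs_created += 1
    if status_update_works:
        # Healthy case: remember the commit so the next poll is a no-op.
        state.recorded_commit = remote_commit
    # Broken case: the recorded hash stays wrong, so every poll sees a "change".
    return True


healthy = RepoState()
stuck = RepoState(recorded_commit="stale000")  # placeholder for a wrong hash

for _ in range(5):  # five polling cycles, remote repository unchanged
    poll_once(healthy, "abc123")
    poll_once(stuck, "abc123", status_update_works=False)

print(healthy.jobs_created, stuck.jobs_created)  # 1 5
```

Under these assumptions, the healthy repo creates one job and then goes quiet, while the repo whose recorded hash never updates creates a job on every cycle, which matches the symptom of a new job every few seconds.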

weyfonk · 2024-10-07T11:47:59Z

It looks like there may be an issue with the gitjob pod using an older Fleet image, as per:

User "system:serviceaccount:cattle-fleet-system:fleet-controller" cannot create resource "gitjobs" in API group "gitjob.cattle.io" in the namespace ""

While the gitjob pod still exists as part of Fleet controller deployments, the gitjob resource (CRD) itself has been removed in Fleet 0.10, and is no longer needed to create jobs for GitRepos.

That doesn't explain why this issue would happen for only one GitRepo, though... Do other GitRepos live in the same management cluster as the failing one?
Which fleet container image version is in use in the gitjob pod?
What does helm list -A output on the management cluster(s)?

Marza · 2024-10-07T12:31:23Z

Yes, all GitRepos are in the same cluster and namespace.
Fleet-controller is using rancher/fleet:v0.10.2, same version for the gitjob pods.

% helm list -A
NAME                        	NAMESPACE                      	REVISION	UPDATED                                 	STATUS  	CHART                                                                                   	APP VERSION
aws-load-balancer-controller	kube-system                    	4       	2024-05-15 15:48:06.58813 +0200 CEST    	deployed	aws-load-balancer-controller-1.7.2                                                      	v2.7.2     
external-dns                	utilities                      	6       	2024-09-05 16:20:05.551489313 +0200 CEST	deployed	external-dns-7.3.2                                                                      	0.14.1     
fleet                       	cattle-fleet-system            	20      	2024-09-27 10:18:06.896019517 +0000 UTC 	deployed	fleet-104.0.2+up0.10.2                                                                  	0.10.2     
fleet-agent-local           	cattle-fleet-local-system      	2423    	2024-09-27 10:22:57.650569942 +0000 UTC 	deployed	fleet-agent-local-v0.0.0+s-766b73b65b86b4bc4c0dffcec2736a376793eda8e9de6434b95f17156588e	           
fleet-crd                   	cattle-fleet-system            	16      	2024-09-27 10:18:00.100705846 +0000 UTC 	deployed	fleet-crd-104.0.2+up0.10.2                                                              	0.10.2     
ingress-nginx               	utilities                      	14      	2024-05-20 07:42:54.078454 +0200 CEST   	deployed	ingress-nginx-4.10.1                                                                    	1.10.1     
prometheus                  	monitoring                     	5       	2024-09-05 16:23:58.746117659 +0200 CEST	deployed	prometheus-25.8.2                                                                       	v2.48.1    
rancher                     	cattle-system                  	10      	2024-09-27 12:16:45.760551105 +0200 CEST	deployed	rancher-2.9.2                                                                           	v2.9.2     
rancher-backup              	cattle-resources-system        	1       	2022-06-08 06:32:32.270630933 +0000 UTC 	deployed	rancher-backup-2.1.2                                                                    	2.1.2      
rancher-backup-crd          	cattle-resources-system        	1       	2022-06-08 06:32:29.534008213 +0000 UTC 	deployed	rancher-backup-crd-2.1.2                                                                	2.1.2      
rancher-provisioning-capi   	cattle-provisioning-capi-system	4       	2024-09-11 10:20:47.400371031 +0000 UTC 	deployed	rancher-provisioning-capi-104.0.0+up0.3.0                                               	1.7.3      
rancher-webhook             	cattle-system                  	13      	2024-09-27 10:18:25.028320785 +0000 UTC 	deployed	rancher-webhook-104.0.2+up0.5.2                                                         	0.5.2

Installed using Rancher 2.9.2 helm chart.
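
For anyone comparing versions across several clusters, the output above can also be checked mechanically. The following is a rough sketch, assuming the default tab-separated table printed by `helm list -A`; `chart_versions` is a hypothetical helper, and the sample rows are copied from the output above.

```python
def chart_versions(helm_list_output: str) -> dict:
    """Map release name -> (chart, app_version) from `helm list -A` table text."""
    releases = {}
    lines = [line for line in helm_list_output.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        cols = [col.strip() for col in line.split("\t")]
        if len(cols) < 7:
            continue  # not a full release row
        releases[cols[0]] = (cols[5], cols[6])
    return releases


# Rows taken from the helm output in this comment, rebuilt with explicit tabs.
sample = "\n".join([
    "\t".join(["NAME", "NAMESPACE", "REVISION", "UPDATED", "STATUS", "CHART", "APP VERSION"]),
    "\t".join(["fleet", "cattle-fleet-system", "20",
               "2024-09-27 10:18:06.896019517 +0000 UTC", "deployed",
               "fleet-104.0.2+up0.10.2", "0.10.2"]),
    "\t".join(["fleet-crd", "cattle-fleet-system", "16",
               "2024-09-27 10:18:00.100705846 +0000 UTC", "deployed",
               "fleet-crd-104.0.2+up0.10.2", "0.10.2"]),
])

versions = chart_versions(sample)
fleet_app = versions["fleet"][1]
crd_app = versions["fleet-crd"][1]
print(fleet_app == crd_app)  # True
```

Here the `fleet` and `fleet-crd` releases report the same app version, so a chart/CRD version mismatch does not appear to be the cause.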

Marza · 2024-10-07T12:32:50Z

We have upgraded to 2.9.2 since raising this issue, but the problem still exists: the main GitRepo still shows the wrong Git commit hash.

manno · 2024-10-23T13:43:16Z

Cleaning up the backlog, we can't reproduce this.

Marza added the kind/bug label Sep 16, 2024

rancherbot added this to Fleet Sep 16, 2024

github-project-automation bot moved this to 🆕 New in Fleet Sep 16, 2024

kkaempf added this to the v2.9.3 milestone Sep 16, 2024

kkaempf modified the milestones: v2.9.3, 2.9.4 Oct 2, 2024

kkaempf moved this from 🆕 New to To Triage in Fleet Oct 2, 2024

kkaempf assigned weyfonk Oct 2, 2024

kkaempf added the area/performance label Oct 2, 2024

weyfonk removed their assignment Oct 7, 2024

manno closed this as completed Oct 23, 2024

github-project-automation bot moved this from To Triage to ✅ Done in Fleet Oct 23, 2024
