This repository has been archived by the owner on Dec 16, 2020. It is now read-only.

Deployment stuck - Pod in CrashLoopBackOff #47

Open
fogs opened this issue Nov 5, 2019 · 5 comments

Comments

@fogs

fogs commented Nov 5, 2019

During terraform apply my deployment gets stuck and is terminated after 10 minutes.

This is the last action being executed:

module.helm.module.tiller.kubernetes_deployment.tiller: Creating...
CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}, v1.Volume{Name:"tiller-certs", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(nil), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(0xc000a65500), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"tiller", Image:"gcr.io/kubernetes-helm/tiller:v2.11.0", Command:[]string{"/tiller"}, Args:[]string{"--storage=secret", "--tls-key=/etc/certs/tls.pem", "--tls-cert=/etc/certs/tls.crt", "--tls-ca-cert=/etc/certs/ca.crt", "--listen=localhost:44134"}, WorkingDir:"", Ports:[]v1.ContainerPort{v1.ContainerPort{Name:"tiller", HostPort:0, ContainerPort:44134, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"http", HostPort:0, ContainerPort:44135, Protocol:"TCP", HostIP:""}}, EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar{v1.EnvVar{Name:"TILLER_NAMESPACE", Value:"tiller", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"TILLER_HISTORY_MAX", Value:"0", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"TILLER_TLS_VERIFY", Value:"1", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"TILLER_TLS_ENABLE", Value:"1", ValueFrom:(*v1.EnvVarSource)(nil)}, v1.EnvVar{Name:"TILLER_TLS_CERTS", Value:"/etc/certs", ValueFrom:(*v1.EnvVarSource)(nil)}}, Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"helm-admin-token-42fkx", ReadOnly:true, MountPath:"/var/run/secrets/kubernetes.io/serviceaccount", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}, v1.VolumeMount{Name:"tiller-certs", ReadOnly:true, MountPath:"/etc/certs", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil)}}, VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(0xc000c0f560), ReadinessProbe:(*v1.Probe)(0xc000c0f5f0), 
Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"", ImagePullPolicy:"IfNotPresent", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc000690d50), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string{}, ServiceAccountName:"helm-admin", DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(0xc000690d4c), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(0xc000690d4d), SecurityContext:(*v1.PodSecurityContext)(nil), ImagePullSecrets:[]v1.LocalObjectReference{}, Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(nil)}}, Strategy:v1.DeploymentStrategy{Type:"", RollingUpdate:(*v1.RollingUpdateDeployment)(nil)}, MinReadySeconds:0, RevisionHistoryLimit:(*int32)(0xc000690d48), Paused:false, ProgressDeadlineSeconds:(*int32)(0xc000690d40)}, Status:v1.DeploymentStatus{ObservedGeneration:0, Replicas:0, UpdatedReplicas:0, ReadyReplicas:0, AvailableReplicas:0, UnavailableReplicas:0, Conditions:[]v1.DeploymentCondition(nil), CollisionCount:(*int32)(nil)}}

It looks like the tiller pod is not coming up, but I also don't see any logs from it:

root@master-1:~# kubectl --namespace tiller get pods
NAME                             READY   STATUS             RESTARTS   AGE
tiller-deploy-5c6ccfd9f6-2hcnp   0/1     CrashLoopBackOff   4          2m48s
root@master-1:~# kubectl --namespace tiller logs tiller-deploy-5c6ccfd9f6-2hcnp 
root@master-1:~# 

My helm/tiller deployment is embedded as a module and configured as follows:

module "helm" {
  source  = "gruntwork-io/helm/kubernetes"
  version = "0.6.1"
  service_account_name = "helm-admin"
  tiller_namespace = "tiller"
  resource_namespace = "helm"
  kubectl_config_path = module.hcloud.kubectl_config_path
}

Thanks for reading this!

@yorinasub17
Contributor

Can you run kubectl --namespace tiller describe pods tiller-deploy-5c6ccfd9f6-2hcnp? That includes additional event logs with further information that might point to the cause of the CrashLoopBackOff.

@fogs
Author

fogs commented Nov 6, 2019

Sure thing!

What I am seeing in there is that the readiness/liveness URLs have no hostname. On the other hand, the events mention an IP address. Not sure what to make of that.

root@master-1:~# kubectl --namespace tiller describe pods tiller-deploy-5c6ccfd9f6-2hcnp
Name:         tiller-deploy-5c6ccfd9f6-2hcnp
Namespace:    tiller
Priority:     0
Node:         node-1/###redacted public IP####
Start Time:   Tue, 05 Nov 2019 18:46:23 +0100
Labels:       app=helm
              deployment=tiller-deploy
              name=tiller
              pod-template-hash=5c6ccfd9f6
Annotations:  cni.projectcalico.org/podIP: 192.168.84.130/32
Status:       Running
IP:           192.168.84.130
IPs:
  IP:           192.168.84.130
Controlled By:  ReplicaSet/tiller-deploy-5c6ccfd9f6
Containers:
  tiller:
    Container ID:  docker://d0f56302c9799ee26986f4623090fafa3b6be8634296dff89a9529e32f65c92d
    Image:         gcr.io/kubernetes-helm/tiller:v2.11.0
    Image ID:      docker-pullable://gcr.io/kubernetes-helm/tiller@sha256:f6d8f4ab9ba993b5f5b60a6edafe86352eabe474ffeb84cb6c79b8866dce45d1
    Ports:         44134/TCP, 44135/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /tiller
    Args:
      --storage=secret
      --tls-key=/etc/certs/tls.pem
      --tls-cert=/etc/certs/tls.crt
      --tls-ca-cert=/etc/certs/ca.crt
      --listen=localhost:44134
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 06 Nov 2019 10:08:33 +0100
      Finished:     Wed, 06 Nov 2019 10:09:03 +0100
    Ready:          False
    Restart Count:  309
    Liveness:       http-get http://:44135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:44135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:    tiller
      TILLER_HISTORY_MAX:  0
      TILLER_TLS_VERIFY:   1
      TILLER_TLS_ENABLE:   1
      TILLER_TLS_CERTS:    /etc/certs
    Mounts:
      /etc/certs from tiller-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from helm-admin-token-42fkx (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  helm-admin-token-42fkx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  helm-admin-token-42fkx
    Optional:    false
  tiller-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tiller-namespace-tiller-certs
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                     From             Message
  ----     ------     ----                    ----             -------
  Warning  Unhealthy  14m (x891 over 15h)     kubelet, node-1  Readiness probe failed: Get http://192.168.84.130:44135/readiness: dial tcp 192.168.84.130:44135: connect: connection refused
  Warning  BackOff    4m39s (x3733 over 15h)  kubelet, node-1  Back-off restarting failed container
root@master-1:~# 

@yorinasub17
Contributor

What I am seeing in there is that the readiness/liveness URLs have no hostname

This is expected. In Kubernetes, you can configure liveness/readiness probes with just the port of the container (no hostname). The output you are seeing is what is rendered when only the port is specified, which is how it is configured in the module. In that case, the kubelet will try to reach the Pod using the network IP assigned to it.
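Concretely, the failing readiness probe in your events output is roughly the kubelet running the equivalent of the following from node-1 (assuming curl is available on the node); the IP and port come straight from your describe output:

curl -v http://192.168.84.130:44135/readiness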

connect: connection refused

This indicates that the kubelet process on the node is not able to reach the tiller container running on that node. This can be because either:

  • the tiller process never started, or
  • the kubelet process has network connectivity issues reaching the tiller container.

A quick check to help tell these two apart is sketched below.
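
One thing worth trying (assuming the pod name from your earlier output is still current) is explicitly asking for the output of the previously terminated container instance:

kubectl --namespace tiller logs tiller-deploy-5c6ccfd9f6-2hcnp --previous

If that is also empty, tiller most likely exits before writing anything; if it shows TLS or other startup errors, the problem is inside the container rather than in the network.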

What environment are you deploying this in? What flavor of Kubernetes (EKS, GKE, AKS, OpenShift, etc.)? I noticed you are using Calico for CNI: do you have any network security policies configured?

@fogs
Author

fogs commented Nov 6, 2019

I am deploying on a bare-metal installation built with the Terraform code from https://github.com/fogs/terraform-k8s-hcloud on Hetzner Cloud. So far I have only tested firing up an nginx deployment.

Private networks and Calico have only just been added as features. I have never configured any network security policies, so it sounds like the main culprit is my environment and not your module.

Is there any way I can test that my cluster has the network connectivity this module requires?

@yorinasub17
Contributor

Is there any way I can test that my cluster has the network connectivity this module requires?

I am not aware of a good way to test this, but you can try deploying the following manifest and see if it works:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.7
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80

This creates an nginx Deployment with a liveness and a readiness probe. The goal of the test is to see whether the problem is a network issue or an issue with the pod itself.
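To run the test, you could save the manifest to a file (the name below is just a placeholder), apply it, and watch whether the pod ever becomes Ready:

kubectl apply -f nginx-probe-test.yaml
kubectl get pods -l app=nginx -w
kubectl delete deployment nginx-deployment

If the nginx pod reaches 1/1 Ready, the kubelet on that node can reach pod IPs and the probes work; if it also keeps failing its probes with connection refused, the issue is in the cluster networking rather than in this module.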
