
[BUG] Restarting existing cluster with multiple server nodes hangs #1526

Open
alichtl opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working

alichtl commented Nov 15, 2024

What did you do

  • How was the cluster created?

    • k3d cluster create demo --servers=3 --agents=2 --k3s-arg '--kubelet-arg=feature-gates=KubeletInUserNamespace=true@server:*;agent:*'
  • What did you do afterwards?

    • k3d cluster list
      • Name: demo
      • Servers: 3/3
      • Agents: 2/2
      • Loadbalancer: true
    • k3d cluster stop demo
    • k3d cluster list
      • Name: demo
      • Servers: 0/3
      • Agents: 0/2
      • Loadbalancer: true
    • k3d cluster start demo --trace
      • TRAC[0001] NodeWaitForLogMessage: Node 'k3d-demo-server-0' waiting for log message 'Running kube-apiserver'
        • Succeeds
      • TRAC[0009] NodeWaitForLogMessage: Node 'k3d-demo-server-1' waiting for log message 'k3s is up and running' since '2024-11-15 17:02:49 +0000 UTC'
        • Hangs. Note that this time it's waiting for a different log message, which I think only appears in the logs once, when the node was first created (see the condensed repro and diagnostic sketch below).
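
For reference, the same reproduction sequence condensed into one shell block (commands copied from above; the kubelet feature-gate flag is simply what this cluster happened to be created with):

# create a 3-server / 2-agent cluster
k3d cluster create demo --servers=3 --agents=2 \
  --k3s-arg '--kubelet-arg=feature-gates=KubeletInUserNamespace=true@server:*;agent:*'
k3d cluster list                  # Servers 3/3, Agents 2/2
k3d cluster stop demo
k3d cluster list                  # Servers 0/3, Agents 0/2
k3d cluster start demo --trace    # hangs waiting on k3d-demo-server-1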

What did you expect to happen

  • Expected the cluster to come back up.

Screenshots or terminal output

  • output of k3d cluster start demo --trace
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
DEBU[0000] Runtime Info:
&{Name:docker Endpoint:/run/user/1000/podman/podman.sock Version:4.9.4-rhel OSType:linux OS:rocky Arch:amd64 CgroupVersion:2 CgroupDriver:systemd Filesystem:xfs InfoName:xxxxx}
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-agent-0
TRAC[0000] failed to parse IP '' for container '/k3d-demo-agent-0', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-serverlb
TRAC[0000] failed to parse IP '' for container '/k3d-demo-serverlb', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-server-1
TRAC[0000] failed to parse IP '' for container '/k3d-demo-server-1', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-agent-1
TRAC[0000] failed to parse IP '' for container '/k3d-demo-agent-1', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-server-2
TRAC[0000] failed to parse IP '' for container '/k3d-demo-server-2', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-demo-server-0
TRAC[0000] failed to parse IP '' for container '/k3d-demo-server-0', likely because it's not running (or restarting): ParseAddr(""): unable to parse IP
TRAC[0000] Reading path /etc/confd/values.yaml from node k3d-demo-serverlb...
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
INFO[0000] Using the k3d-tools node to gather environment information
INFO[0000] Starting new tools node...
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
TRAC[0000] Creating node from spec
&{Name:k3d-demo-tools Role:noRole Image:ghcr.io/k3d-io/k3d-tools:5.7.4 Volumes:[k3d-demo-images:/k3d/images /run/user/1000/podman/podman.sock:/run/user/1000/podman/podman.sock] Env:[] Cmd:[] Args:[noop] Files:[] Ports:map[] Restart:false Created: HostPidMode:false RuntimeLabels:map[app:k3d k3d.cluster:demo k3d.version:v5.7.4] RuntimeUlimits:[] K3sNodeLabels:map[] Networks:[k3d-demo] ExtraHosts:[host.k3d.internal:host-gateway] ServerOpts:{IsInit:false KubeAPI:<nil>} AgentOpts:{} GPURequest: Memory: State:{Running:false Status: Started:} IP:{IP:invalid IP Static:false} HookActions:[] K3dEntrypoint:false}
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
DEBU[0000] [autofix cgroupsv2] cgroupVersion: 2
TRAC[0000] Creating docker container with translated config
&{ContainerConfig:{Hostname:k3d-demo-tools Domainname: User: AttachStdin:false AttachStdout:false AttachStderr:false ExposedPorts:map[] Tty:false OpenStdin:false StdinOnce:false Env:[K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml] Cmd:[noop] Healthcheck:<nil> ArgsEscaped:false Image:ghcr.io/k3d-io/k3d-tools:5.7.4 Volumes:map[] WorkingDir: Entrypoint:[] NetworkDisabled:false MacAddress: OnBuild:[] Labels:map[app:k3d k3d.cluster:demo k3d.role:noRole k3d.version:v5.7.4] StopSignal: StopTimeout:<nil> Shell:[]} HostConfig:{Binds:[k3d-demo-images:/k3d/images /run/user/1000/podman/podman.sock:/run/user/1000/podman/podman.sock] ContainerIDFile: LogConfig:{Type: Config:map[]} NetworkMode:bridge PortBindings:map[] RestartPolicy:{Name: MaximumRetryCount:0} AutoRemove:false VolumeDriver: VolumesFrom:[] ConsoleSize:[0 0] Annotations:map[] CapAdd:[] CapDrop:[] CgroupnsMode: DNS:[] DNSOptions:[] DNSSearch:[] ExtraHosts:[host.k3d.internal:host-gateway] GroupAdd:[] IpcMode: Cgroup: Links:[] OomScoreAdj:0 PidMode: Privileged:true PublishAllPorts:false ReadonlyRootfs:false SecurityOpt:[] StorageOpt:map[] Tmpfs:map[/run: /var/run:] UTSMode: UsernsMode:host ShmSize:0 Sysctls:map[] Runtime: Isolation: Resources:{CPUShares:0 Memory:0 NanoCPUs:0 CgroupParent: BlkioWeight:0 BlkioWeightDevice:[] BlkioDeviceReadBps:[] BlkioDeviceWriteBps:[] BlkioDeviceReadIOps:[] BlkioDeviceWriteIOps:[] CPUPeriod:0 CPUQuota:0 CPURealtimePeriod:0 CPURealtimeRuntime:0 CpusetCpus: CpusetMems: Devices:[] DeviceCgroupRules:[] DeviceRequests:[] KernelMemory:0 KernelMemoryTCP:0 MemoryReservation:0 MemorySwap:0 MemorySwappiness:<nil> OomKillDisable:<nil> PidsLimit:<nil> Ulimits:[] CPUCount:0 CPUPercent:0 IOMaximumIOps:0 IOMaximumBandwidth:0} Mounts:[] MaskedPaths:[] ReadonlyPaths:[] Init:0xc0002fca3a} NetworkingConfig:{EndpointsConfig:map[k3d-demo:0xc0000e48c0]}}
DEBU[0000] Created container k3d-demo-tools (ID: 5dff412d4d9856e12ab22080326c8a9721cbf544f0ad28727b3ecea47d193ab0)
DEBU[0000] Node k3d-demo-tools Start Time: 2024-11-15 09:02:40.234061817 -0800 PST m=+0.432206265
TRAC[0000] Starting node 'k3d-demo-tools'
INFO[0000] Starting node 'k3d-demo-tools'
DEBU[0000] Truncated 2024-11-15 17:02:40.370633872 +0000 UTC to 2024-11-15 17:02:40 +0000 UTC
DEBU[0000] Deleting node k3d-demo-tools ...
TRAC[0000] [Docker] Deleted Container k3d-demo-tools
DEBU[0000] DOCKER_SOCK=/run/user/1000/podman/podman.sock
TRAC[0000] GOOS: linux / Runtime OS: linux (rocky)
INFO[0000] HostIP: using network gateway 10.89.0.1 address
INFO[0000] Starting cluster 'demo'
INFO[0000] Starting the initializing server...
DEBU[0000] >>> enabling dns magic
DEBU[0000] >>> enabling cgroupsv2 magic
DEBU[0000] >>> enabling mounts magic
DEBU[0000] Node k3d-demo-server-0 Start Time: 2024-11-15 09:02:40.612217292 -0800 PST m=+0.810361739
TRAC[0000] Node k3d-demo-server-0: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 943 bytes to /bin/k3d-entrypoint.sh (mode -rwxr--r--): Write custom k3d entrypoint script (that powers the magic fixes)
TRAC[0000] Node k3d-demo-server-0: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 1432 bytes to /bin/k3d-entrypoint-dns.sh (mode -rwxr--r--): Write entrypoint script for DNS fix
TRAC[0000] Node k3d-demo-server-0: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 1512 bytes to /bin/k3d-entrypoint-cgroupv2.sh (mode -rwxr--r--): Write entrypoint script for CGroupV2 fix
TRAC[0000] Node k3d-demo-server-0: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 124 bytes to /bin/k3d-entrypoint-mounts.sh (mode -rwxr--r--): Write entrypoint script for mounts fix
TRAC[0000] Starting node 'k3d-demo-server-0'
INFO[0000] Starting node 'k3d-demo-server-0'
DEBU[0001] Truncated 2024-11-15 17:02:40.858819108 +0000 UTC to 2024-11-15 17:02:40 +0000 UTC
DEBU[0001] Waiting for node k3d-demo-server-0 to get ready (Log: 'Running kube-apiserver')
TRAC[0001] NodeWaitForLogMessage: Node 'k3d-demo-server-0' waiting for log message 'Running kube-apiserver' since '2024-11-15 17:02:40 +0000 UTC'
TRAC[0009] Found target message `running kube-apiserver` in log line `  ime="2024-11-15T17:02:49Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --egress-selector-config-file=/var/lib/rancher/k3s/server/etc/egress-selector-config.yaml --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=true --enable-bootstrap-token-auth=true --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.current.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305 --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"`
DEBU[0009] Finished waiting for log message 'running kube-apiserver' from node 'k3d-demo-server-0'
INFO[0009] Starting servers...
DEBU[0009] >>> enabling dns magic
DEBU[0009] >>> enabling cgroupsv2 magic
DEBU[0009] >>> enabling mounts magic
DEBU[0009] Node k3d-demo-server-1 Start Time: 2024-11-15 09:02:49.453027866 -0800 PST m=+9.651172314
TRAC[0009] Node k3d-demo-server-1: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 943 bytes to /bin/k3d-entrypoint.sh (mode -rwxr--r--): Write custom k3d entrypoint script (that powers the magic fixes)
TRAC[0009] Node k3d-demo-server-1: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 1432 bytes to /bin/k3d-entrypoint-dns.sh (mode -rwxr--r--): Write entrypoint script for DNS fix
TRAC[0009] Node k3d-demo-server-1: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 1512 bytes to /bin/k3d-entrypoint-cgroupv2.sh (mode -rwxr--r--): Write entrypoint script for CGroupV2 fix
TRAC[0009] Node k3d-demo-server-1: Executing preStartAction 'WriteFileAction': [WriteFileAction] Writing 124 bytes to /bin/k3d-entrypoint-mounts.sh (mode -rwxr--r--): Write entrypoint script for mounts fix
TRAC[0009] Starting node 'k3d-demo-server-1'
INFO[0009] Starting node 'k3d-demo-server-1'
DEBU[0009] Truncated 2024-11-15 17:02:49.648195382 +0000 UTC to 2024-11-15 17:02:49 +0000 UTC
DEBU[0009] Waiting for node k3d-demo-server-1 to get ready (Log: 'k3s is up and running')
TRAC[0009] NodeWaitForLogMessage: Node 'k3d-demo-server-1' waiting for log message 'k3s is up and running' since '2024-11-15 17:02:49 +0000 UTC'
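
A quick way to test the theory above (a diagnostic sketch, assuming the rootless podman setup and the container names shown in the trace; container logs persist across stop/start):

podman logs k3d-demo-server-1 2>&1 | grep -c 'k3s is up and running'
# if the count does not grow after 'k3d cluster start', the message is never
# re-logged on restart and the wait above can never complete
podman logs k3d-demo-server-0 2>&1 | grep -c 'Running kube-apiserver'
# for comparison: this message does reappear after restart, which is why the
# wait on server-0 succeeds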

Which OS & Architecture

  • output of k3d runtime-info
arch: amd64
cgroupdriver: systemd
cgroupversion: "2"
endpoint: /run/user/1000/podman/podman.sock
filesystem: xfs
infoname: xxxxx
name: docker
os: rocky
ostype: linux
version: 4.9.4-rhel

Which version of k3d

  • output of k3d version
k3d version v5.7.4
k3s version v1.30.4-k3s1 (default)

Which version of docker

  • output of podman version
Client:       Podman Engine
Version:      4.9.4-rhel
API Version:  4.9.4-rhel
Go Version:   go1.21.13 (Red Hat 1.21.13-4.el9_4)
Built:        Mon Oct 14 01:24:19 2024
OS/Arch:      linux/amd64
  • output of podman info
host:
  arch: amd64
  buildahVersion: 1.33.8
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.10-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.10, commit: 3ea3d7f99779af0fcd69ec16c211a7dc3b4efb60'
  cpuUtilization:
    idlePercent: 99.56
    systemPercent: 0.12
    userPercent: 0.32
  cpus: 4
  databaseBackend: boltdb
  distribution:
    distribution: rocky
    version: "9.4"
  eventLogger: journald
  freeLocks: 2020
  hostname: xxxxx
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.14.0-427.33.1.el9_4.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 49497591808
  memTotal: 66851160064
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.10.0-3.el9_4.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.10.0
    package: netavark-1.10.3-1.el9.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.10.3
  ociRuntime:
    name: crun
    package: crun-1.14.3-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.14.3
      commit: 1961d211ba98f532ea52d2e80f4c20359f241a98
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: ""
    package: ""
    version: ""
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.3-1.el9.x86_64
    version: |-
      slirp4netns version 1.2.3
      commit: c22fde291bb35b354e6ca44d13be181c76a0a432
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 34342957056
  swapTotal: 34342957056
  uptime: 1236h 4m 26.00s (Approximately 51.50 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /home/xxxxx/.config/containers/storage.conf
  containerStore:
    number: 6
    paused: 0
    running: 2
    stopped: 4
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/xxxxx/.local/share/containers/storage
  graphRootAllocated: 1588357763072
  graphRootUsed: 285157888000
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 93
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/xxxxx/.local/share/containers/storage/volumes
version:
  APIVersion: 4.9.4-rhel
  Built: 1728894259
  BuiltTime: Mon Oct 14 01:24:19 2024
  GitCommit: ""
  GoVersion: go1.21.13 (Red Hat 1.21.13-4.el9_4)
  Os: linux
  OsArch: linux/amd64
  Version: 4.9.4-rhel