[BUG] risingwave cluster component connector keeps restarting on GKE #5459

Closed
JashBook opened this issue Oct 16, 2023 · 4 comments · Fixed by #5712
Labels
bug kind/bug Something isn't working severity/major Great chance user will encounter the same problem

@JashBook
Collaborator

JashBook commented Oct 16, 2023

Describe the bug
The connector component of the RisingWave cluster keeps restarting on GKE:
Readiness probe failed: dial tcp 10.0.31.101:50051: connect: connection refused

kbcli version
Kubernetes: v1.26.7-gke.500
KubeBlocks: 0.7.0-beta.4
kbcli: 0.7.0-beta.4

To Reproduce
Steps to reproduce the behavior:

  1. create etcd cluster
  2. create risingwave cluster
  3. See error
kubectl get pod 
NAME                                            READY   STATUS             RESTARTS      AGE
etcd-akwclj-etcd-0                              4/4     Running            0             6m17s
etcd-akwclj-etcd-1                              4/4     Running            0             6m17s
etcd-akwclj-etcd-2                              4/4     Running            0             6m17s
rswave-akwclj-compactor-0                       1/1     Running            0             10m
rswave-akwclj-compute-0                         1/1     Running            0             10m
rswave-akwclj-connector-0                       0/1     CrashLoopBackOff   6 (94s ago)   10m
rswave-akwclj-frontend-0                        1/1     Running            0             10m
rswave-akwclj-frontend-1                        1/1     Running            0             9m45s
rswave-akwclj-meta-0                            1/1     Running            0             10m

Logs of the failing pod

kubectl logs rswave-akwclj-connector-0 
2023-10-16 06:25:50,717 INFO  [main] connector.ConnectorService:57 - Server started, listening on 50051
2023-10-16 06:25:52,513 INFO  [main] connector.ConnectorService:64 - Prometheus metrics server started, listening on 50052

Describe output of the failing pod

kubectl describe pod rswave-akwclj-connector-0 
Name:             rswave-akwclj-connector-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             gke-cicd-gke-test-cicd-gke-test-3328853d-pwrz/10.128.0.73
Start Time:       Mon, 16 Oct 2023 14:17:42 +0800
Labels:           app.kubernetes.io/component=connector
                  app.kubernetes.io/instance=rswave-akwclj
                  app.kubernetes.io/managed-by=kubeblocks
                  app.kubernetes.io/name=risingwave
                  app.kubernetes.io/version=risingwave-v1.0.0
                  apps.kubeblocks.io/component-name=connector
                  apps.kubeblocks.io/workload-type=Stateless
                  controller-revision-hash=rswave-akwclj-connector-6b7bbb8d9c
                  statefulset.kubernetes.io/pod-name=rswave-akwclj-connector-0
Annotations:      apps.kubeblocks.io/component-replicas: 1
Status:           Running
IP:               10.0.31.101
IPs:
  IP:           10.0.31.101
Controlled By:  StatefulSet/rswave-akwclj-connector
Containers:
  connector:
    Container ID:  containerd://ba40b48be13287e3408cd3ce3c533c45ff4f254a07cb9025a0d91a39eddfd6e2
    Image:         ghcr.io/risingwavelabs/risingwave:v1.0.0
    Image ID:      ghcr.io/risingwavelabs/risingwave@sha256:d85922cbe904b794aee102a25304c3cd181cfc792dd190d87de011621cfb49fb
    Ports:         50051/TCP, 50052/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /risingwave/bin/connector-node/start-service.sh
    Args:
      -p
      50051
    State:          Running
      Started:      Mon, 16 Oct 2023 14:28:55 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 16 Oct 2023 14:25:14 +0800
      Finished:     Mon, 16 Oct 2023 14:26:12 +0800
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     100m
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   1Gi
    Liveness:   tcp-socket :svc delay=5s timeout=30s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :svc delay=5s timeout=30s period=10s #success=1 #failure=3
    Environment Variables from:
      rswave-akwclj-connector-env      ConfigMap  Optional: false
      rswave-akwclj-connector-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:                        rswave-akwclj-connector-0 (v1:metadata.name)
      KB_POD_UID:                          (v1:metadata.uid)
      KB_NAMESPACE:                       default (v1:metadata.namespace)
      KB_SA_NAME:                          (v1:spec.serviceAccountName)
      KB_NODENAME:                         (v1:spec.nodeName)
      KB_HOST_IP:                          (v1:status.hostIP)
      KB_POD_IP:                           (v1:status.podIP)
      KB_POD_IPS:                          (v1:status.podIPs)
      KB_HOSTIP:                           (v1:status.hostIP)
      KB_PODIP:                            (v1:status.podIP)
      KB_PODIPS:                           (v1:status.podIPs)
      KB_CLUSTER_NAME:                    rswave-akwclj
      KB_COMP_NAME:                       connector
      KB_CLUSTER_COMP_NAME:               rswave-akwclj-connector
      KB_CLUSTER_UID_POSTFIX_8:           6dab9643
      KB_POD_FQDN:                        $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      AWS_ACCESS_KEY_ID:                  ***
      AWS_REGION:                         ***
      AWS_SECRET_ACCESS_KEY:              ***
      RW_DATA_DIRECTORY:                  risingwave-rswave-akwclj
      RW_ETCD_AUTH:                       false
      RW_ETCD_ENDPOINTS:                  etcd-akwclj-etcd.default.svc.cluster.local:2379
      RW_S3_ENDPOINT:                     https://s3.***.amazonaws.com
      RW_STATE_STORE:                     hummock+s3://***
      RW_CONNECTOR_NODE_PROMETHEUS_PORT:  50052
    Mounts:
      /risingwave/config from risingwave-configuration (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hgmr8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  risingwave-configuration:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rswave-akwclj-connector-risingwave-configuration
    Optional:  false
  risingwave-connector-envs:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      rswave-akwclj-connector-risingwave-connector-envs
    Optional:  false
  kube-api-access-hgmr8:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Guaranteed
Node-Selectors:               <none>
Tolerations:                  kb-data=true:NoSchedule
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=rswave-akwclj,apps.kubeblocks.io/component-name=connector
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  11m                   default-scheduler  Successfully assigned default/rswave-akwclj-connector-0 to gke-cicd-gke-test-cicd-gke-test-3328853d-pwrz
  Warning  Unhealthy  9m45s (x6 over 11m)   kubelet            Liveness probe failed: dial tcp 10.0.31.101:50051: connect: connection refused
  Normal   Killing    9m45s (x2 over 10m)   kubelet            Container connector failed liveness probe, will be restarted
  Normal   Pulled     9m15s (x3 over 11m)   kubelet            Container image "ghcr.io/risingwavelabs/risingwave:v1.0.0" already present on machine
  Normal   Created    9m15s (x3 over 11m)   kubelet            Created container connector
  Normal   Started    9m15s (x3 over 11m)   kubelet            Started container connector
  Warning  Unhealthy  6m5s (x24 over 11m)   kubelet            Readiness probe failed: dial tcp 10.0.31.101:50051: connect: connection refused
  Warning  BackOff    69s (x20 over 5m15s)  kubelet            Back-off restarting failed container connector in pod rswave-akwclj-connector-0_default(af502c01-a8dc-4a59-9d77-0a26f80e6035)
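
One possible reading of the output above, offered as a sketch and not a confirmed root cause: the connector may simply start listening later than the liveness probe allows. The script below redoes that arithmetic from values copied out of the describe output and the connector log. It rests on two assumptions that the output does not confirm: that the log timestamps are in UTC (the describe output is +0800), and that the pod uses the Kubernetes default `terminationGracePeriodSeconds` of 30. The helper name `seconds_until_restart_decision` is made up for this illustration, and it ignores kubelet probe jitter.

```python
from datetime import datetime, timedelta, timezone

# The describe output prints times with a +0800 offset.
CST = timezone(timedelta(hours=8))

# "Last State: Terminated" block of the describe output.
container_started = datetime(2023, 10, 16, 14, 25, 14, tzinfo=CST)
container_finished = datetime(2023, 10, 16, 14, 26, 12, tzinfo=CST)
exit_code = 137

# First connector log line ("Server started, listening on 50051").
# Assumption: the log timestamp is UTC.
server_listening = datetime(2023, 10, 16, 6, 25, 50, tzinfo=timezone.utc)

# Liveness probe settings from the describe output.
initial_delay_s = 5
period_s = 10
failure_threshold = 3

# Assumption: Kubernetes default terminationGracePeriodSeconds.
termination_grace_s = 30


def seconds_until_restart_decision(initial_delay, period, failures):
    """Earliest time after container start at which `failures`
    consecutive probe failures can have been observed."""
    return initial_delay + (failures - 1) * period


startup_s = (server_listening - container_started).total_seconds()
kill_decision_s = seconds_until_restart_decision(
    initial_delay_s, period_s, failure_threshold
)
lifetime_s = (container_finished - container_started).total_seconds()
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
exit_signal = exit_code - 128

print(f"port 50051 opened after        {startup_s:.0f}s")
print(f"liveness restart decision at   {kill_decision_s}s")
print(f"decision + grace period        {kill_decision_s + termination_grace_s}s")
print(f"observed container lifetime    {lifetime_s:.0f}s")
print(f"exit signal                    {exit_signal}")
```

Under these assumptions the port opens about 36s after the container starts, while three failed liveness probes can already have accumulated at about 25s. Adding the 30s grace period gives roughly 55s, close to the observed 58s lifetime, and exit code 137 corresponds to signal 9 (SIGKILL). That pattern would be consistent with the container being killed for a slow start instead of crashing on its own, which would also explain why the logs contain no error. The 100m CPU limit may contribute to the slow start, but nothing in the output confirms that.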

Expected behavior
The RisingWave cluster is created successfully on GKE.


@JashBook JashBook added kind/bug Something isn't working severity/normal User may encounter the same problem labels Oct 16, 2023
@JashBook JashBook added this to the Release 0.7.0 milestone Oct 16, 2023
@ahjing99
Collaborator

Seeing the same error on beta.9

rswave-izvkxg-compactor-0         1/1     Running            0             10m
rswave-izvkxg-compute-0           1/1     Running            0             10m
rswave-izvkxg-connector-0         0/1     CrashLoopBackOff   6 (67s ago)   10m
rswave-izvkxg-frontend-0          1/1     Running            0             10m
rswave-izvkxg-meta-0              1/1     Running            0             10m

➜  ~ k logs rswave-izvkxg-connector-0
2023-10-23 08:23:18,039 INFO  [main] connector.ConnectorService:57 - Server started, listening on 50051
2023-10-23 08:23:19,343 INFO  [main] connector.ConnectorService:64 - Prometheus metrics server started, listening on 50052
➜  ~ k describe rswave-izvkxg-connector-0
error: the server doesn't have a resource type "rswave-izvkxg-connector-0"

@ahjing99 ahjing99 added severity/minor It is better to fix the problem for a better user experience and removed severity/normal User may encounter the same problem labels Oct 23, 2023
@free6om free6om assigned free6om and unassigned shanshanying Oct 23, 2023
@ahjing99 ahjing99 added severity/major Great chance user will encounter the same problem and removed severity/minor It is better to fix the problem for a better user experience labels Oct 27, 2023
@ahjing99
Collaborator

This still fails on 0.7.0-beta.12. The cluster status changes from Running to Updating because one pod keeps crashing.

  `kbcli cluster list rswave-nxdtoh --show-labels  --namespace default `

NAME            NAMESPACE   CLUSTER-DEFINITION   VERSION             TERMINATION-POLICY   STATUS   CREATED-TIME                 LABELS
rswave-nxdtoh   default     risingwave           risingwave-v1.0.0   Halt                          Oct 27,2023 17:17 UTC+0800
cluster_status:Oct
cluster_status:Updating
cluster_status:Updating
cluster_status:Updating
cluster_status:Updating
cluster_status:Updating
cluster_status:Updating
check cluster status done
cluster_status:Running
check pod status

      `kbcli cluster list-instances rswave-nxdtoh --namespace default `

NAME                        NAMESPACE   CLUSTER         COMPONENT   STATUS    ROLE     ACCESSMODE   AZ              CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE   NODE                                                CREATED-TIME
rswave-nxdtoh-compactor-0   default     rswave-nxdtoh   compactor   Running   <none>   <none>       us-central1-c   100m / 100m          1Gi / 1Gi               <none>    gke-yjtest-default-pool-a762d992-6nj9/10.128.0.17   Oct 27,2023 17:17 UTC+0800
rswave-nxdtoh-compute-0     default     rswave-nxdtoh   compute     Running   <none>   <none>       us-central1-c   100m / 100m          1Gi / 1Gi               <none>    gke-yjtest-default-pool-a762d992-pnmx/10.128.0.25   Oct 27,2023 17:17 UTC+0800
rswave-nxdtoh-connector-0   default     rswave-nxdtoh   connector   Running   <none>   <none>       us-central1-c   100m / 100m          1Gi / 1Gi               <none>    gke-yjtest-default-pool-a762d992-6nj9/10.128.0.17   Oct 27,2023 17:17 UTC+0800
rswave-nxdtoh-frontend-0    default     rswave-nxdtoh   frontend    Running   <none>   <none>       us-central1-c   100m / 100m          1Gi / 1Gi               <none>    gke-yjtest-default-pool-a762d992-pnmx/10.128.0.25   Oct 27,2023 17:17 UTC+0800
rswave-nxdtoh-meta-0        default     rswave-nxdtoh   meta        Running   <none>   <none>       us-central1-c   100m / 100m          1Gi / 1Gi               <none>    gke-yjtest-default-pool-a762d992-1r63/10.128.0.16   Oct 27,2023 17:17 UTC+0800
check pod status done
check cluster connect
kbcli unsupported engine type: risingwave
check cluster connect
kbcli unsupported engine type: risingwave
describe cluster

      `kbcli cluster describe rswave-nxdtoh --namespace default `

Name: rswave-nxdtoh	 Created Time: Oct 27,2023 17:17 UTC+0800
NAMESPACE   CLUSTER-DEFINITION   VERSION             STATUS     TERMINATION-POLICY
default     risingwave           risingwave-v1.0.0   Updating   Halt

Endpoints:
COMPONENT   MODE        INTERNAL                                                  EXTERNAL
frontend    ReadWrite   rswave-nxdtoh-frontend.default.svc.cluster.local:4567     <none>
                        rswave-nxdtoh-frontend.default.svc.cluster.local:8080
meta        ReadWrite   rswave-nxdtoh-meta.default.svc.cluster.local:5690         <none>
                        rswave-nxdtoh-meta.default.svc.cluster.local:5691
                        rswave-nxdtoh-meta.default.svc.cluster.local:1250
compute     ReadWrite   rswave-nxdtoh-compute.default.svc.cluster.local:5688      <none>
                        rswave-nxdtoh-compute.default.svc.cluster.local:1222
compactor   ReadWrite   rswave-nxdtoh-compactor.default.svc.cluster.local:6660    <none>
                        rswave-nxdtoh-compactor.default.svc.cluster.local:1260
connector   ReadWrite   rswave-nxdtoh-connector.default.svc.cluster.local:50051   <none>
                        rswave-nxdtoh-connector.default.svc.cluster.local:50052

Topology:
COMPONENT   INSTANCE                    ROLE     STATUS    AZ              NODE                                                CREATED-TIME
compactor   rswave-nxdtoh-compactor-0   <none>   Running   us-central1-c   gke-yjtest-default-pool-a762d992-6nj9/10.128.0.17   Oct 27,2023 17:17 UTC+0800
compute     rswave-nxdtoh-compute-0     <none>   Running   us-central1-c   gke-yjtest-default-pool-a762d992-pnmx/10.128.0.25   Oct 27,2023 17:17 UTC+0800
connector   rswave-nxdtoh-connector-0   <none>   Running   us-central1-c   gke-yjtest-default-pool-a762d992-6nj9/10.128.0.17   Oct 27,2023 17:17 UTC+0800
frontend    rswave-nxdtoh-frontend-0    <none>   Running   us-central1-c   gke-yjtest-default-pool-a762d992-pnmx/10.128.0.25   Oct 27,2023 17:17 UTC+0800
meta        rswave-nxdtoh-meta-0        <none>   Running   us-central1-c   gke-yjtest-default-pool-a762d992-1r63/10.128.0.16   Oct 27,2023 17:17 UTC+0800

Resources Allocation:
COMPONENT   DEDICATED   CPU(REQUEST/LIMIT)   MEMORY(REQUEST/LIMIT)   STORAGE-SIZE   STORAGE-CLASS
frontend    false       100m / 100m          1Gi / 1Gi               <none>         <none>
meta        false       100m / 100m          1Gi / 1Gi               <none>         <none>
compute     false       100m / 100m          1Gi / 1Gi               <none>         <none>
compactor   false       100m / 100m          1Gi / 1Gi               <none>         <none>
connector   false       100m / 100m          1Gi / 1Gi               <none>         <none>

Images:
COMPONENT   TYPE        IMAGE
frontend    frontend    ghcr.io/risingwavelabs/risingwave:v1.0.0
meta        meta        ghcr.io/risingwavelabs/risingwave:v1.0.0
compute     compute     ghcr.io/risingwavelabs/risingwave:v1.0.0
compactor   compactor   ghcr.io/risingwavelabs/risingwave:v1.0.0
connector   connector   ghcr.io/risingwavelabs/risingwave:v1.0.0

Show cluster events: kbcli cluster list-events -n default rswave-nxdtoh

➜  ~ k get pod | grep rswave-nxdtoh-connector-0
rswave-nxdtoh-connector-0         0/1     Running    6 (94s ago)     7m35s
➜  ~ k get pod | grep rswave-nxdtoh
rswave-nxdtoh-compactor-0         1/1     Running    0               7m42s
rswave-nxdtoh-compute-0           1/1     Running    0               7m43s
rswave-nxdtoh-connector-0         0/1     Running    6 (100s ago)    7m41s
rswave-nxdtoh-frontend-0          1/1     Running    0               7m41s
rswave-nxdtoh-meta-0              1/1     Running    0               7m42s

@free6om
Contributor

free6om commented Nov 1, 2023

We need the log of the crashed connector pod.

@ahjing99
Collaborator

ahjing99 commented Nov 1, 2023

There is not much helpful info in the pod logs.

➜  ~ k get pod | grep rswave
rswave-xhimho-compactor-0       1/1     Running            0             7m30s
rswave-xhimho-compute-0         1/1     Running            0             7m30s
rswave-xhimho-connector-0       0/1     CrashLoopBackOff   5 (90s ago)   7m31s
rswave-xhimho-frontend-0        1/1     Running            0             7m29s
rswave-xhimho-frontend-1        1/1     Running            0             7m9s
rswave-xhimho-meta-0            1/1     Running            0             7m29s
➜  ~ k logs rswave-xhimho-connector-0 -f
2023-11-01 02:35:38,544 INFO  [main] connector.ConnectorService:57 - Server started, listening on 50051
2023-11-01 02:35:39,742 INFO  [main] connector.ConnectorService:64 - Prometheus metrics server started, listening on 50052
