
Hubble relay fails on make helm-install-without-tls when deploying locally on Kind cluster #851

Open
SRodi opened this issue Oct 11, 2024 · 10 comments

@SRodi (Member) commented Oct 11, 2024

Describe the bug
When Retina is deployed locally on a Kind cluster with make helm-install-without-tls, the hubble-relay pod repeatedly fails to create a peer client for peer synchronization and eventually goes into CrashLoopBackOff.

To Reproduce

make quick-build
make helm-install-without-tls

Once the control plane is deployed:

hubbleRelayPodName=$(kubectl get pods -l app.kubernetes.io/name=hubble-relay -n kube-system -o jsonpath='{.items[*].metadata.name}')
kubectl logs -n kube-system $hubbleRelayPodName -f

See the error:

level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.kubernetes:80"

Expected behavior
This error should not be present; Hubble should run without TLS so that it can be port-forwarded to the local machine.
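
For reference, the workflow this is meant to enable is roughly the following (a sketch; the hubble-relay service name and local port 4245 are assumptions based on the relay's listen address in the logs further down):

$ kubectl port-forward -n kube-system svc/hubble-relay 4245:4245 &
$ hubble observe --server localhost:4245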

Screenshots

(screenshot attached)

Platform (please complete the following information):

  • OS: WSL2 Ubuntu-24.04
  • Kubernetes Version: Kind (Kubernetes v1.31.0)
  • Host: self-host
  • Retina Version: v0.0.16

Additional context
Related to cilium/cilium#20130

I have tested this on AKS (Kubernetes v1.29.8) and the issue is NOT present there.

SRodi changed the title from "Hubble relay fails on make helm-install-without-tls" to "Hubble relay fails on make helm-install-without-tls when deploying locally on Kind cluster" on Oct 11, 2024
@GuessWhoSamFoo commented

Managed to get this working on kind 0.24.0 (k8s 1.31):

level=info msg="Starting gRPC health server..." addr=":4222" subsys=hubble-relay
level=info msg="Starting gRPC server..." options="{peerTarget:hubble-peer.kube-system.svc.cluster.local.:80 dialTimeout:5000000000 retryTimeout:30000000000 listenAddress::4245 healthListenAddress::4222 metricsListenAddress: log:0xc0002da540 serverTLSConfig:<nil> insecureServer:true clientTLSConfig:<nil> clusterName:default insecureClient:true observerOptions:[0x1f02b40 0x1f02c20] grpcMetrics:<nil> grpcUnaryInterceptors:[] grpcStreamInterceptors:[]}" subsys=hubble-relay
level=info msg="Received peer change notification" change notification="name:\"kind-control-plane\" address:\"192.168.176.2\" type:PEER_ADDED" subsys=hubble-relay
level=info msg="Received peer change notification" change notification="name:\"kind-worker\" address:\"192.168.176.5\" type:PEER_ADDED" subsys=hubble-relay
level=info msg="Received peer change notification" change notification="name:\"kind-worker2\" address:\"192.168.176.3\" type:PEER_ADDED" subsys=hubble-relay
level=info msg="Received peer change notification" change notification="name:\"kind-worker3\" address:\"192.168.176.4\" type:PEER_ADDED" subsys=hubble-relay
level=info msg=Connecting address="192.168.176.4:4244" hubble-tls=false peer=kind-worker3 subsys=hubble-relay
level=info msg=Connecting address="192.168.176.2:4244" hubble-tls=false peer=kind-control-plane subsys=hubble-relay
level=info msg=Connecting address="192.168.176.3:4244" hubble-tls=false peer=kind-worker2 subsys=hubble-relay
level=info msg=Connecting address="192.168.176.5:4244" hubble-tls=false peer=kind-worker subsys=hubble-relay
NAME              READY   UP-TO-DATE   AVAILABLE   AGE     CONTAINERS         IMAGES                                                                                                  SELECTOR
coredns           2/2     2            2           3h58m   coredns            registry.k8s.io/coredns/coredns:v1.11.1                                                                 k8s-app=kube-dns
hubble-relay      1/1     1            1           3h50m   hubble-relay       mcr.microsoft.com/oss/cilium/hubble-relay:v1.15.0                                                       k8s-app=hubble-relay
hubble-ui         1/1     1            1           3h50m   frontend,backend   mcr.microsoft.com/oss/cilium/hubble-ui:v0.12.2,mcr.microsoft.com/oss/cilium/hubble-ui-backend:v0.12.2   k8s-app=hubble-ui
retina-operator   1/1     1            1           3h50m   retina-operator    ghcr.io/guesswhosamfoo/retina/retina-operator:v0.0.16-116-gecdabdb-linux-amd64                          control-plane=retina-operator

DNS resolves as expected:

bash-5.0# nslookup hubble-peer.kube-system.svc.cluster.local
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	hubble-peer.kube-system.svc.cluster.local
Address: 10.96.158.48

It is worth calling out an issue in the workflow written above:

make quick-build
make helm-install-without-tls

make quick-build tags Retina images via git describe --tags --always, whereas make helm-install-without-tls can default to the latest release tag. This mismatch can result in image pull errors for the retina operator/init pods, which in turn surface as "Failed to create peer client for peers synchronization" in hubble-relay; see the comparison below.
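
A quick way to compare the two tags (a sketch; the release-lookup command is taken from the project's Makefile and assumes curl and jq are installed):

$ # tag that quick-build uses, derived from git
$ git describe --tags --always
$ # tag that helm-install-without-tls defaults to (latest GitHub release)
$ curl -s https://api.github.com/repos/microsoft/retina/releases | jq -r '.[0].name'

If these differ and the locally built images were never pushed or kind-loaded, the pulls will fail.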

@timraymond (Member) commented

@SRodi can you try this again? I believe it may have been related to #921 but want to confirm.

@SRodi (Member, Author) commented Nov 5, 2024

@timraymond I confirm I still see this issue on my Kind cluster.

(screenshot: the hubble-relay error, with the image tag in use visible in the bottom terminal)

@timraymond (Member) commented

@SRodi okay, thanks--good to know. I'll try to repro on mine.

@timraymond (Member) commented

@SRodi I believe the root cause here is this:

retina/Makefile, lines 417 to 419 at commit 5552182:

LATEST_TAG := $(shell curl -s https://api.github.com/repos/microsoft/retina/releases | jq -r '.[0].name')
HELM_IMAGE_TAG ?= $(LATEST_TAG)

The problem is that you're building images from main, but using tags from the latest release, which are getting pulled from MCR--so you're not using the images you're building. Those likely had the aforementioned issue (#921), which explains the similar presentation.

I don't remember why we need to use the latest GH release here--do you, @jimassa? (found via git blame)

At any rate, it seems that the "developer-friendly" install command here is make quick-deploy-hubble, which sets HELM_IMAGE_TAG to $(TAG)-linux-amd64, where TAG is derived from git locally.
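
For illustration, that is roughly equivalent to running the following (a sketch based on the Makefile excerpt quoted in a later comment):

$ TAG=$(git describe --tags --always)
$ make helm-install-without-tls HELM_IMAGE_TAG=${TAG}-linux-amd64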

timraymond self-assigned this Nov 7, 2024
@SRodi (Member, Author) commented Nov 7, 2024

Hi @timraymond, I am not using tags from the latest release. Please see the screenshot I included in my last comment; I purposefully left the bottom terminal visible so that you can see the image tag used.

Also, the quick-deploy-hubble target uses make helm-install-without-tls, which is why I referenced it here.

retina/Makefile, lines 559 to 562 at commit 5552182:

.PHONY: quick-deploy-hubble
quick-deploy-hubble:
	$(MAKE) helm-uninstall || true
	$(MAKE) helm-install-without-tls HELM_IMAGE_TAG=$(TAG)-linux-amd64

Can you please try to test it on your Kind cluster?

@timraymond (Member) commented

@SRodi I tried following the provided repro steps (with a slight modification for paste-friendliness) again, and I cannot get this to repro:

$  make quick-build > /dev/null 2>&1 && echo "build success" || echo "build failed"
build success
$ make helm-install-without-tls
rm -rf /home/nixos/repos/msft/retina/.certs
hubble config reset tls-client-cert-file;    hubble config reset tls-client-key-file;    hubble config reset tls-ca-cert-files;
hubble config set tls false
hubble config reset tls-server-name
make helm-install-hubble ENABLE_TLS=false
make[1]: Entering directory '/home/nixos/repos/msft/retina'
helm upgrade --install retina ./deploy/hubble/manifests/controller/helm/retina/ \
        --namespace kube-system \
        --set os.windows=true \
        --set operator.enabled=true \
        --set operator.repository=ghcr.io/microsoft/retina/retina-operator \
        --set operator.tag=v0.0.16 \
        --set agent.enabled=true \
        --set agent.repository=ghcr.io/microsoft/retina/retina-agent \
        --set agent.tag=v0.0.16 \
        --set agent.init.enabled=true \
        --set agent.init.repository=ghcr.io/microsoft/retina/retina-init \
        --set agent.init.tag=v0.0.16 \
        --set logLevel=info \
        --set hubble.tls.enabled=false \
        --set hubble.relay.tls.server.enabled=false \
        --set hubble.tls.auto.enabled=false \
        --set hubble.tls.auto.method=cronJob \
        --set hubble.tls.auto.certValidityDuration=1 \
        --set hubble.tls.auto.schedule="*/10 * * * *"
Release "retina" does not exist. Installing it now.
NAME: retina
LAST DEPLOYED: Tue Nov 12 09:54:35 2024
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
make[1]: Leaving directory '/home/nixos/repos/msft/retina'

That said, it does appear that HELM_IMAGE_TAG is being overridden outside the repro steps, so I took the liberty of doing the same to see if I could repro. I loaded the images built locally from tip of main with kind load, then did:

$ HELM_IMAGE_TAG=v0.0.16-152-g4f3dcb5-linux-amd64 make helm-install-without-tls
rm -rf /home/nixos/repos/msft/retina/.certs
hubble config reset tls-client-cert-file;    hubble config reset tls-client-key-file;    hubble config reset tls-ca-cert-files;
hubble config set tls false
hubble config reset tls-server-name
make helm-install-hubble ENABLE_TLS=false
make[1]: Entering directory '/home/nixos/repos/msft/retina'
helm upgrade --install retina ./deploy/hubble/manifests/controller/helm/retina/ \
        --namespace kube-system \
        --set os.windows=true \
        --set operator.enabled=true \
        --set operator.repository=ghcr.io/microsoft/retina/retina-operator \
        --set operator.tag=v0.0.16-152-g4f3dcb5-linux-amd64 \
        --set agent.enabled=true \
        --set agent.repository=ghcr.io/microsoft/retina/retina-agent \
        --set agent.tag=v0.0.16-152-g4f3dcb5-linux-amd64 \
        --set agent.init.enabled=true \
        --set agent.init.repository=ghcr.io/microsoft/retina/retina-init \
        --set agent.init.tag=v0.0.16-152-g4f3dcb5-linux-amd64 \
        --set logLevel=info \
        --set hubble.tls.enabled=false \
        --set hubble.relay.tls.server.enabled=false \
        --set hubble.tls.auto.enabled=false \
        --set hubble.tls.auto.method=cronJob \
        --set hubble.tls.auto.certValidityDuration=1 \
        --set hubble.tls.auto.schedule="*/10 * * * *"
Release "retina" has been upgraded. Happy Helming!
NAME: retina
LAST DEPLOYED: Tue Nov 12 10:05:59 2024
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
make[1]: Leaving directory '/home/nixos/repos/msft/retina'

This also succeeds, so I'm not sure how to proceed here. Perhaps there are more details that can help repro this?

@SRodi (Member, Author) commented Nov 13, 2024

Thanks for looking into this @timraymond!

The issue is not with the Helm deployment; in fact, that is consistently successful. The issue is with the hubble-relay pod, which errors and finally goes into CrashLoopBackOff:

level=info msg="Starting gRPC health server..." addr=":4222" subsys=hubble-relay
level=info msg="Starting gRPC server..." options="{peerTarget:hubble-peer.kube-system.svc.cluster.local:80 dialTimeout:5000000000 retryTimeout:30000000000 listenAddress::4245 healthListenAddress::4222 metricsListenAddress: log:0xc0003a4460 serverTLSConfig:<nil> insecu
level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:80"
level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="hubble-peer.kube-system.svc.cluster.local:80"
level=info msg="Stopping server..." subsys=hubble-relay                                                                                                                                                                                                                      
level=info msg="Server stopped" subsys=hubble-relay                                                                                                                                                                                                                          
Stream closed EOF for kube-system/hubble-relay-54ffdcb7ff-c6p4w (hubble-relay) 
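
One way to watch the pod cycle into CrashLoopBackOff (using the same label selector as in the repro steps above):

$ kubectl get pods -n kube-system -l app.kubernetes.io/name=hubble-relay -w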

@SRodi (Member, Author) commented Nov 13, 2024

> That said, it does appear that HELM_IMAGE_TAG is being overridden outside the repro steps

@timraymond If you are running on Linux or WSL2, then make quick-deploy-hubble will work fine, provided these env variables are unset:

unset TAG IMAGE_NAMESPACE

This is what the make quick-deploy-hubble target would execute:

TAG ?= $(shell git describe --tags --always)
make helm-uninstall || true
make helm-install-without-tls HELM_IMAGE_TAG=$(TAG)-linux-amd64

In your case this would result in the following:

make helm-uninstall
make helm-install-without-tls HELM_IMAGE_TAG=v0.0.16-152-g4f3dcb5-linux-amd64

@timraymond (Member) commented

@SRodi Understood--the discussion on Helm is so that I can get a successful repro of the broken environment to debug this. What I'm trying to rule out first is a version incompatibility in gRPC. If Retina and Hubble Relay are both not using TLS, then this is the most likely root cause. Alternatively, can you provide the version of gRPC used by hubble-relay in your setup and the version of gRPC used by Retina? Specific commits of each image could work as well, since I could trace the version dependencies backward from there.
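
For what it's worth, one way to extract the gRPC version from a Go binary shipped in an image (a sketch; the binary path inside the hubble-relay image is an assumption, and it requires docker plus a local go toolchain):

$ docker create --name relay mcr.microsoft.com/oss/cilium/hubble-relay:v1.15.0
$ docker cp relay:/usr/bin/hubble-relay . && docker rm relay
$ # go version -m lists the modules compiled into the binary
$ go version -m ./hubble-relay | grep google.golang.org/grpc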
