From 22012fcdc1ba2a93a32b4c317e7be799f806ad59 Mon Sep 17 00:00:00 2001 From: dkeightley Date: Wed, 2 Oct 2024 16:41:17 +1300 Subject: [PATCH] Updates --- README.md | 25 ++- docs/index.html | 165 ++++++++++++++++-- docs/tcp-full-session.png | Bin 0 -> 109395 bytes docs/tcp-handshake.png | Bin 0 -> 32506 bytes docs/webhook-detailed.jpeg | Bin 0 -> 419432 bytes docs/webhook.jpeg | Bin 0 -> 47042 bytes .../lab-deployments/instructor-notes.md | 6 +- .../lab-2c-container-image/app.py | 23 --- instructors/pre-work/main.tf | 6 + 9 files changed, 175 insertions(+), 50 deletions(-) create mode 100644 docs/tcp-full-session.png create mode 100644 docs/tcp-handshake.png create mode 100644 docs/webhook-detailed.jpeg create mode 100644 docs/webhook.jpeg delete mode 100644 instructors/lab-deployments/lab-2c-container-image/app.py diff --git a/README.md b/README.md index 9322ba6..2dd49f7 100644 --- a/README.md +++ b/README.md @@ -20,30 +20,39 @@ In this lab session we will aim to complete three lab exercises using a pre-crea ### 1 - Troubleshoot an unknown issue -The issue is known to affect deployments in the `deployment-lab` namespace. 
+The issue is known to affect changes in the `deployment-lab` namespace, such as creating new deployments - - Repro the issue by creating a deployment in the `deployment-lab` namespace + - Repro the issue using the `deployment-lab` namespace - Investigate further to understand the issue + - Make changes as needed to resolve the issue ### 2 - Troubleshoot network issues -Deployments have been created in the `lab` namespace of your downstream cluster +All deployments for this lab have been created in the `lab` namespace of your downstream cluster. Make changes as needed to resolve the issues -#### 2a - Network issue +**Note**: there is a test pod available, created by a deployment named `test-pod`, that should be used for troubleshooting connectivity -Using a test pod, troubleshoot an issue with a deployment named `lab-a` +#### 2a - Network issue -Note, there is a test pod available, created by a deployment named `test-pod` that can be used +- Troubleshoot a connectivity issue with a service and deployment named `lab-a` #### 2b - Network issue +- Troubleshoot a connectivity issue with a service and deployment named `lab-b` + #### 2c - Network issue +- Troubleshoot a connectivity issue with a service and deployment named `lab-c` + +What can be done to resolve this issue? + #### 2d - Network issue +- The Rancher website (rancher.com) is reported to fail from pods in this environment. Why is the website failing? + # Bonus Rounds -### 3 - Create a downstream cluster (terraform) +### 4 - Create a downstream cluster (terraform) Pre-work: - Clone this repo to your downstream cluster node: `git clone https://github.com/rancherlabs/cfl-summit-lab.git` @@ -53,7 +62,7 @@ Pre-work: 2. Initialise the terraform modules: `terraform init` 3. 
Create the resources: `terraform apply` -### 4 - Deploy an application using Fleet +### 5 - Deploy an application using Fleet Instructions to deploy a webserver using fleet are provided in this link * https://github.com/rancherlabs/cfl-summit-lab/blob/main/bonus-deploy-an-application-using-fleet/README.md diff --git a/docs/index.html b/docs/index.html index d239d9c..27a94a5 100644 --- a/docs/index.html +++ b/docs/index.html @@ -17,7 +17,9 @@ img { max-width: 100%; } - + code { + white-space : pre-wrap !important; + } @@ -38,8 +40,6 @@ # Webhooks in Kubernetes - - -- ## What are they? @@ -56,7 +56,7 @@ ??? -A basic explanation +A basic definition of webhooks, more details in the following slides --- @@ -81,18 +81,55 @@ class: center, middle -![basic-view](https://miro.medium.com/v2/resize:fit:4800/format:webp/0*rKDzcFeAFWuYsFeg.jpg) +![basic-view](webhook.jpeg) --- class: center, middle -![detailed-view](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*tFRqBPkv9X4Y8RO7agtcWw.jpeg) +![detailed-view](webhook-detailed.jpeg) More info on webhooks https://book-v1.book.kubebuilder.io/beyond_basics/what_is_a_webhook +https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/ + +--- + +## Common network issues + +--- + +## Wait, first let's take a look at this + +--- + +## TCP 3-way handshake + +![tcphandshake](tcp-handshake.png) + +??? + +Cover the 3-way handshake (SYN, SYN-ACK, ACK) +- Destination starts an app which binds to a port (a TCB, or transmission control block, is allocated) and listens +- Source creates a TCB, assigns a source port, sends a SYN packet to the destination and port + +Come back to this slide if it's useful to reference in the following slides + +--- + +## Full TCP session lifecycle + +![tcpsession](tcp-full-session.png) + +??? 
+ +- L/H side shows what we covered in the previous slide, the session is started after the handshake +- In between is the data transmission - what matters +- R/H side is the closure of the session - in a nice polite way +- Not all TCP sessions close this way, often abruptly + --- ## Common network issues @@ -101,9 +138,9 @@ Ask the audience about their understanding of what these messages mean --- +--- -- connection refused +### connection refused ```bash # kubectl describe pod -n kube-system rke2-canal-zoidberg @@ -114,21 +151,117 @@ zapp.brannig.an rke2[2783338]: {"level":"warn","ts":"2024-09-27T11:57:31.335237-0400","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xcd34db33ff/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""} ``` --- +??? -- i/o timeout +Can be a few reasons: +- (most common) The destination port is not bound by the destination host +- The destination kernel refuses connections due to a backlog of queued connections +- A firewall rule with a `REJECT` rather than `DROP` --- +This can be temporary, for example if ingress-nginx is restarting, the port will not be bound for a short period + +--- -- connection reset by peer +### i/o timeout --- +```bash +# curl localhost:8080 +curl: (28) Connection timed out after 2005 milliseconds +``` -- no route to host +``` +[ERROR] plugin/errors: 2 3994503566595593402.4565890997905689978. 
HINFO: read udp 10.42.2.188:45439->10.17.130.43:53: i/o timeout +``` --- +``` +E0912 19:08:00.809037 1 run.go:74] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 127.0.0.1:6443: i/o timeout" +``` + +``` +2024-07-21T14:36:07.543416281Z E0721 14:36:07.543243 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers) +``` + +??? + +Can be a few reasons: +- No SYN-ACK response from the destination host; more context about the connectivity path is needed to understand the cause +- Firewall rule with `DROP`: the source SYN packet is silently discarded, so no SYN-ACK is returned +- Destination host is under load, app doesn't reply in the timeout period + +--- + +### connection reset by peer + +``` +philip-j-fry.com rancher-system-agent[14615]: time="2024-09-27T20:07:43-05:00" level=fatal msg="[K8s] encountered an error while attempting to update the secret: Put \"https://leela.bender.com/api/v1/namespaces/fleet-default/secrets/custom-8b78ea0e6d6d-machine-plan\": read tcp 10.47.248.198:59390->10.47.130.35:443: read: connection reset by peer" +``` + +``` +ERROR: https://prof-farmsworth.edu/ping is not accessible (Recv failure: Connection reset by peer) +``` + +``` +2024/07/26 14:46:51 [error] 29#29: *283832 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.224.253, server: hermes-conrad.jm, request: "GET /apis/snapshot.storage.k8s.io/v1beta1?timeout=32s HTTP/1.1", upstream: "http://10.42.4.181:80/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s", host: "hermes-conrad.jm" +``` + +??? 
+ +Equivalent to hanging up the phone on a caller: an abrupt closure of the TCP session, almost always by the destination, but it can come from the source as well +- A TCP packet was sent by the destination with the RST (reset) flag set, indicating a forced immediate closure +- More polite than not sending anything (a timeout), and can give more context for troubleshooting + +--- + +### no route to host + +``` +2024-06-25T18:57:02.136622425-05:00 stderr F W0625 23:57:02.136398 1 egress_controller.go:1001] Failed to start watch for EgressGroup: Get "https://10.43.131.109:443/apis/controlplane.antrea.io/v1beta2/egressgroups?fieldSelector=nodeName%3Dsomething-47024a6c-xdrfv&watch=true": dial tcp 10.43.131.109:443: connect: no route to host +``` + +``` +[ERROR] plugin/errors: 2 3641072525830743004.8496191176616642290. HINFO: read udp 10.42.1.253:43929->8.8.8.8:53: read: no route to host +``` + +``` +2024/04/05 03:18:05 [error] 2681#2681: *2305048 connect() failed (113: No route to host) while connecting to upstream, client: 10.2.176.17, server: anchovies-on-pizza.it, request: "GET /hello HTTP/2.0", upstream: "http://10.42.96.250:1234/hello", host: "anchovies-on-pizza.it" +``` + + +??? 
+ +Uncommon, but it does occur. Some causes: +- A genuine issue with routes in the OS main route table or pod network sandbox +- A firewall rule with a REJECT type that misleads source clients; firewalls commonly add rules with `--reject-with icmp-host-prohibited` + +--- + +### dns failure + +``` +Oct 17 20:36:18 old-bessie-1 rke2[12378]: time="2022-10-17T20:36:18Z" level=warning msg="Failed to get image from endpoint: Get \"https://planet.express.com/v2/\": dial tcp: lookup planet.express.com: i/o timeout" +``` + +``` +Post "http://api.prod.domain.local/admin": dial tcp: lookup api.prod.domain.local: no such host +``` + +``` +Caused by: java.net.UnknownHostException: foo.bar.com + at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_211] + at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211] + [...] +``` + +??? + +- The first code block is interesting: golang logs can be a bit misleading. The key word here is `lookup`, which indicates the i/o timeout occurred during the DNS lookup rather than the connection itself. Also note that the hostname is reported; if DNS is successful, a destination IP is reported instead + +Lots of potential causes: +- Try to triangulate: if the issue is affecting pods, try to determine if it's internal vs external or both +- Based on the above, focus on the key areas: + - For external, checking coredns logs is often a useful first step, and verifying from another host on the network + - For internal, checking against each coredns pod (endpoint) to rule out pod/overlay network issues -- dns failure