Commit

Updates
dkeightley committed Oct 2, 2024
1 parent 0d7e1eb commit 22012fc
Showing 9 changed files with 175 additions and 50 deletions.
25 changes: 17 additions & 8 deletions README.md
@@ -20,30 +20,39 @@ In this lab session we will aim to complete three lab exercises using a pre-crea

### 1 - Troubleshoot an unknown issue

The issue is known to affect deployments in the `deployment-lab` namespace.
The issue is known to affect changes in the `deployment-lab` namespace, such as creating new deployments.

- Repro the issue by creating a deployment in the `deployment-lab` namespace
- Repro the issue using the `deployment-lab` namespace
- Investigate further to understand the issue
- Make changes as needed to resolve the issues

### 2 - Troubleshoot network issues

Deployments have been created in the `lab` namespace of your downstream cluster
All deployments for this lab have been created in the `lab` namespace of your downstream cluster. Make changes as needed to resolve the issues.

#### 2a - Network issue
**Note**: a test pod, created by a deployment named `test-pod`, is available and should be used for troubleshooting connectivity.

Using a test pod, troubleshoot an issue with a deployment named `lab-a`
#### 2a - Network issue

Note, there is a test pod available, created by a deployment named `test-pod` that can be used
- Troubleshoot a connectivity issue with a service and deployment named `lab-a`

#### 2b - Network issue

- Troubleshoot a connectivity issue with a service and deployment named `lab-b`

#### 2c - Network issue

- Troubleshoot a connectivity issue with a service and deployment named `lab-c`

What can be done to resolve this issue?

#### 2d - Network issue

- The Rancher website (rancher.com) is reported to fail from pods in this environment. Why is the website failing?

# Bonus Rounds

### 3 - Create a downstream cluster (terraform)
### 4 - Create a downstream cluster (terraform)

Pre-work:
- Clone this repo to your downstream cluster node: `git clone https://github.com/rancherlabs/cfl-summit-lab.git`
@@ -53,7 +62,7 @@ Pre-work:
2. Initialise the terraform modules: `terraform init`
3. Create the resources: `terraform apply`

### 4 - Deploy an application using Fleet
### 5 - Deploy an application using Fleet

Instructions to deploy a webserver using fleet are provided in this link
* https://github.com/rancherlabs/cfl-summit-lab/blob/main/bonus-deploy-an-application-using-fleet/README.md
165 changes: 149 additions & 16 deletions docs/index.html
@@ -17,7 +17,9 @@
img {
max-width: 100%;
}

code {
white-space : pre-wrap !important;
}
</style>
</head>
<body>
@@ -38,8 +40,6 @@

# Webhooks in Kubernetes



--

## What are they?
@@ -56,7 +56,7 @@

???

A basic explanation
A basic definition of webhooks; more details follow in the next slides

---

@@ -81,18 +81,55 @@

class: center, middle

![basic-view](https://miro.medium.com/v2/resize:fit:4800/format:webp/0*rKDzcFeAFWuYsFeg.jpg)
![basic-view](webhook.jpeg)

---

class: center, middle

![detailed-view](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*tFRqBPkv9X4Y8RO7agtcWw.jpeg)
![detailed-view](webhook-detailed.jpeg)

More info on webhooks

https://book-v1.book.kubebuilder.io/beyond_basics/what_is_a_webhook

https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/

---

## Common network issues

---

## Wait, first let's take a look at this

---

## TCP 3-way handshake

![tcphandshake](tcp-handshake.png)

???

Cover the 3-way handshake (SYN, SYN-ACK, ACK)
- Destination starts an app which binds to a port and listens (a TCB, or transmission control block, is allocated)
- Source creates its own TCB, assigns a source port, and sends a SYN packet to the destination IP and port

Come back to this slide if it's useful to reference in the following slides
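
To make the two roles above concrete, here is a minimal sketch over loopback. The function name `loopback_handshake` is hypothetical and not part of the lab; the OS performs the SYN, SYN-ACK, ACK exchange itself, so the code only shows which side binds a port and which side is assigned one.

```python
import socket


def loopback_handshake():
    """Open a TCP connection over loopback and report the ports involved."""
    # Destination side: bind to a port and listen (port 0 lets the OS pick one).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(2)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    dest_port = listener.getsockname()[1]

    # Source side: connect() sends the SYN; the OS assigns the source port.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2)
    client.connect(("127.0.0.1", dest_port))
    source_port = client.getsockname()[1]

    # The kernel has completed the handshake; accept() hands the session to the app.
    conn, peer = listener.accept()
    result = {"dest_port": dest_port, "source_port": source_port, "peer_port": peer[1]}
    for sock in (conn, client, listener):
        sock.close()
    return result


if __name__ == "__main__":
    print(loopback_handshake())
```

The port the destination sees as the peer is the source port the OS assigned to the client, which is the pairing that shows up in log lines such as `10.47.248.198:59390->10.47.130.35:443`.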

---

## Full TCP session lifecycle

![tcpsession](tcp-full-session.png)

???

- L/H side shows what we covered in the previous slide, the session is started after the handshake
- In between is the data transmission - what matters
- R/H side is the closure of the session - in a nice polite way
- Not all TCP sessions close this way; many end abruptly

---

## Common network issues
@@ -101,9 +138,9 @@

Ask the audience about their understanding of what these messages mean

--
---

- connection refused
### connection refused

```bash
# kubectl describe pod -n kube-system rke2-canal-zoidberg
@@ -114,21 +151,117 @@
zapp.brannig.an rke2[2783338]: {"level":"warn","ts":"2024-09-27T11:57:31.335237-0400","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xcd34db33ff/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
```

--
???

- i/o timeout
Can be a few reasons:
- (most common) The destination port is not bound by the destination host
- The destination kernel refuses connections due to a backlog of queued connections
- A firewall rule with a `REJECT` rather than `DROP`

--
This can be temporary, for example if ingress-nginx is restarting, the port will not be bound for a short period
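
The most common cause above (nothing bound to the destination port) can be reproduced locally. This is a rough sketch, assuming a Linux-like host; `probe` and `unbound_port` are hypothetical helper names, not part of the lab.

```python
import socket


def probe(host, port, timeout=2.0):
    """Attempt a TCP connection and return a short label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "connection refused"
    except socket.timeout:
        return "timeout"


def unbound_port():
    """Return a loopback port that nothing is listening on."""
    # Bind to port 0 so the OS picks a free port, then close the socket
    # again so that no listener remains on it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print(probe("127.0.0.1", unbound_port()))
```

The refusal comes back immediately rather than after a wait, which is what distinguishes it from the timeout case.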

---

- connection reset by peer
### i/o timeout

--
```bash
# curl localhost:8080
curl: (28) Connection timed out after 2005 milliseconds
```

- no route to host
```
[ERROR] plugin/errors: 2 3994503566595593402.4565890997905689978. HINFO: read udp 10.42.2.188:45439->10.17.130.43:53: i/o timeout
```

--
```
E0912 19:08:00.809037 1 run.go:74] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://127.0.0.1:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp 127.0.0.1:6443: i/o timeout"
```

```
2024-07-21T14:36:07.543416281Z E0721 14:36:07.543243 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```

???

Can be a few reasons:
- No SYN-ACK response from the destination host; more context about the connectivity path is needed to understand the cause
- A firewall rule with `DROP` silently discards the source SYN packet, so no SYN-ACK is returned
- Destination host is under load, app doesn't reply in the timeout period
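
The last cause above (the app does not reply within the timeout period) is the easiest one to simulate. This sketch uses a listener that never answers; `read_with_timeout` is a hypothetical name, and it does not reproduce the `DROP` case, which needs a firewall rule.

```python
import socket


def read_with_timeout(timeout=0.5):
    """Connect to a listener that never replies and wait for a response."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # The kernel completes the handshake even though the app never calls accept().
    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    try:
        client.sendall(b"ping")
        client.recv(1)  # nothing is ever sent back, so this waits until the timeout
        return "got a reply"
    except socket.timeout:
        return "timeout"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(read_with_timeout())
```

Note that the connection itself succeeds here; only the read times out, similar to the `Client.Timeout exceeded while awaiting headers` example.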

---

### connection reset by peer

```
philip-j-fry.com rancher-system-agent[14615]: time="2024-09-27T20:07:43-05:00" level=fatal msg="[K8s] encountered an error while attempting to update the secret: Put \"https://leela.bender.com/api/v1/namespaces/fleet-default/secrets/custom-8b78ea0e6d6d-machine-plan\": read tcp 10.47.248.198:59390->10.47.130.35:443: read: connection reset by peer"
```

```
ERROR: https://prof-farmsworth.edu/ping is not accessible (Recv failure: Connection reset by peer)
```

```
2024/07/26 14:46:51 [error] 29#29: *283832 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.16.224.253, server: hermes-conrad.jm, request: "GET /apis/snapshot.storage.k8s.io/v1beta1?timeout=32s HTTP/1.1", upstream: "http://10.42.4.181:80/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s", host: "hermes-conrad.jm"
```

???

Equivalent to hanging up the phone on a caller: an abrupt closure of the TCP session, almost always by the destination, but it can come from the source as well
- A TCP packet was sent by the destination with the RST (reset) flag set, indicating a forced immediate closure
- More polite than not sending anything (a timeout), can give more context for troubleshooting
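
A reset can be provoked on purpose, which may help when demonstrating the RST flag. This is a sketch assuming Linux or macOS, where `SO_LINGER` with a zero linger time makes `close()` send RST instead of FIN; `reset_by_peer` is a hypothetical name.

```python
import socket
import struct


def reset_by_peer():
    """Have the destination close abruptly and report what the source sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(2)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=2)
    conn, _ = listener.accept()

    # Linger enabled with a zero timeout: close() discards the session and
    # sends a packet with the RST flag set instead of the polite FIN exchange.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

    try:
        client.recv(1)
        return "closed cleanly"
    except ConnectionResetError:
        return "connection reset by peer"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(reset_by_peer())
```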

---

### no route to host

```
2024-06-25T18:57:02.136622425-05:00 stderr F W0625 23:57:02.136398 1 egress_controller.go:1001] Failed to start watch for EgressGroup: Get "https://10.43.131.109:443/apis/controlplane.antrea.io/v1beta2/egressgroups?fieldSelector=nodeName%3Dsomething-47024a6c-xdrfv&watch=true": dial tcp 10.43.131.109:443: connect: no route to host
```

```
[ERROR] plugin/errors: 2 3641072525830743004.8496191176616642290. HINFO: read udp 10.42.1.253:43929->8.8.8.8:53: read: no route to host
```

```
2024/04/05 03:18:05 [error] 2681#2681: *2305048 connect() failed (113: No route to host) while connecting to upstream, client: 10.2.176.17, server: anchovies-on-pizza.it, request: "GET /hello HTTP/2.0", upstream: "http://10.42.96.250:1234/hello", host: "anchovies-on-pizza.it"
```


???

Uncommon, but it does occur. Some causes:
- A genuine issue with routes in the OS main route table or pod network sandbox
- A firewall rule with a `REJECT` target that misleads source clients; firewalls commonly add rules with `--reject-with icmp-host-prohibited`
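
The nginx examples in these slides include the OS error number, for example `(113: No route to host)` and `(104: Connection reset by peer)`. On Linux these numbers correspond to `errno.EHOSTUNREACH` and `errno.ECONNRESET`. As a rough sketch, the messages covered in this section could be told apart in code like this; `classify` and `LABELS` are hypothetical names used only for illustration.

```python
import errno
import socket

# Map OS error codes to the messages discussed in these slides.
LABELS = {
    errno.ECONNREFUSED: "connection refused",
    errno.ECONNRESET: "connection reset by peer",
    errno.EHOSTUNREACH: "no route to host",
    errno.ETIMEDOUT: "i/o timeout",
}


def classify(exc):
    """Return a short label for a socket-related exception."""
    # gaierror is raised by name resolution, so check it before the generic OSError.
    if isinstance(exc, socket.gaierror):
        return "dns failure"
    if isinstance(exc, socket.timeout):
        return "i/o timeout"
    if isinstance(exc, OSError):
        return LABELS.get(exc.errno, "other")
    return "other"


if __name__ == "__main__":
    print(classify(OSError(errno.EHOSTUNREACH, "No route to host")))
```

The label only names the symptom. As noted above, a `REJECT` rule can produce "no route to host" even when the route table is fine.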

---

### dns failure

```
Oct 17 20:36:18 old-bessie-1 rke2[12378]: time="2022-10-17T20:36:18Z" level=warning msg="Failed to get image from endpoint: Get \"https://planet.express.com/v2/\": dial tcp: lookup planet.express.com: i/o timeout"
```

```
Post "http://api.prod.domain.local/admin": dial tcp: lookup api.prod.domain.local: no such host
```

```
Caused by: java.net.UnknownHostException: foo.bar.com
at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_211]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211]
[...]
```

???

- The first code block is interesting because golang logs can be a bit misleading. The key word here is `lookup`: it indicates the i/o timeout happened during the DNS lookup, not while connecting. Also note that the hostname is reported; when DNS succeeds, a destination IP is reported instead

Lots of potential causes:
- Try to triangulate: if the issue is affecting pods, determine whether it affects internal names, external names, or both
- Based on the above, focus on the key areas:
  - For external, checking coredns logs is often a useful first step, as is verifying from another host on the network
  - For internal, checking against each coredns pod (endpoint) to rule out pod or overlay network issues
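
The `lookup` hint from the first note can be applied mechanically when scanning logs. This sketch assumes the Go `dial tcp` message shapes shown in the examples in these slides; `dial_failure_phase` is a hypothetical helper, not an existing tool.

```python
import re


def dial_failure_phase(message):
    """Return ("dns", hostname) or ("connect", address) for a Go 'dial tcp' error."""
    # DNS phase: Go reports "dial tcp: lookup <hostname>: <reason>".
    match = re.search(r"dial tcp: lookup ([^\s:]+)", message)
    if match:
        return ("dns", match.group(1))
    # Connect phase: Go reports "dial tcp <ip>:<port>: <reason>".
    match = re.search(r"dial tcp (\S+?): ", message)
    if match:
        return ("connect", match.group(1))
    return ("unknown", None)


if __name__ == "__main__":
    print(dial_failure_phase("dial tcp: lookup planet.express.com: i/o timeout"))
    print(dial_failure_phase("dial tcp 127.0.0.1:6443: i/o timeout"))
```

Both sample messages end in `i/o timeout`, but only the first one points at DNS.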

- dns failure

</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js">
Binary file added docs/tcp-full-session.png
Binary file added docs/tcp-handshake.png
Binary file added docs/webhook-detailed.jpeg
Binary file added docs/webhook.jpeg
6 changes: 3 additions & 3 deletions instructors/lab-deployments/instructor-notes.md
@@ -18,6 +18,6 @@ A network policy denies traffic to the pod label

A custom app resets connections

- lab-c -- no route to host
- lab-d -- dns failure
- lab-e
- lab-d -- no route to host, dns failure

A firewall rule is rejecting traffic
23 changes: 0 additions & 23 deletions instructors/lab-deployments/lab-2c-container-image/app.py

This file was deleted.

6 changes: 6 additions & 0 deletions instructors/pre-work/main.tf
@@ -204,6 +204,12 @@ module "downstream_nodes" {
done
sleep 2
${rancher2_cluster.imported-cluster[count.index].cluster_registration_token[0].insecure_command}
### Used in lab 2d
for ip in $(dig rancher.com +short)
do
iptables -I FORWARD -d $ip -j REJECT --reject-with icmp-host-prohibited
done
END
}
