
Fix externalTrafficPolicy
Signed-off-by: manuelbuil <[email protected]>
manuelbuil committed Oct 21, 2024
1 parent c0d661b commit ea9fb45
Showing 2 changed files with 217 additions and 15 deletions.
153 changes: 153 additions & 0 deletions docs/adrs/externalTrafficPolicysupport.md
@@ -0,0 +1,153 @@
# externalTrafficPolicy=local support

Date: 2024-10-21

## Status

Approved

## Context

Services of type LoadBalancer have, among other settings, an extra configuration parameter called externalTrafficPolicy. It accepts two values: cluster or local. The default is cluster, which we support; the problem is with local.

When using externalTrafficPolicy=local, three things must happen (a minimal example of such a Service is sketched right after this list):

1. The client source IP must be preserved.
2. Traffic must stay local, i.e. the service must be served by a pod running on the node where the request arrived.
3. If the request ingresses via a node where no serving pod is running, the request should fail.
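
For concreteness, this is roughly what such a Service looks like when built with the k8s.io/api types the cloud controller already uses. It mirrors the httpbin example further down (port 8000, pod port 80); the helper name is illustrative and the exact constant name for the policy may vary between k8s.io/api versions:

```go
package sketch

import (
	core "k8s.io/api/core/v1"
	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// httpbinService builds a Service of type LoadBalancer with
// externalTrafficPolicy=local, as discussed in this ADR.
func httpbinService() *core.Service {
	return &core.Service{
		ObjectMeta: meta.ObjectMeta{Name: "httpbin", Namespace: "default"},
		Spec: core.ServiceSpec{
			Type:     core.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "httpbin"},
			Ports: []core.ServicePort{{
				Port:       8000,               // port exposed by the load balancer
				TargetPort: intstr.FromInt(80), // port the httpbin pods listen on
			}},
			// Local: preserve the client IP and only serve from local endpoints.
			ExternalTrafficPolicy: core.ServiceExternalTrafficPolicyLocal,
		},
	}
}
```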

Unfortunately, our current klipper-lb solution does not honor any of those requirements. I will demonstrate this below.

To make use of the klipper-lb pipelines, we must deploy K3s with a configured node-external-ip. That IP must be NATed by the cloud provider, so that traffic reaching the node does not carry that IP. This is common in public clouds like AWS or Azure, where node-external-ip is the public IP of the instance.

Imagine a deployment of K3s with three nodes:

Node1:
* internal-ip: 10.1.1.12
* external-ip: 20.224.75.242

Node2:
* internal-ip: 10.1.1.16
* external-ip: 20.224.75.227

Node3:
* internal-ip: 10.1.1.17
* external-ip: 20.13.19.83

Now we deploy an application that includes a deployment with 2 replicas, which run on Node1 and Node2. The application also includes a Kubernetes service of type LoadBalancer with externalTrafficPolicy=local. This is the service:

```
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
httpbin LoadBalancer 10.43.43.218 20.224.75.227,20.224.75.242 8000:31355/TCP 25m app=httpbin
```
The external IPs are correct, since the serving pods run on Node1 and Node2.

The app is a networking toolset; if we query the `/ip` URL, it returns the source IP it sees. If we query the three public IPs:

```
$> curl 20.224.75.227:8000/ip
{
"origin": "10.42.1.6"
}
$> curl 20.224.75.242:8000/ip
{
"origin": "10.42.0.10"
}
$> curl 20.13.19.83:8000/ip
{
"origin": "10.42.3.3"
}
```
Here we can already observe that the source IP is not preserved: the reported IP is the IP of the local svclb pod. The problem is that klipper-lb applies [MASQUERADE](https://github.com/k3s-io/klipper-lb/blob/master/entry#L72) to the traffic and thus replaces the source IP.

Moreover, we can observe that curl works on Node3, which should not happen because there is no local httpbin pod there. The reason is that our svclb implementation deploys a daemonset, i.e. one pod on each node (unless the service port is unavailable on that node). Our implementation correctly advertises only 2 external IPs for the service, but it fails to remove the networking pipelines on the other node, so traffic can still reach the service. And it can reach the service because we are not honoring the 2nd requirement: traffic must stay local.

To demonstrate that traffic does not stay local, we must dive into iptables. When traffic leaves the svclb pod, remember that it has been masqueraded, so the source IP is now the IP of the svclb pod, i.e. it is within the clusterCIDR range; this is important for an upcoming explanation. Apart from that, the destination IP and the destination port are [DNATed to the nodeIP and the nodePort](https://github.com/k3s-io/k3s/blob/master/pkg/cloudprovider/servicelb.go#L558-L569). We can see this if we run tcpdump in the svclb pod:

```
13:29:38.546244 IP 92.186.228.99.46846 > 10.42.3.3.8000: Flags [S], seq 2377127606, win 64240, options [mss 1460,sackOK,TS val 1653522051 ecr 0,nop,wscale 7], length 0
13:29:38.546268 IP 10.42.3.3.46846 > 10.1.1.17.31355: Flags [S], seq 2377127606, win 64240, options [mss 1460,sackOK,TS val 1653522051 ecr 0,nop,wscale 7], length 0
13:29:38.547651 IP 10.1.1.17.31355 > 10.42.3.3.46846: Flags [S.], seq 4157657780, ack 2377127607, win 64308, options [mss 1410,sackOK,TS val 4233562759 ecr 1653522051,nop,wscale 7], length 0
13:29:38.547659 IP 10.42.3.3.8000 > 92.186.228.99.46846: Flags [S.], seq 4157657780, ack 2377127607, win 64308, options [mss 1410,sackOK,TS val 4233562759 ecr 1653522051,nop,wscale 7], length 0
```

* `92.186.228.99` ==> the client IP (a public IP)
* `10.42.3.3` ==> the IP of the svclb pod running on Node3

In the first packet we see the request. In the second packet we see the same packet after the DNAT and the MASQUERADE. This is what happened (the wiring that produces this DNAT is sketched right after the list below):

* `92.186.228.99` became `10.42.3.3` because of the MASQUERADE. Note that `10.42.3.3` is the IP of the svclb pod's interface.
* `10.42.3.3:8000` became `10.1.1.17:31355` because of the DNAT. `10.1.1.17` is the local nodeIP and `31355` is the nodePort to reach the service, as you can observe above, right before the curl commands.
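
For reference, that DNAT target is handed to the svclb pod through the DEST_IPS and DEST_PORT environment variables. A rough sketch of the current (pre-change) wiring, mirroring the servicelb.go code touched in the diff below (the helper name and field-path argument are illustrative):

```go
package sketch

import (
	"strconv"

	core "k8s.io/api/core/v1"
)

// currentLocalTrafficEnv approximates what the cloud controller passes to the
// svclb container today: klipper-lb DNATs incoming traffic towards
// <node IP(s)>:<NodePort>, which is why the packet above re-enters the node
// as 10.1.1.17:31355.
func currentLocalTrafficEnv(port core.ServicePort, hostIPsFieldPath string) []core.EnvVar {
	return []core.EnvVar{
		{
			// NodePort of the service, e.g. 31355.
			Name:  "DEST_PORT",
			Value: strconv.Itoa(int(port.NodePort)),
		},
		{
			// Node IP(s), injected through the downward API, e.g. 10.1.1.17.
			Name: "DEST_IPS",
			ValueFrom: &core.EnvVarSource{
				FieldRef: &core.ObjectFieldSelector{
					FieldPath: hostIPsFieldPath, // e.g. "status.hostIPs"
				},
			},
		},
	}
}
```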

Because the destination IP is now the local nodeIP, the packet goes back to the root network namespace, where it hits the `KUBE-NODEPORTS` chain:
```
Chain KUBE-NODEPORTS (1 references)
target prot opt source destination
KUBE-EXT-UQMCRMJZLI3FTLDP tcp -- anywhere anywhere /* kube-system/traefik:web */ tcp dpt:32538
KUBE-EXT-CVG3OEGEH7H5P3HQ tcp -- anywhere anywhere /* kube-system/traefik:websecure */ tcp dpt:31718
KUBE-EXT-FREKB6WNWYJLKTHC tcp -- anywhere anywhere /* default/httpbin:http */ tcp dpt:31355
```

Since we are interested in port `31355`, our chain is the last one:
```
Chain KUBE-EXT-FREKB6WNWYJLKTHC (3 references)
target prot opt source destination
KUBE-SVC-FREKB6WNWYJLKTHC all -- 10.42.0.0/16 anywhere /* pod traffic for default/httpbin:http external destinations */
KUBE-MARK-MASQ all -- anywhere anywhere /* masquerade LOCAL traffic for default/httpbin:http external destinations */ ADDRTYPE match src-type LOCAL
KUBE-SVC-FREKB6WNWYJLKTHC all -- anywhere anywhere /* route LOCAL traffic for default/httpbin:http external destinations */ ADDRTYPE match src-type LOCAL
```

Remember, when we reach these rules our source IP is in the clusterCIDR range, because it is the IP of the local svclb pod. That means we hit the first rule, `KUBE-SVC-FREKB6WNWYJLKTHC`, which load balances the traffic across the two existing httpbin pods with 0.5 probability each:
```
Chain KUBE-SVC-FREKB6WNWYJLKTHC (3 references)
target prot opt source destination
KUBE-MARK-MASQ tcp -- !10.42.0.0/16 10.43.43.218 /* default/httpbin:http cluster IP */ tcp dpt:8000
KUBE-SEP-RUCF7MA7CMMKMMX3 all -- anywhere anywhere /* default/httpbin:http -> 10.42.0.8:80 */ statistic mode random probability 0.50000000000
KUBE-SEP-LZUMEJMLNJBPTY4D all -- anywhere anywhere /* default/httpbin:http -> 10.42.1.4:80 */
```

On nodes where an httpbin pod runs, for example Node2, the KUBE-EXT chain for port `31355` has an extra KUBE-SVL chain coming last, which points directly at the local pod:
```
Chain KUBE-EXT-FREKB6WNWYJLKTHC (3 references)
target prot opt source destination
KUBE-SVC-FREKB6WNWYJLKTHC all -- 10.42.0.0/16 anywhere /* pod traffic for default/httpbin:http external destinations */
KUBE-MARK-MASQ all -- anywhere anywhere /* masquerade LOCAL traffic for default/httpbin:http external destinations */ ADDRTYPE match src-type LOCAL
KUBE-SVC-FREKB6WNWYJLKTHC all -- anywhere anywhere /* route LOCAL traffic for default/httpbin:http external destinations */ ADDRTYPE match src-type LOCAL
KUBE-SVL-FREKB6WNWYJLKTHC all -- anywhere anywhere
```

However, since our source IP is in the clusterCIDR range, we still go through `KUBE-SVC-FREKB6WNWYJLKTHC`, so the traffic is distributed across all pods and the requirement that traffic must stay local is not honored.


## Solution 1

First of all, we pass an extra environment variable to the svclb pods specifying whether the traffic must be masqueraded or not. klipper-lb reads that variable and, based on it, adds or skips the MASQUERADE rule in iptables. This way, we honor the requirement that the client source IP must be preserved.

Second of all, when externalTrafficPolicy=local we stop using the nodeIP and the nodePort and instead use the regular ClusterIP and the regular Service port. As we are no longer masquerading, we hit the `KUBE-SVL` chain, which applies a DNAT pointing at the local pod, thus honoring the traffic-must-stay-local requirement. Moreover, on nodes where the pod is not running there is no `KUBE-SVL` chain, so the traffic gets dropped, honoring the last requirement.
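
A minimal sketch of how Solution 1 could wire the environment variables, assuming klipper-lb is extended to understand a MASQ variable (the variable names follow the draft diff further down; the helper itself is illustrative):

```go
package sketch

import (
	"strconv"
	"strings"

	core "k8s.io/api/core/v1"
)

// solution1Env sketches the env vars svclb would receive under Solution 1:
// DNAT to the ClusterIP and service port, and only masquerade when the
// service does not request local traffic.
func solution1Env(svc *core.Service, port core.ServicePort, localTraffic bool) []core.EnvVar {
	env := []core.EnvVar{
		{Name: "DEST_PORT", Value: strconv.Itoa(int(port.Port))},
		{Name: "DEST_IPS", Value: strings.Join(svc.Spec.ClusterIPs, ",")},
	}
	if localTraffic {
		// Skip MASQUERADE so the client source IP reaches the pod unchanged.
		env = append(env, core.EnvVar{Name: "MASQ", Value: "false"})
	} else {
		env = append(env, core.EnvVar{Name: "MASQ", Value: "true"})
	}
	return env
}
```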

The problem with this solution is that, because we are not masquerading, the return traffic never goes back through svclb. Note that svclb is where we do the DNAT nodeIP ==> serviceIP. As a consequence, the DNAT is never reversed, so we see traffic leaving the node with the service IP as source IP:

```
13:53:24.325084 IP 92.186.228.99.40512 > 10.1.1.17.8000: Flags [S], seq 864918624, win 64240, options [mss 1460,sackOK,TS val 1654947829 ecr 0,nop,wscale 7], length 0
13:53:24.325224 IP 10.43.7.136.8000 > 92.186.228.99.40512: Flags [S.], seq 3049898895, ack 864918625, win 64308, options [mss 1410,sackOK,TS val 2852268053 ecr 1654947829,nop,wscale 7], length 0
```

* `92.186.228.99` is the client IP
* `10.1.1.17` is the node IP
* `10.43.7.136` is the service IP

Packets are dropped by public clouds, as they expect the reply to come from `10.1.1.17:8000`, not from `10.43.7.136:8000`.

## Solution 2

First of all, just like in Solution 1, we pass an extra environment variable to the svclb pods specifying whether the traffic must be masqueraded or not. klipper-lb reads that variable and, based on it, adds or skips the MASQUERADE rule in iptables. This way, we honor the requirement that the client source IP must be preserved.

Second of all, when externalTrafficPolicy=local we stop using the nodeIP and the nodePort and instead point directly at the local podIP and the port implementing the service. Flannel would then masquerade the traffic coming back from the pod using the nodeIP.
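
Under the same assumptions, a sketch of the Solution 2 wiring, where klipper-lb is pointed straight at the local pod; the servicePod and serviceTargetPort labels are the ones the draft diff below stamps onto the Service, and the helper itself is illustrative:

```go
package sketch

import core "k8s.io/api/core/v1"

// solution2Env sketches the env vars svclb would receive under Solution 2:
// never masquerade, and DNAT straight to the local pod IP and the port
// implementing the service.
func solution2Env(svc *core.Service) []core.EnvVar {
	return []core.EnvVar{
		{Name: "MASQ", Value: "false"},
		{Name: "PODIP", Value: svc.Labels["servicePod"]},
		{Name: "TARGET_PORT", Value: svc.Labels["serviceTargetPort"]},
	}
}
```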

## Hack to preserve the client IP

Do not define the external IPs as node-external-ip, but still use them to curl the service. As a consequence, the kube-proxy pipelines will intercept the packet and handle it correctly.

## Decision

79 changes: 64 additions & 15 deletions pkg/cloudprovider/servicelb.go
@@ -55,7 +55,7 @@ const (
)

var (
DefaultLBImage = "rancher/klipper-lb:v0.4.9"
DefaultLBImage = "mbuilsuse/klipper-lb:testing1-20241021"
)

func (k *k3s) Register(ctx context.Context,
@@ -241,6 +241,26 @@ func (k *k3s) updateStatus(namespace, name string) error {
return nil
}

+ eps, err := k.getEndpoints(svc)
+ if err != nil {
+ return err
+ }
+
+ if len(eps) > 0 {
+ // Modify the labels
+ if svc.Labels == nil {
+ svc.Labels = make(map[string]string)
+ }
+ logrus.Infof("MANU - This is the pod: %v", eps[0].Endpoints[0].Addresses[0])
+ logrus.Infof("MANU - This is the target port: %v", svc.Spec.Ports[0].TargetPort.String())
+ svc.Labels["servicePod"] = eps[0].Endpoints[0].Addresses[0]
+ svc.Labels["serviceTargetPort"] = svc.Spec.Ports[0].TargetPort.String()
+ _, err := k.client.CoreV1().Services(namespace).Update(context.TODO(), svc, meta.UpdateOptions{})
+ if err != nil {
+ return err
+ }
+ }

previousStatus := svc.Status.LoadBalancer.DeepCopy()
newStatus, err := k.getStatus(svc)
if err != nil {
@@ -255,6 +275,20 @@ func (k *k3s) getDaemonSet(svc *core.Service) (*apps.DaemonSet, error) {
return k.daemonsetCache.Get(k.LBNamespace, generateName(svc))
}

+ // getEndpoints returns the EndpointSlice for the Service
+ func (k *k3s) getEndpoints(svc *core.Service) ([]*discovery.EndpointSlice, error) {
+ if servicehelper.RequestsOnlyLocalTraffic(svc) {
+ eps, err := k.endpointsCache.List(svc.Namespace, labels.SelectorFromSet(labels.Set{
+ discovery.LabelServiceName: svc.Name,
+ }))
+ if err != nil {
+ return nil, err
+ }
+ return eps, nil
+ }
+ return nil, nil
+ }

// getStatus returns a LoadBalancerStatus listing ingress IPs for all ready pods
// matching the selected service.
func (k *k3s) getStatus(svc *core.Service) (*core.LoadBalancerStatus, error) {
@@ -435,6 +469,7 @@ func (k *k3s) newDaemonSet(svc *core.Service) (*apps.DaemonSet, error) {
oneInt := intstr.FromInt(1)
priorityClassName := k.getPriorityClassName(svc)
localTraffic := servicehelper.RequestsOnlyLocalTraffic(svc)
logrus.Infof("MANU - This is localTraffic: %v", localTraffic)
sourceRangesSet, err := servicehelper.GetLoadBalancerSourceRanges(svc)
if err != nil {
return nil, err
@@ -554,33 +589,47 @@ func (k *k3s) newDaemonSet(svc *core.Service) (*apps.DaemonSet, error) {
}

if localTraffic {
logrus.Infof("MANU - Localtraffic is true")
logrus.Infof("MANU - This is DEST_PORT: %v and this is DEST_IPS: %v", port.NodePort, getHostIPsFieldPath())
labelPodKey := "servicePod"
labelTargetPortKey := "serviceTargetPort"
labelPodValue := svc.Labels[labelPodKey]
labelTargetPortValue := svc.Labels[labelTargetPortKey]
container.Env = append(container.Env,
core.EnvVar{
Name: "DEST_PORT",
Value: strconv.Itoa(int(port.NodePort)),
Name: "MASQ",
Value: "false",
},
core.EnvVar{
Name: "DEST_IPS",
ValueFrom: &core.EnvVarSource{
FieldRef: &core.ObjectFieldSelector{
FieldPath: getHostIPsFieldPath(),
},
},
Name: "PODIP",
Value: labelPodValue,
},
+ core.EnvVar{
+ Name: "TARGET_PORT",
+ Value: labelTargetPortValue,
+ },
)
} else {
logrus.Infof("MANU - Localtraffic is false")
container.Env = append(container.Env,
core.EnvVar{
Name: "DEST_PORT",
Value: strconv.Itoa(int(port.Port)),
},
core.EnvVar{
Name: "DEST_IPS",
Value: strings.Join(svc.Spec.ClusterIPs, ","),
Name: "MASQ",
Value: "true",
},
)
}

+ container.Env = append(container.Env,
+ core.EnvVar{
+ Name: "DEST_PORT",
+ Value: strconv.Itoa(int(port.Port)),
+ },
+ core.EnvVar{
+ Name: "DEST_IPS",
+ Value: strings.Join(svc.Spec.ClusterIPs, ","),
+ },
+ )

ds.Spec.Template.Spec.Containers = append(ds.Spec.Template.Spec.Containers, container)
}

