
After enabling hostNetwork on DD Agent, statsd traffic still routed to the old pod's IP until client pods restart #2958

Closed
davidpellcb opened this issue Jun 17, 2024 · 10 comments


davidpellcb commented Jun 17, 2024

What happened:

Summary: turning on hostNetwork: true in our DD Agent DaemonSet spec results in a loss of application metrics, which are sent from client pods to the DD Agent pod on the same node using <node_ip>:8125. Restarting the client pod resolves the issue. The issue does not occur in a legacy K8s cluster that is not on EKS and uses Calico as its networking plugin.

Prior to issue:

DD Agent is running with hostPort: 8125, listening for statsd metrics on 8125. Clients send metrics using the dogstatsd client and set DD_AGENT_HOST by deriving the host IP at scheduling time with the downward API (valueFrom: status.hostIP). tcpdump on the client pod shows traffic going from the client pod's IP to the node's IP on port 8125. tcpdump on the DD Agent pod in this state shows that the traffic on 8125 is coming from the client pod IP and being directed to the Agent pod's IP.
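
For context, the client side is conceptually equivalent to the sketch below (illustrative only, not our actual application code; the metric name is made up): it reads DD_AGENT_HOST injected by the downward API and keeps a single long-lived UDP socket to <node_ip>:8125 for the life of the process.

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// DD_AGENT_HOST is injected via the downward API (status.hostIP),
	// so it resolves to the node IP, not the agent pod's IP.
	addr := net.JoinHostPort(os.Getenv("DD_AGENT_HOST"), "8125")

	// One long-lived "connected" UDP socket for the life of the process.
	conn, err := net.Dial("udp", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for {
		// Plain dogstatsd wire format: <metric>:<value>|<type>
		fmt.Fprintf(conn, "myapp.heartbeat:1|c")
		time.Sleep(10 * time.Second)
	}
}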

After changing DD Agent to hostNetwork: true:

The DD Agent pod IP is now the same as the node IP. tcpdump from the client still shows traffic targeting the node's IP. However, tcpdump on the new DD Agent pod shows traffic to port 8125 still being redirected to the old DD Agent pod's IP, which is no longer in use by anything, so the traffic is effectively going into a void.

I set up the same experiment on a legacy non-EKS cluster that is using Calico rather than VPC CNI, and was unable to reproduce this issue.

[screenshot]

My mental model is like this:

Before hostNetwork when DD Agent is just using hostPort and metrics are able to reach DD Agent from Client:

[diagram]

How routing should work after enabling hostNetwork (DD Agent has the same IP as the node):

[diagram]

(note: the above does happen after I restart the client)

What's actually happening after I enable hostNetwork (before restarting client):

[diagram]

Environment:

  • Kubernetes version (use kubectl version): v1.27.13-eks-3af4770
  • CNI Version: v1.18.1-eksbuild.3
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
@orsenthil (Member)

I noted these in the Slack conversation.

  • When the Datadog agent is launched with hostPort: 8125, it is the portmap plugin, chained with the VPC CNI, that is responsible for establishing the communication route to the agent.
  • The value of valueFrom: status.hostIP used by the statsd client should resolve to the node IP (not the Datadog pod's secondary ENI IP), and the portmap plugin forwards that traffic to the Datadog agent running on the host.
  • The VPC CNI doesn't add any iptables rules for rerouting container traffic when hostPort is involved. You can verify that in the node's iptables. Iptables rules are involved only for external traffic going outside of the VPC.

To simplify this scenario, I tried with nginx and curl test pods to see if the behavior could be reproduced and whether there is any bug.

  • Run your server (with hostPort and hostNetwork: false)
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  hostNetwork: false
  containers:
  - name: nginx-container
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 80
      protocol: TCP
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-28-81.us-west-2.compute.internal
  • Connect using the client
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
spec:
  containers:
  - name: curl-container
    image: curlimages/curl
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    command: ['sh', '-c', 'while true; do curl http://$HOST_IP:80; done;']
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-28-81.us-west-2.compute.internal

We can see the communication go through.
We can toggle hostNetwork from false to true and continue seeing the communication.
We can verify the iptables rules on the node; we shouldn't see anything added by the VPC CNI.
These were tested on EKS 1.29, CNI v1.16.0-eksbuild.1 (CNI portmap plugin v1.4.0), kubelet version v1.29.3-eks-ae9a62a.

@davidpellcb (Author)

I was not able to reproduce the problem with the TCP/nginx example you provided above; everything worked. However, when I did a similar experiment with a barebones statsd client and server on 8127/udp (running in a real environment, so I couldn't use the default 8125), I reproduced the issue.

Gist: https://gist.github.com/davidpellcb/48d61a23b3cf24cb38586301cbed8eb0

[screenshot]

I was also able to reproduce with a statsd client built from Go's net package directly, rather than from the Datadog dogstatsd package.
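
For anyone trying to reproduce, a bare UDP listener along these lines is enough on the server side (an illustrative sketch, not the exact code from the gist):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Bare UDP listener standing in for the agent on 8127.
	pc, err := net.ListenPacket("udp", ":8127")
	if err != nil {
		panic(err)
	}
	defer pc.Close()

	buf := make([]byte, 1500)
	for {
		n, from, err := pc.ReadFrom(buf)
		if err != nil {
			panic(err)
		}
		fmt.Printf("received %q from %s\n", buf[:n], from)
	}
}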

@orsenthil (Member)

@davidpellcb - this is an interesting observation. When hostPort / hostNetwork is used, the portmap chained plugin is the one involved here. I will check what is happening.

@davidpellcb (Author)

@orsenthil any chance you were able to reproduce with UDP?


orsenthil commented Jun 28, 2024

hello @davidpellcb; I think I have an idea of what is happening.

  • When the server pod is launched with hostPort: 8127 but hostNetwork: false, it gets its own network namespace and IP address, and when the client connects to node-ip:8127, Kubernetes routes the traffic from the node's network namespace to the pod's network namespace.
  • When you toggle the pod to hostNetwork: true, the pod shares the network namespace of the host. The namespace and interface on which the server is now listening have changed, even though the client is connecting to the same IP address and port.

So, for the client that was connecting to node-ip:8127, the connection is essentially stale now unless it re-establishes it; that's why a restart works.

If you run the server pod from your example, but use a simple client like this:

apiVersion: v1
kind: Pod
metadata:
  name: udp-sender-pod
spec:
  containers:
  - name: netcat
    image: alpine
    env:
    - name: UDP_SERVER_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    command: ["sh", "-c", "while true; do echo 'test udp connection..' | nc -u -w2 $UDP_SERVER_IP 8127; done;"]
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-55-96.us-west-2.compute.internal

You will notice that the client is able to communicate when the server pod toggles from hostNetwork: false to hostNetwork: true. That's because each nc invocation creates a new connection, so we don't see the stale-connection issue. That's probably the reason the earlier curl command with HTTP over TCP worked too. It has nothing to do with TCP vs. UDP.

With the Go program, or the Datadog dogstatsd example, the connection is created only once, and the client holds on to that stale connection to the old server endpoint. If we recreate the connection internally when sending fails, it will auto-recover and behave like the netcat pod above.
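
As a rough sketch of that recovery (illustrative only, not the dogstatsd implementation; the UDP_SERVER_IP env var matches the netcat pod above):

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// UDP_SERVER_IP comes from the downward API (status.hostIP), as in the pod spec above.
	addr := net.JoinHostPort(os.Getenv("UDP_SERVER_IP"), "8127")

	conn, err := net.Dial("udp", addr)
	if err != nil {
		panic(err)
	}

	for {
		if _, err := fmt.Fprintf(conn, "test udp connection..\n"); err != nil {
			// A failed send means the socket may be stale: drop it and
			// dial a fresh one instead of holding on to the old connection.
			conn.Close()
			if conn, err = net.Dial("udp", addr); err != nil {
				panic(err)
			}
			continue
		}
		time.Sleep(2 * time.Second)
	}
}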

Yeah, toggling the server pod from hostNetwork: false to hostNetwork: true is never supposed to work transparently, unless the client handles the connection interruption internally and reconnects. That's my explanation of this behavior.

@orsenthil (Member)

Resolving this issue with the explanation provided above.


github-actions bot commented Jul 3, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

@nabeelpaytrix

I'm facing an issue where the hostPort property on any of my pods is ignored (I have only tested on pods spawned by a DaemonSet).

My Datadog DaemonSet has the following ports set up:

    ports:
    - containerPort: 8126
      hostPort: 8126
      name: traceport
      protocol: TCP

And I have the VPC CNI plugin v1.18.3-eksbuild.2 installed in my v1.29 EKS cluster. The hostPort property is completely ignored; the host does not listen on this port. Do I need to explicitly enable hostPort recognition within the VPC CNI plugin?

@orsenthil (Member)

"The hostPort property is completely ignored"

This shouldn't happen. There is no special setting for hostPort in the CNI. In the original issue, the problem was a stale connection that required a restart of the Datadog DaemonSet. If yours is a different issue, please file a new one with the details.

@nabeelpaytrix

Created a separate issue here: #3079
