
After enabling hostNetwork on DD Agent, statsd traffic still routed to the old pod's IP until client pods restart #2958

Closed
davidpellcb opened this issue Jun 17, 2024 · 10 comments


davidpellcb commented Jun 17, 2024

What happened:

Summary: turning on hostNetwork: true in our DD Agent DaemonSet spec results in a loss of application metrics, which are sent from client pods to the DD Agent pod on the same node using <node_ip>:8125. Restarting the client pod resolves the issue. The issue does not occur in a legacy K8s cluster that is not on EKS and uses Calico as its networking plugin.

Prior to issue:

DD Agent is running with hostPort: 8125, listening for statsd metrics on 8125. Clients send metrics using the dogstatsd client and set DD_AGENT_HOST by deriving the host IP at scheduling time with the downward API (valueFrom: status.hostIP). tcpdump on the client pod shows traffic going from the client pod's IP to the node's IP on port 8125. tcpdump on the DD Agent pod in this state shows that the traffic on 8125 is coming from the client pod IP and being directed to the Agent pod's IP.
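
For context, the client side is conceptually equivalent to the sketch below (illustrative only, not our actual application code; the metric name is made up): it reads DD_AGENT_HOST injected by the downward API and keeps a single long-lived UDP socket to <node_ip>:8125 for the life of the process.

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// DD_AGENT_HOST is injected via the downward API (status.hostIP),
	// so it resolves to the node IP, not the agent pod's IP.
	addr := net.JoinHostPort(os.Getenv("DD_AGENT_HOST"), "8125")

	// One long-lived "connected" UDP socket for the life of the process.
	conn, err := net.Dial("udp", addr)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for {
		// Plain dogstatsd wire format: <metric>:<value>|<type>
		fmt.Fprintf(conn, "myapp.heartbeat:1|c")
		time.Sleep(10 * time.Second)
	}
}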

After changing DD Agent to hostNetwork: true:

The DD Agent pod IP is now the same as the node IP. tcpdump from the client still shows traffic targeting the node's IP. However, tcpdump on the new DD Agent pod shows traffic to port 8125 still being redirected to the old DD Agent pod's IP, which is no longer in use by anything, so the traffic is effectively going into a void.

I set up the same experiment on a legacy non-EKS cluster that is using Calico rather than VPC CNI, and was unable to reproduce this issue.

[screenshot]

My mental model is like this:

Before hostNetwork when DD Agent is just using hostPort and metrics are able to reach DD Agent from Client:

[diagram]

How routing should work after enabling hostNetwork (DD Agent has the same IP as the node):

[diagram]

(note: the above does happen after I restart the client)

What's actually happening after I enable hostNetwork (before restarting client):

[diagram]

Environment:

  • Kubernetes version (use kubectl version): v1.27.13-eks-3af4770
  • CNI Version: v1.18.1-eksbuild.3
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
@orsenthil (Member)

I noted these in the Slack conversation.

  • When the Datadog agent is launched with hostPort: 8125, it is the portmap plugin, chained with the VPC CNI, that is responsible for establishing the communication route to the agent.
  • The value of valueFrom: status.hostIP used by the statsd client should resolve to the node IP (not the Datadog pod's secondary ENI IP), and the portmap plugin forwards that traffic to the Datadog agent running on the host.
  • The VPC CNI doesn't add any iptables rules for rerouting container traffic when hostPort is involved. You can verify that in the node's iptables. Iptables rules are involved only for external traffic going outside of the VPC.

To simplify this scenario, I tried with nginx and curl test pods to see if the behavior could be reproduced and whether there is any bug.

  • Run your server (with hostPort and hostNetwork: false)
apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
spec:
  hostNetwork: false
  containers:
  - name: nginx-container
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 80
      protocol: TCP
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-28-81.us-west-2.compute.internal
  • Connect using the client
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
spec:
  containers:
  - name: curl-container
    image: curlimages/curl
    env:
    - name: HOST_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    command: ['sh', '-c', 'while true; do curl http://$HOST_IP:80; done;']
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-28-81.us-west-2.compute.internal

We can see the communication go through.
We can toggle hostNetwork from false to true and continue seeing the communication.
We can verify the iptables rules on the node; we shouldn't see anything added by the VPC CNI.
These were tested on EKS 1.29, CNI v1.16.0-eksbuild.1 (CNI portmap plugin v1.4.0), kubelet version v1.29.3-eks-ae9a62a.

@davidpellcb (Author)

I was not able to reproduce the problem with the TCP/nginx example you provided above; everything worked. However, when I did a similar experiment with a barebones statsd client and server on 8127/udp (running in a real environment, so I couldn't use the default 8125), I reproduced the issue.

Gist: https://gist.github.com/davidpellcb/48d61a23b3cf24cb38586301cbed8eb0

[screenshot]

I was also able to reproduce with a statsd client built from Go's net package directly, rather than from the Datadog dogstatsd package.
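
For anyone trying to reproduce, a bare UDP listener along these lines is enough on the server side (an illustrative sketch, not the exact code from the gist):

package main

import (
	"fmt"
	"net"
)

func main() {
	// Bare UDP listener standing in for the agent on 8127.
	pc, err := net.ListenPacket("udp", ":8127")
	if err != nil {
		panic(err)
	}
	defer pc.Close()

	buf := make([]byte, 1500)
	for {
		n, from, err := pc.ReadFrom(buf)
		if err != nil {
			panic(err)
		}
		fmt.Printf("received %q from %s\n", buf[:n], from)
	}
}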

@orsenthil (Member)

@davidpellcb - this is an interesting observation. When hostPort / hostNetwork is used, the portmap chained plugin is the one involved here. I will check what is happening.

@davidpellcb (Author)

@orsenthil any chance you were able to reproduce with UDP?


orsenthil commented Jun 28, 2024

hello @davidpellcb; I think I have an idea of what is happening.

  • When the server pod is launched with hostPort: 8127 but hostNetwork: false, it gets its own network namespace and IP address, and when the client connects to node-ip:8127, Kubernetes routes the traffic from the node's network namespace to the pod's network namespace.
  • When you toggle the pod to hostNetwork: true, the pod shares the network namespace of the host. The namespace and interface on which the server is now listening have changed, even though the client is connecting to the same IP address and port.

So, for the client that was connecting to node-ip:8127, the connection is essentially stale now unless it re-establishes it; that's why a restart works.

If you run the server pod from your example, but use a simple client like this:

apiVersion: v1
kind: Pod
metadata:
  name: udp-sender-pod
spec:
  containers:
  - name: netcat
    image: alpine
    env:
    - name: UDP_SERVER_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    command: ["sh", "-c", "while true; do echo 'test udp connection..' | nc -u -w2 $UDP_SERVER_IP 8127; done;"]
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-55-96.us-west-2.compute.internal

You will notice that the client is able to communicate when the server pod toggles from hostNetwork: false to hostNetwork: true. That's because each nc invocation creates a new connection, so we don't see the stale-connection issue. That's probably the reason the earlier curl command with HTTP over TCP worked too. It has nothing to do with TCP vs. UDP.

With the Go program, or the Datadog dogstatsd example, the connection is created only once, and the client holds on to that stale connection to the old server endpoint. If we recreate the connection internally when sending fails, it will auto-recover and behave like the netcat pod above.
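
As a rough sketch of that recovery (illustrative only, not the dogstatsd implementation; the UDP_SERVER_IP env var matches the netcat pod above):

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// UDP_SERVER_IP comes from the downward API (status.hostIP), as in the pod spec above.
	addr := net.JoinHostPort(os.Getenv("UDP_SERVER_IP"), "8127")

	conn, err := net.Dial("udp", addr)
	if err != nil {
		panic(err)
	}

	for {
		if _, err := fmt.Fprintf(conn, "test udp connection..\n"); err != nil {
			// A failed send means the socket may be stale: drop it and
			// dial a fresh one instead of holding on to the old connection.
			conn.Close()
			if conn, err = net.Dial("udp", addr); err != nil {
				panic(err)
			}
			continue
		}
		time.Sleep(2 * time.Second)
	}
}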

Yeah, toggling the server pod from hostNetwork: false to hostNetwork: true is never supposed to work transparently, unless the client handles the connection interruption internally and reconnects. That's my explanation of this behavior.

@orsenthil (Member)

Resolving this issue with the explanation provided above.


github-actions bot commented Jul 3, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

@nabeelpaytrix

I'm facing an issue where the hostPort property on any of my pods is ignored (I have only tested on pods spawned by a DaemonSet).

My Datadog DaemonSet has the following ports set up:

    ports:
    - containerPort: 8126
      hostPort: 8126
      name: traceport
      protocol: TCP

And I have the VPC CNI plugin v1.18.3-eksbuild.2 installed in my v1.29 EKS cluster. The hostPort property is completely ignored; the host does not listen on this port. Do I need to explicitly enable hostPort recognition within the VPC CNI plugin?

@orsenthil (Member)

"The hostPort property is completely ignored"

This shouldn't happen. There is no special setting for hostPort in the CNI. In the original issue, the problem was a stale connection that required a restart of the Datadog DaemonSet. If yours is a different issue, please file a new one with the details.

@nabeelpaytrix

Created a separate issue here: #3079
