After enabling hostNetwork on DD Agent, statsd traffic still routed to the old pod's IP until client pods restart #2958
I noted these in the Slack conversation. To simplify this scenario, I tried with nginx and curl test pods to see whether the behavior could be reproduced and whether there is any bug. We can see the communication go through.
I was not able to reproduce the problem with the TCP/nginx example you provided above - everything worked. However, when I did a similar experiment with a barebones statsd client and server on 8127/udp (running in a real environment, so I couldn't use the default 8125), I reproduced the issue. Gist: https://gist.github.com/davidpellcb/48d61a23b3cf24cb38586301cbed8eb0

I was also able to reproduce with a statsd client built from Go's …
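For context, a barebones UDP server along the lines of that reproduction could look like the sketch below. This is not the code from the gist; the buffer size and log messages are illustrative assumptions.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen for raw statsd-style datagrams on 8127/udp
	// (8125 was already taken in the real environment).
	pc, err := net.ListenPacket("udp", ":8127")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	defer pc.Close()

	buf := make([]byte, 2048)
	for {
		n, addr, err := pc.ReadFrom(buf)
		if err != nil {
			log.Printf("read error: %v", err)
			continue
		}
		// Print each datagram and its sender, so it is easy to see whether
		// traffic still arrives after the server pod toggles hostNetwork.
		log.Printf("from %s: %s", addr, buf[:n])
	}
}
```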
@davidpellcb - this is an interesting observation. When …
@orsenthil any chance you were able to reproduce with UDP?
Hello @davidpellcb; I think I have an idea of what is happening.

So, for the client that was connecting to node-ip:8127, the connection is now essentially stale unless the client re-establishes it; that's why a restart works. If you run the server pod from your example but use a simple netcat client that sends one datagram per invocation, you will notice that the client is able to communicate when the server pod toggles from hostNetwork: false to hostNetwork: true. That's because each nc invocation creates a new connection, so we don't see the stale-connection issue; that's probably also why the earlier nginx/curl test worked. With the Go program, or the Datadog statsd example, we create the connection only once, and it holds on to the stale connection. If we recreate the connection internally whenever sending fails, it will auto-recover and behave like the netcat pod above.

Yeah, the toggle of the server pod from hostNetwork: false to hostNetwork: true is never supposed to work transparently unless the client handles connection interruption internally and reconnects. That's my explanation of this behavior.
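To make the reconnect-on-failure idea concrete, here is a minimal sketch (my own, not code from this thread) of a UDP sender that recreates its connection whenever a write reports an error. The STATSD_ADDR variable and the metric name are assumptions for the example.

```go
package main

import (
	"log"
	"net"
	"os"
	"time"
)

// dial opens a fresh "connected" UDP socket. A new socket means a new source
// port, so traffic is no longer pinned to the stale destination that the old
// socket was attached to.
func dial(addr string) (net.Conn, error) {
	return net.Dial("udp", addr)
}

func main() {
	addr := os.Getenv("STATSD_ADDR") // e.g. "<node-ip>:8127"; hypothetical variable for this sketch
	conn, err := dial(addr)
	if err != nil {
		log.Fatalf("initial dial: %v", err)
	}

	for {
		if _, err := conn.Write([]byte("example.heartbeat:1|c")); err != nil {
			// UDP writes only fail when the kernel has something to report
			// (e.g. an ICMP port-unreachable), so a production client might
			// also refresh the socket periodically rather than only on error.
			log.Printf("send failed, reconnecting: %v", err)
			conn.Close()
			if newConn, derr := dial(addr); derr != nil {
				log.Printf("redial failed: %v", derr)
			} else {
				conn = newConn
			}
		}
		time.Sleep(time.Second)
	}
}
```

Each nc run in the netcat example effectively does the same thing: every invocation is a brand-new socket.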
Resolving this issue with the explanation provided above.

This issue is now closed. Comments on closed issues are hard for our team to see.
I'm facing an issue where … My Datadog DaemonSet has the below ports setup:

And I have the VPC CNI plugin v1.18.3-eksbuild.2 installed in my cluster.

Created a separate issue here: #3079
What happened:
Summary: turning on hostNetwork: true in our DD Agent DaemonSet spec results in loss of application metrics, which are sent from client pods to the DD Agent pod on the same node using <node_ip>:8125. Restarting the client pod resolves the issue. The issue does not occur in a legacy K8s cluster that is not on EKS and is using Calico as its networking plugin.

Prior to issue:

DD Agent is running with hostPort: 8125, listening for statsd metrics on 8125. Clients send metrics using the dogstatsd client and set DD_AGENT_HOST by deriving the host IP during scheduling using the downward API (valueFrom: status.hostIP). tcpdump on the client pod shows traffic going from the client pod's IP to the node's IP on port 8125. tcpdump on the DD Agent pod in this state shows that the traffic on 8125 is coming from the client pod IP and being directed to the Agent pod's IP.
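As a rough illustration of the client side described above, here is a minimal stand-in (not the actual dogstatsd library; the metric name and error handling are assumptions): it reads DD_AGENT_HOST, which the pod spec populates from status.hostIP via the downward API, and sends a statsd-format datagram to the node IP on 8125/udp.

```go
package main

import (
	"log"
	"net"
	"os"
)

func main() {
	// DD_AGENT_HOST is injected into the client pod via the downward API
	// (valueFrom: fieldRef: status.hostIP), so it resolves to the node's IP.
	host := os.Getenv("DD_AGENT_HOST")
	if host == "" {
		log.Fatal("DD_AGENT_HOST is not set")
	}

	// One long-lived "connected" UDP socket, similar to what a statsd client keeps open.
	conn, err := net.Dial("udp", net.JoinHostPort(host, "8125"))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// A single counter in the plain statsd wire format: <name>:<value>|c
	if _, err := conn.Write([]byte("example.requests:1|c")); err != nil {
		log.Printf("send failed: %v", err)
	}
}
```

Because the socket is created once and kept open, it matches the single-connection behavior discussed in the comments above.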
After changing DD Agent to hostNetwork: true:

DD Agent pod IP is now the same as the node IP. tcpdump from the client still shows traffic targeting the node's IP. However, tcpdump on the new DD Agent pod shows traffic to port 8125 still being redirected to the old DD Agent pod's IP, which is no longer in use by anything, so the traffic is effectively going into a void.
I set up the same experiment on a legacy non-EKS cluster that is using Calico rather than VPC CNI, and was unable to reproduce this issue.
My mental model is like this:
Before hostNetwork, when DD Agent is just using hostPort and metrics are able to reach DD Agent from the client:

How routing should work after enabling hostNetwork (DD Agent has the same IP as the node):

(note: the above does happen after I restart the client)
What's actually happening after I enable hostNetwork (before restarting the client):

Environment:
- Kubernetes version (kubectl version): v1.27.13-eks-3af4770
- CNI version: v1.18.1-eksbuild.3
- OS (cat /etc/os-release):
- Kernel (uname -a):