Windows pods are not reachable on a hybrid Linux/Windows cluster #103
The issue is that flannel is starting before the external network is created. The workaround is to restart the flannel pod. The network is created and then flannel is started here: sig-windows-tools/kubeadm/flannel/flannel-overlay.yml Lines 34 to 36 in 1f4abb2
The external network doesn't finish creating before flannel is started, causing the bad loop.
Do we restart those pods by running those commands above, or by deleting the pods and having them restart?
They run in a DaemonSet, so you can delete the pods and the DaemonSet will recreate them.
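For reference, a sketch of restarting the DaemonSet-managed flannel pods. The DaemonSet name is taken from the pod names quoted later in this thread, and the app=flannel label is an assumption from the stock flannel manifests; verify both against your cluster first.

```shell
# Sketch only -- confirm the DaemonSet name and labels on your cluster.
kubectl -n kube-system get ds                     # confirm the DaemonSet name
kubectl -n kube-system delete pod -l app=flannel  # pods are recreated by the DaemonSet

# On kubectl v1.15+, an equivalent one-step restart:
kubectl -n kube-system rollout restart daemonset kube-flannel-ds-windows-amd64
```

These commands require a live cluster, so treat them as an operational recipe rather than something to run blindly.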
So after trying this, it seems like we still have the same core issue. The simple Windows web apps running on the Windows node are not reachable from the Linux machines (curl the cluster IP, curl the svc IP). The Linux apps on Linux nodes are reachable. The Windows container on the Windows node can be reached with the Windows node IP:port. The new info in the logs is this:

```
Mode                LastWriteTime         Length Name
d-----        10/9/2020  11:38 AM                flannel
d-----        10/9/2020  10:46 AM                serviceaccount
I1009 12:34:16.131480    7100 main.go:518] Determining IP address of default interface
```

Should I attempt to add the node again? Not sure what to do next.
One thing I noticed is that our operating system is 1809 and we have Kubernetes 1.19.2. Do we need Kubernetes 1.18 to make this work?
So we tried to make sure the version of our OS (1809) did not include the September patch that has been shown to cause these issues, as identified in microsoft/Windows-Containers#61. After setting this up we restarted the Windows node and made sure the pods were back to running, and we still have the same issues as before. Our OS version is now Microsoft Windows [Version 10.0.17763.1397] and the logs still show the same issues. We can't access the simple web apps on the Windows node through the service IP or cluster IP, but we can get to the running containers on the Windows node using NodeIP:port. The Windows node is running and all the pods are running. Linux web apps on Linux nodes are accessible.

Here are the contents of the log files from c:\var\logs\kubelet on the Windows node: kubelet.exe.AABRW-KUBER03.OLH_AABRW-KUBER03$.log.ERROR.20201014-120556.zip

Results of kubectl -n kube-system logs kube-flannel-ds-windows-amd64-fw8k9:

```
Mode                LastWriteTime         Length Name
d-----       10/13/2020  11:53 AM                serviceaccount
I1014 12:07:44.490869   10480 main.go:518] Determining IP address of default interface
```
I rolled back my servers (on my test cluster) to 10.0.17763.1294 and can say that all Windows pods are reachable after a reboot.
k8s-ws1809-c will wait for the fix without a reboot, so I can check that the fix is working.
Another related piece of info. With this svc (Windows app on the Windows node):

```
clientportal   LoadBalancer   10.110.61.103   10.243.0.39   80:30875/TCP   8d
```

I am not able to curl 10.243.0.39 from the Linux master, but I CAN curl it from the Linux worker. clientportal is one of the apps running on the Windows node. With this service (web app on the Linux master):

```
frontend   LoadBalancer   10.110.169.97   10.243.0.36   80:30889/TCP   32d
```

In another similar scenario, a web app (frontend) running on the Linux master is reachable from the Linux master node with curl 10.243.0.36.
I did some digging on the error messages above and I think there are a few things happening:

HNS issue
This is happening because you are passing an adapter name that contains a space (you replaced Ethernet with Ethernet0 2 as described in the documentation).
The issue is that wins.exe splits arguments on spaces: https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/client/process/run.go#L75. This is what causes the "An adapter was not found" error.

Flannel network creation and "Failed to list" errors
The creation of the external network by HNS isn't strictly needed since this PR in flannel went in. This is why things start to work after restarting the flannel pod: flannel creates the network attached to the correct adapter. It seems there is a timing issue with flannel creating the network, though:
Here we see it creating the network (flannel.4096) and starting to list the Kubernetes nodes via the golang client. The flannel network takes some time to create and causes a network hiccup when the VM switch is first created (see this comment). The network blip causes the connection to the apiserver to get into a bad cached state, as described in flannel-io/flannel#1272.

Workarounds
By creating the external network first via HNS, you avoid this issue completely, because there is no network disconnect during the flannel network creation. One option is to create the external network during node setup, before deploying flannel; this should resolve the issue.

Fixing this in the Docker image requires some extra work, since wins.exe looks to be no longer taking issues (issues appear to be disabled on the repository). The workarounds for the arguments being split are not too elegant: either encode the space and decode it in the setup binary, or pass the value to the setup binary via a file. I have a working version which I will clean up some more before submitting a PR: https://github.com/kubernetes-sigs/sig-windows-tools/compare/master...jsturtevant:wait-for-network?expand=1

Ultimately the fix should go into flannel, to reset connections properly or to wait until the network is fully stable. There is a long-standing open issue in the golang Kubernetes client, blocked on this, that could potentially fix the issue as well: kubernetes/client-go#374
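The argument splitting described above can be illustrated with ordinary shell word-splitting, which behaves much like the strings.Split call in wins.exe. This is only an analogy of the bug, not wins.exe itself:

```shell
# Analogy only: unquoted expansion word-splits on spaces, much like
# strings.Split(..., " ") in wins.exe's run.go. The single adapter name
# "Ethernet0 2" arrives as two separate arguments.
adapter="Ethernet0 2"
set -- $adapter      # intentionally unquoted to demonstrate the splitting
echo "argc=$#"       # prints argc=2: two arguments instead of one
echo "arg1=$1"       # Ethernet0
echo "arg2=$2"       # 2
```

This is why any adapter whose name contains a space (the Windows default for a second NIC, e.g. "Ethernet0 2") trips the "adapter was not found" error.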
/assign
Looks like the creation of the external network beforehand was trying to solve this issue, but the issue with wins args not parsing properly keeps it around: #37
I got connected with folks who work on wins.exe and they are working on a fix. Will open a PR to update this once we have a new package. FYI, for future wins issues, from the Slack conversation:
So we have the latest October CU installed now and we still have the same issue as described above.
The Windows update was not the issue here. It is flannel-io/flannel#1272. Until flannel is fixed, wins.exe needs to be updated to be able to create external networks on adapters that have spaces in their names, like Ethernet0 2. The workaround until wins.exe or flannel is fixed is to manually create the external network before starting flannel.
Are there any instructions on how to create the external network before starting flannel? I did try renaming the Ethernet adapter to just Ethernet2 and that didn't seem to fix the issue. I noticed that the wins commands that are run refer to a setup.exe or flanneld.exe, which don't exist in the /k/flannel folder on the Windows node.
@llyons the setup.exe source is here: sig-windows-tools/kubeadm/flannel/setup.go Lines 13 to 17 in efb98c3, and it includes the PowerShell to create the external network.
Note that it will fail if flannel already created a network on the NIC, in which case you will need to remove the flannel network.
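Since the snippet referenced above is not reproduced in the scraped thread, here is a sketch of the kind of PowerShell involved, modeled on the New-HNSNetwork command quoted later in this thread. The address prefix, gateway, VSID, and adapter name are placeholders you must adapt to your cluster:

```powershell
# Sketch only -- all values below are placeholders, not recommendations.
# hns.psm1 ships with the flannel setup (see C:\k\flannel\hns.psm1 in the
# logs quoted in this thread).
Import-Module C:\k\flannel\hns.psm1
New-HNSNetwork -Type Overlay -AddressPrefix "192.168.0.0/16" -Gateway "192.168.0.1" `
    -Name "External" -AdapterName "Ethernet0 2" `
    -SubnetPolicies @(@{ Type = "VSID"; VSID = 9999 })
```

Run this on the Windows node before flannel starts, so flannel attaches to an already-created vSwitch rather than triggering the network blip described above.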
If we set up our CIDR as 192.168.0.0/16, should we (first removing the existing network) then run those PowerShell commands to set up a network in the range 192.168.0.0/16 with a gateway of 192.168.0.1? Also, if it's already set up, do we want to remove the flannel.4096 network or the External one? It looks like we have External (tied to Ethernet2), nat, and flannel.4096. And is this network that we are creating part of the actual Ethernet network allowing connectivity, or is this a new network being created? If I try to run the above New-HNSNetwork command on the existing Ethernet2 adapter, it says it already exists.
Which CIDR are you referring to? I think you will want your node/pod CIDR to not overlap with the external network's CIDR. My understanding is that this external network is for creating the vSwitch, which enables network connectivity via the adapter. It isn't really needed except for the bug in flannel, flannel-io/flannel#1272. @ksubrmnn might be able to explain better.
Yes, this should be done on a fresh node, or you will need to clean up all the different networks that might have been created.
I might not have said this properly above. Any help or guidance on this would be appreciated @ksubrmnn. If I have this return from Get-HNSNetwork, and then this image for my current adapters, how would I want to proceed? Assuming I have the cluster CIDR of 192.168.0.0/16 and MetalLB also serving up IP addresses from a pool, would I do this?

```
New-HNSNetwork -Type Overlay -AddressPrefix "192.168.0.0/16" -Gateway "192.168.0.1" -Name "External" -AdapterName "Ethernet" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });
```

I am pretty desperate to get the Windows portion working. Remember, we can curl the Windows service IP and the cluster IP of a pod on the Windows node from a Linux node, but we can't get that service IP or cluster IP exposed outside of the two Linux cluster nodes. Apps running on the Linux nodes are exposed and do render outside of the cluster using the service IP.
We were able to determine that in our configuration, MetalLB, which provides an IP from a pool, did not have a speaker on the Windows node, which prevented the provisioned IP from being accessible from outside. Instead we put the Windows containers on the Windows node behind an ingress resource and set up an ingress service of type LoadBalancer to handle this. So in essence we are getting access through the ingress controllers running on the Linux portion of the cluster.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle frozen |
Hi,
We have a custom "bare metal" K8s cluster with 2 Linux nodes and 1 Windows node. Just trying to get familiar with how it works, etc. We have the nginx ingress controller running and the MetalLB load balancer running. Both the ingress and load-balancer controllers only run on the Linux nodes and master. They don't run on the Windows node (I was told this wasn't needed). We have deployed a number of Linux containers running on the Linux nodes and they work. The 2 Windows containers we have running on the Windows node start and are running, but are not reachable on the cluster IP or the provisioned service IP. The containers are running on the Windows node, since I can see them and even docker exec -it into them. Doing a kubectl get svc, I have these values for the clientportal app:
```
clientportal   LoadBalancer   10.110.61.103   10.243.0.39   80:30875/TCP   21h   app=clientportal
```
I am able to get to the app with http://windows-node-ip:30875,
but I can't get to the app like this: http://10.243.0.39.
I can't curl the app from one of the Linux nodes using the cluster IP or the service IP. I can with the actual node IP.
I do notice in the /var/log/kubelet log file on the Windows node some errors like this.
kubectl get nodes shows the Windows node is Ready.
kubectl get pods shows running pods on the Windows node.
docker container ls on the Windows node shows the containers are running and scheduled.
We did upgrade to 1.19.2 of kubelet and kubeadm, but it looks like we have had this issue for some time.
C:\k>kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:38:53Z", GoVersion:"go1.15", Compiler:"gc", Platform:"windows/amd64"}
Attached are some of the kubelet log files.
Also the logs from kubectl -n kube-system logs kube-flannel-ds-windows-amd64-vrsg9:
```
Mode                LastWriteTime         Length Name
d-----        8/20/2020   4:35 PM                serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16
I1008 10:46:56.667299    8008 main.go:518] Determining IP address of default interface
I1008 10:46:57.507782    8008 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1008 10:46:57.507782    8008 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1008 10:46:57.545795    8008 kube.go:119] Waiting 10m0s for node controller to sync
I1008 10:46:57.545795    8008 kube.go:306] Starting kube subnet manager
I1008 10:46:58.551449    8008 kube.go:126] Node controller sync successful
I1008 10:46:58.551449    8008 main.go:246] Created subnet manager: Kubernetes Subnet Manager - aabrw-kuber03
I1008 10:46:58.551449    8008 main.go:249] Installing signal handlers
I1008 10:46:58.551449    8008 main.go:390] Found network config - Backend type: vxlan
I1008 10:46:58.551449    8008 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false DirectRouting=false
I1008 10:46:58.619205    8008 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 123 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1008 10:46:59.972661    8008 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50491->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1008 10:46:59.973662    8008 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get https://10.243.1.212:6443/api/v1/nodes?resourceVersion=10547296&timeoutSeconds=582&watch=true: http2: no cached connection was available
E1008 10:47:01.036947    8008 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
```
kubelet.exe.logs.zip
Any feedback or guidance would be appreciated.