
Initial support of robot machine #1405

Draft · wants to merge 5 commits into base: master

Conversation

@olexiyb (Contributor) commented on Jul 13, 2024

This is a work-in-progress pull request, but at least I was able to add a bare-metal machine.
This is somewhat related to #433 and discussion #1311.

Tested scenarios

Tasks to implement:

  • support vSwitch in subnets
  • support setup of the Robot box
  • support joining the Robot box to an existing cluster as a worker

It requires understanding some important points:

  • The name of the server in the Robot interface is important; you have to set the hostname of the bare-metal machine to match it:
hostnamectl set-hostname <name>
  • You have to be very patient with the vSwitch configuration. It can take minutes (I once had to wait 20 minutes) to see that it is routed.

Steps to run:

  • Take your customer ID (e.g. K12345677) from your account; this will be hetzner_robot_user, and use your actual Robot password for the password variable
  • Use them in kube.tf:
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  hcloud_robot_password = var.hcloud_robot_password
  hcloud_robot_user     = var.hcloud_robot_user
  vswitch_id = "53981"
...
  • I was able to run with Rancher, Longhorn, v1.28.
    Once you start the cluster you will notice there are no routes, due to current limitations:
[screenshots]

Follow the Hetzner instructions to create a vSwitch and attach the bare-metal machine to it:
[screenshot]
In the cloud configuration you should see the vSwitch connected:
[screenshot]

But do not follow it blindly: be very careful with the changes you make, one mistake and it won't work.
In my case, my bare-metal machine showed the following:

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:7637 qdisc mq state UP group default qlen 1000

So I adapted it to the following script:

# create the VLAN interface for the vSwitch (VLAN ID 4000, as configured in Robot)
ip link add link enp6s0 name enp6s0.4000 type vlan id 4000
# the vSwitch requires a reduced MTU
ip link set enp6s0.4000 mtu 1400
ip link set dev enp6s0.4000 up
# this node's address inside the vSwitch subnet
ip addr add 10.5.0.2/16 dev enp6s0.4000
# route the private cluster networks through the vSwitch gateway
ip route add 10.0.0.0/8 via 10.5.0.1
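
After running the script, a quick sanity check can save time (a sketch, using the interface name and addresses from the script above):

# -d prints protocol details, including the VLAN id and MTU
ip -d link show enp6s0.4000
# the cloud networks should now be routed via the vSwitch gateway 10.5.0.1
ip route get 10.253.0.101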

Alternatively, if you use Ubuntu 20.04 or above, you can edit /etc/netplan/01-netcfg.yaml and add the following to the end:

network:
  version: 2
  renderer: networkd
  ethernets:
    enp6s0:
      addresses:
...
  vlans:
    enp6s0.4000:
      id: 4000
      link: enp6s0
      mtu: 1400
      addresses:
        - 10.5.0.2/16
      routes:
        - to: "10.0.0.0/8"
          via: "10.5.0.1"

And run

netplan generate
netplan apply
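
Since a bad netplan change on a remote bare-metal machine can lock you out, a safer variant (not part of the original steps) is netplan try, which rolls the configuration back automatically unless you confirm it:

# applies the new configuration and reverts after the timeout unless confirmed
netplan try --timeout 120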

As a result you should see:

root@kt1 ~ # ip route show
default via 65.21.46.193 dev enp6s0 proto static onlink 
10.0.0.0/8 via 10.5.0.1 dev enp6s0.4000 proto static 
10.5.0.0/16 dev enp6s0.4000 proto kernel scope link src 10.5.0.2
root@kt1 ~ # ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:7637 qdisc mq state UP group default qlen 1000
    link/ether xxxxxx  brd ff:ff:ff:ff:ff:ff
    inet xxxxxx/32 scope global enp6s0
       valid_lft forever preferred_lft forever
...
3: enp6s0.4000@enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether 9c:6b:00:26:55:47 brd ff:ff:ff:ff:ff:ff
    inet 10.5.0.2/16 brd 10.5.255.255 scope global enp6s0.4000
       valid_lft forever preferred_lft forever
    inet6 fe80::9e6b:ff:fe26:5547/64 scope link 
       valid_lft forever preferred_lft forever

Verify in the Robot configuration:
[screenshot]

Now verify that you can reach the cloud machine from your Robot machine:

root@kt1 ~ # ping 10.253.0.101
PING 10.253.0.101 (10.253.0.101) 56(84) bytes of data.
64 bytes from 10.253.0.101: icmp_seq=1 ttl=62 time=1.92 ms
64 bytes from 10.253.0.101: icmp_seq=2 ttl=62 time=0.771 ms

BE VERY PATIENT HERE! The vSwitch connection can be slow to propagate.
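
Rather than re-running the ping by hand, a small wait loop can tell you when the route finally comes up (a sketch; 10.253.0.101 is the example cloud node pinged above):

# retry every 30 s until the cloud-side node becomes reachable over the vSwitch
until ping -c 1 -W 2 10.253.0.101 > /dev/null 2>&1; do
  echo "$(date): vSwitch route not up yet, retrying in 30 s..."
  sleep 30
done
echo "vSwitch connectivity is up"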

The next important thing is that the node name = the machine name in the Robot screen = the hostname.
See:

mkdir -p /etc/rancher/k3s/
cat << EOF > /etc/rancher/k3s/config.yaml
"kubelet-arg":
- "cloud-provider=external"
- "--provider-id=hrobot://2432264"
- "volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- "kube-reserved=cpu=50m,memory=300Mi,ephemeral-storage=1Gi"
- "system-reserved=cpu=250m,memory=300Mi"
"node-ip": "10.5.0.2"
"node-label":
- "k3s_upgrade=true"
"node-name": "kt1"
"node-taint":
- "node.cilium.io/agent-not-ready:NoExecute"
"selinux": true
"server": "https://10.255.0.101:6443"
"token": "<TOKEN_FROM_CONTROL>"
EOF
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.28 INSTALL_K3S_EXEC='agent ' sh -

If everything is fine the agent joins the cluster; I even see it added to the load balancer.
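
To confirm the join from the control-plane side, something like the following can be used (a sketch; kt1 is the node name from the config above, and the agent service name comes from the standard k3s agent install):

# on a machine with kubeconfig access: kt1 should show up with INTERNAL-IP 10.5.0.2
kubectl get nodes -o wide
# on the bare-metal machine itself, if the node does not appear, check the agent logs
journalctl -u k3s-agent -f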

[screenshot]

At this moment I have not found how to solve the access problem to the Robot machine, as it does not have a private IP.
And I see this in the logs of hcloud-cloud-controller-manager:

node_controller.go:240] error syncing 'kt1': failed to get node modifiers from cloud provider: provided node ip for node "kt1" is not valid: failed to get node address from cloud provider that matches ip: 10.5.0.2, requeuing
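
For debugging, it can help to compare the addresses actually registered on the node object with the IP the CCM rejects (a sketch; kt1 is the example node name):

# addresses currently set on the node object
kubectl get node kt1 -o jsonpath='{.status.addresses}{"\n"}'
# same information in a more readable form
kubectl describe node kt1 | grep -A 5 "Addresses:"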

@olexiyb marked this pull request as draft on July 13, 2024, 21:01
@@ -1,10 +1,3 @@
data "github_release" "hetzner_ccm" {
Collaborator:

Why is this being removed? This is important, as it always grabs the latest and greatest.

@mysticaltech (Collaborator) commented:

@olexiyb This is the cleanest proposal I have seen to date. If you choose to only support cilium tunnel and nothing else and it works well, I would be happy.

Just one thing: try not to change the core functionality too much, like with the CCM. If you can build a system that switches over to the needed custom implementation when a key variable is true, like the robot-related ones, that would be great.

That way it's even easier for maintenance. And the less boilerplate code, the better. Also, please continue clarifying the docs, which is already on track.

Overall, looking very promising 🙏

@olexiyb (Contributor, Author) commented on Jul 27, 2024

@mysticaltech I have started the discussion about moving hccm to the Helm approach.
I realize this can be very tricky, as many people already use the deployment approach and it will be hard for them to switch to Helm.
I will revert my pull request to use the direct deployment for now.

@olexiyb (Contributor, Author) commented on Jul 27, 2024

At this moment I have been able to create 2 clusters (and manage them using Rancher).
They were built the identical way, but for some reason I had an issue with the ingress controller. The default externalTrafficPolicy: "Cluster" just did not work as expected, so I had to use custom values:

  nginx_values = <<EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    externalTrafficPolicy: "Local"
    annotations:
      "load-balancer.hetzner.cloud/name": "scdev-nginx"
      "load-balancer.hetzner.cloud/use-private-ip": "true"
      "load-balancer.hetzner.cloud/disable-private-ingress": "true"
      "load-balancer.hetzner.cloud/disable-public-network": "false"
      "load-balancer.hetzner.cloud/ipv6-disabled": "false"
      "load-balancer.hetzner.cloud/location": "hel1"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
      "load-balancer.hetzner.cloud/algorithm-type": "round_robin"
      "load-balancer.hetzner.cloud/health-check-interval": "15s"
      "load-balancer.hetzner.cloud/health-check-timeout": "10s"
      "load-balancer.hetzner.cloud/health-check-retries": "3"
  EOT

And now all bare metal machines properly join the load balancer!
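
To double-check which targets the Hetzner load balancer picked up, something like the following can be used (a sketch; scdev-nginx is the load balancer name from the annotation above, and the hcloud CLI needs a valid token):

# the ingress service should show the load balancer's external IP
kubectl get svc -A | grep LoadBalancer
# inspect the load balancer and its attached targets
hcloud load-balancer describe scdev-nginx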

@mysticaltech (Collaborator) commented:

Good to hear @olexiyb

@JoeyKhd commented on Aug 19, 2024

+1 for @olexiyb always
