
Initial support of robot machine #1405

Draft · wants to merge 5 commits into base: master

Conversation

@olexiyb (Contributor) commented on Jul 13, 2024

This is a work-in-progress pull request, but at least I was able to add a bare-metal machine.
This is somewhat related to #433 and discussion #1311.

Tested scenarios

Tasks to implement:

  • support vSwitch in subnets
  • support setup of the Robot box
  • support joining the Robot box to an existing cluster as a worker

It requires understanding some important points:

  • The name of the server in the Robot interface is important; you have to set the hostname of the bare-metal machine to match it:
hostnamectl set-hostname <name>
  • You have to be very patient with the vSwitch configuration. It can take minutes (I once had to wait 20 minutes) to see that it is routed.

Steps to run:

  • Take your customer ID (e.g. K12345677) from your account; this will be hetzner_robot_user, and use your actual Robot password for the password variable
  • Use them in kube.tf:
module "kube-hetzner" {
  providers = {
    hcloud = hcloud
  }
  hcloud_token = var.hcloud_token != "" ? var.hcloud_token : local.hcloud_token

  hcloud_robot_password = var.hcloud_robot_password
  hcloud_robot_user     = var.hcloud_robot_user
  vswitch_id = "53981"
...
  • I was able to run with Rancher, Longhorn, v1.28.
    Once you start the cluster you will notice there are no routes, due to current limitations:
[screenshots]

Follow the Hetzner instructions to create a vSwitch and attach the bare-metal machine to it:
[screenshot]
In the cloud configuration you should see the vSwitch connected:
[screenshot]

But do not follow it blindly: be very careful with the changes you make, one mistake and it won't work.
In my case, my bare-metal machine showed the following:

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:7637 qdisc mq state UP group default qlen 1000

So I adapted it to the following script:

# create the VLAN interface for the vSwitch (VLAN ID 4000, as configured in Robot)
ip link add link enp6s0 name enp6s0.4000 type vlan id 4000
# the vSwitch requires a reduced MTU
ip link set enp6s0.4000 mtu 1400
ip link set dev enp6s0.4000 up
# this node's address inside the vSwitch subnet
ip addr add 10.5.0.2/16 dev enp6s0.4000
# route the private cluster networks through the vSwitch gateway
ip route add 10.0.0.0/8 via 10.5.0.1
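
After running the script, a quick sanity check can save time (a sketch, using the interface name and addresses from the script above):

# -d prints protocol details, including the VLAN id and MTU
ip -d link show enp6s0.4000
# the cloud networks should now be routed via the vSwitch gateway 10.5.0.1
ip route get 10.253.0.101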

Alternatively, if you use Ubuntu 20.04 or above, you can edit /etc/netplan/01-netcfg.yaml and add the following to the end:

network:
  version: 2
  renderer: networkd
  ethernets:
    enp6s0:
      addresses:
...
  vlans:
    enp6s0.4000:
      id: 4000
      link: enp6s0
      mtu: 1400
      addresses:
        - 10.5.0.2/16
      routes:
        - to: "10.0.0.0/8"
          via: "10.5.0.1"

And run

netplan generate
netplan apply
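
Since a bad netplan change on a remote bare-metal machine can lock you out, a safer variant (not part of the original steps) is netplan try, which rolls the configuration back automatically unless you confirm it:

# applies the new configuration and reverts after the timeout unless confirmed
netplan try --timeout 120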

As a result you should see:

root@kt1 ~ # ip route show
default via 65.21.46.193 dev enp6s0 proto static onlink 
10.0.0.0/8 via 10.5.0.1 dev enp6s0.4000 proto static 
10.5.0.0/16 dev enp6s0.4000 proto kernel scope link src 10.5.0.2
root@kt1 ~ # ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:7637 qdisc mq state UP group default qlen 1000
    link/ether xxxxxx  brd ff:ff:ff:ff:ff:ff
    inet xxxxxx/32 scope global enp6s0
       valid_lft forever preferred_lft forever
...
3: enp6s0.4000@enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
    link/ether 9c:6b:00:26:55:47 brd ff:ff:ff:ff:ff:ff
    inet 10.5.0.2/16 brd 10.5.255.255 scope global enp6s0.4000
       valid_lft forever preferred_lft forever
    inet6 fe80::9e6b:ff:fe26:5547/64 scope link 
       valid_lft forever preferred_lft forever

Verify in the Robot configuration:
[screenshot]

Now verify that you can reach the cloud machine from your Robot machine:

root@kt1 ~ # ping 10.253.0.101
PING 10.253.0.101 (10.253.0.101) 56(84) bytes of data.
64 bytes from 10.253.0.101: icmp_seq=1 ttl=62 time=1.92 ms
64 bytes from 10.253.0.101: icmp_seq=2 ttl=62 time=0.771 ms

BE VERY PATIENT HERE! The vSwitch connection can be slow to propagate.
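
Rather than re-running the ping by hand, a small wait loop can tell you when the route finally comes up (a sketch; 10.253.0.101 is the example cloud node pinged above):

# retry every 30 s until the cloud-side node becomes reachable over the vSwitch
until ping -c 1 -W 2 10.253.0.101 > /dev/null 2>&1; do
  echo "$(date): vSwitch route not up yet, retrying in 30 s..."
  sleep 30
done
echo "vSwitch connectivity is up"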

The next important thing is that the node name = the machine name in the Robot screen = the hostname.
See:

mkdir -p /etc/rancher/k3s/
cat << EOF > /etc/rancher/k3s/config.yaml
"kubelet-arg":
- "cloud-provider=external"
- "--provider-id=hrobot://2432264"
- "volume-plugin-dir=/var/lib/kubelet/volumeplugins"
- "kube-reserved=cpu=50m,memory=300Mi,ephemeral-storage=1Gi"
- "system-reserved=cpu=250m,memory=300Mi"
"node-ip": "10.5.0.2"
"node-label":
- "k3s_upgrade=true"
"node-name": "kt1"
"node-taint":
- "node.cilium.io/agent-not-ready:NoExecute"
"selinux": true
"server": "https://10.255.0.101:6443"
"token": "<TOKEN_FROM_CONTROL>"
EOF
curl -sfL https://get.k3s.io | INSTALL_K3S_CHANNEL=v1.28 INSTALL_K3S_EXEC='agent ' sh -

If everything is fine the agent joins the cluster; I even see it added to the load balancer.
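
To confirm the join from the control-plane side, something like the following can be used (a sketch; kt1 is the node name from the config above, and the agent service name comes from the standard k3s agent install):

# on a machine with kubeconfig access: kt1 should show up with INTERNAL-IP 10.5.0.2
kubectl get nodes -o wide
# on the bare-metal machine itself, if the node does not appear, check the agent logs
journalctl -u k3s-agent -f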

[screenshot]

At this moment I have not found how to solve the access problem to the Robot machine, as it does not have a private IP.
And I see this in the logs of hcloud-cloud-controller-manager:

node_controller.go:240] error syncing 'kt1': failed to get node modifiers from cloud provider: provided node ip for node "kt1" is not valid: failed to get node address from cloud provider that matches ip: 10.5.0.2, requeuing
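
For debugging, it can help to compare the addresses actually registered on the node object with the IP the CCM rejects (a sketch; kt1 is the example node name):

# addresses currently set on the node object
kubectl get node kt1 -o jsonpath='{.status.addresses}{"\n"}'
# same information in a more readable form
kubectl describe node kt1 | grep -A 5 "Addresses:"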

@olexiyb marked this pull request as draft on July 13, 2024, 21:01
@@ -1,10 +1,3 @@
data "github_release" "hetzner_ccm" {
Collaborator:

Why is this being removed? This is important, as it always grabs the latest and greatest.

@mysticaltech (Collaborator) commented:

@olexiyb This is the cleanest proposal I have seen to date. If you choose to only support cilium tunnel and nothing else and it works well, I would be happy.

Just one thing: try not to change the core functionality too much, like with the CCM. If you can build a system that switches over to the needed custom implementation when a key variable is true, like the robot-related ones, that would be great.

That way it's even easier for maintenance. And the less boilerplate code, the better. Also, please continue clarifying the docs, which is already on track.

Overall, looking very promising 🙏

@olexiyb (Contributor, Author) commented on Jul 27, 2024

@mysticaltech I have started the discussion about moving hccm to the Helm approach.
I realize this can be very tricky, as many people already use the deployment approach and it will be hard for them to switch to Helm.
I will revert my pull request to use the direct deployment for now.

@olexiyb (Contributor, Author) commented on Jul 27, 2024

At this moment I have been able to create 2 clusters (and manage them using Rancher).
They were built the identical way, but for some reason I had an issue with the ingress controller. The default externalTrafficPolicy: "Cluster" just did not work as expected, so I had to use custom values:

  nginx_values = <<EOT
controller:
  watchIngressWithoutClass: "true"
  kind: "DaemonSet"
  config:
    "use-forwarded-headers": "true"
    "compute-full-forwarded-for": "true"
    "use-proxy-protocol": "true"
  service:
    externalTrafficPolicy: "Local"
    annotations:
      "load-balancer.hetzner.cloud/name": "scdev-nginx"
      "load-balancer.hetzner.cloud/use-private-ip": "true"
      "load-balancer.hetzner.cloud/disable-private-ingress": "true"
      "load-balancer.hetzner.cloud/disable-public-network": "false"
      "load-balancer.hetzner.cloud/ipv6-disabled": "false"
      "load-balancer.hetzner.cloud/location": "hel1"
      "load-balancer.hetzner.cloud/type": "lb11"
      "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
      "load-balancer.hetzner.cloud/algorithm-type": "round_robin"
      "load-balancer.hetzner.cloud/health-check-interval": "15s"
      "load-balancer.hetzner.cloud/health-check-timeout": "10s"
      "load-balancer.hetzner.cloud/health-check-retries": "3"
  EOT

And now all bare metal machines properly join the load balancer!
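
To double-check which targets the Hetzner load balancer picked up, something like the following can be used (a sketch; scdev-nginx is the load balancer name from the annotation above, and the hcloud CLI needs a valid token):

# the ingress service should show the load balancer's external IP
kubectl get svc -A | grep LoadBalancer
# inspect the load balancer and its attached targets
hcloud load-balancer describe scdev-nginx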

@mysticaltech (Collaborator) commented:

Good to hear @olexiyb

@JoeyKhd commented on Aug 19, 2024

+1 for @olexiyb always
