Skip to content

Latest commit

 

History

History
236 lines (178 loc) · 8.55 KB

README.md

File metadata and controls

236 lines (178 loc) · 8.55 KB

home-ops

This repo contains all the stuff around my home setup

Hacks

These section is mostly notes for myself, I've put them at the top so they stick out.

  • I needed this patch to the latest Nvidia 470 branch drivers to get them to build for kernel 6.10
  • Only ROCM 6.2 (latest) would build for kernel 6.10 (sort of)

Services

It varies over time, but the hosted services are more less:

  • Home Assistant (running on the official Odroid or whatever it is appliance)
  • MQTT broker
  • Various bespoke home automation code for my hacky projects (sprinklers, aircons, heater, LED string lights)
  • CCTV cameras presently using initialed85/cameranator but soon initialed85/camry
  • SMB server and file browser for the scanner
  • Minecraft server
  • Quake with WebSocket multiplayer server
  • Various other projects / games that I've written

Nodes

I've got a mixture of old x86 laptops, one desktop, one server and two Raspberry Pis- some of the nodes have GPUs.

  • OS and drivers
  • High-availability
    • Keepalived (provide a single presence for the nodes to my router)
  • Storage
    • Dedicated ZFS / NFS file server
  • Deployment infrastructure
    • Kubernetes (K3s)

You may wonder- why the high-availability approach yet the single point of failure of the file server?

That boils down to the following:

  • Prior versions of my home cluster used Ceph and then Longhorn (for clustered storage) a few times, all were fine until they failed catastrophically and (more or less) unrecoverably
  • The server I've picked as the file server has shown to be the most reliable node
  • Prior versions of my home cluster with a single master node caused too many outages when that node crashed (so with HA, my pods will reschedule)

High availability

My router passes on any of the ports I want to expose to 192.168.137.10 which is a virtual IP, handled by Keepalived.

My Keepalived config looks (approximately) like this:

vrrp_instance VI_1 {
    interface eth0
    state MASTER  # only one node is MASTER, the rest are BACKUP
    virtual_router_id 51  # all nodes have the same virtual router ID
    priority 100  # the master has the numerically highest priority, all nodes have unique priorities for determinism
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass some-password
    }
    unicast_peer {
        # 192.168.137.34  # don't have yourself as a peer
        192.168.137.30
        192.168.137.27
        192.168.137.28
        192.168.137.29
    }
    virtual_ipaddress {
        192.168.137.10
    }
}

Storage

In light of the prior failures, I now dedicate one of the nodes as a file server, running ZFS for two pools (HDD and SDD) and exposing that via NFS (which is then consumed as PersistentVolumeClaims in Kubernetes).

ZFS provisioning was something like this (once all drives had been freshly partition w/ Linux partitions):

sudo zpool create -o ashift=12 storage-hdd /dev/sdb /dev/sde /dev/sdh
sudo zpool create -o ashift=12 storage-ssd /dev/sdf1 /dev/sdg1

This is a RAID0 setup btw, so maximum storage (and I think speed?) and zero redundancy- my drives are slow and garbage and small and my data is unimportant so this gives me what I need.

I can't recall the exact commands I ran to install the NFS server (pretty standard stuff though), but /etc/exports looks like this:

/storage-ssd *(insecure,sync,rw,no_subtree_check,no_root_squash,async)
/storage-hdd *(insecure,sync,rw,no_subtree_check,no_root_squash,async)

We haven't yet deployed Argo, but once we do, it'll deploy the pieces necessary to expose NFS for use by PersistentVolumeClaims.

Kubernetes provisioning

Cluster

I ran this to get the K3s cluster running:

# first node (ocnus)
curl -sfL https://get.k3s.io | K3S_TOKEN=some-token K3S_KUBECONFIG_MODE=644 sh -s - server --cluster-init

# subsequent nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=some-token K3S_KUBECONFIG_MODE=644 sh -s - server --server https://192.168.137.34:6443

# on the low-spec nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=some-token K3S_KUBECONFIG_MODE=644 sh -s - agent --server https://192.168.137.34:6443

| FYI, /etc/rancher/k3s/k3s/yaml on the first node is basically ~/.kube/config, it just needs to have the IP changed after copying before you can use it on your own workstation.

I made sure to have the following at /var/lib/rancher/k3s/server/manifests/traefik-config.yaml on each node:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    dashboard:
      enabled: true
    podAnnotations:
      prometheus.io/port: "8082"
      prometheus.io/scrape: "true"
    providers:
      kubernetesIngress:
        publishedService:
          enabled: true
        allowEmptyServices:
          enabled: true
        allowExternalNameServices:
          enabled: true
    priorityClassName: "system-cluster-critical"
    image:
      name: "rancher/mirrored-library-traefik"
      tag: "2.9.4"
    tolerations:
    - key: "CriticalAddonsOnly"
      operator: "Exists"
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
    service:
      ipFamilyPolicy: "PreferDualStack"

The only difference from standard in the changes above is allowEmptyServices: true and allowExternalNameServices: true.

Then I ensured I had a /etc/rancher/k3s/registries.yaml on each node that read as follows:

mirrors:
  "kube-registry:5000":
    endpoint:
      - "http://kube-registry:5000"
configs:
  "kube-registry:5000":
    tls:
      insecure_skip_verify: true

This is coupled with an /etc/hosts entry on each node that points kube-registry to the IP of the node itself; e.g.:

192.168.137.34  kube-registry

Now we need to deploy the NFS provisioner:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm upgrade --atomic --install --namespace kube-system nfs-subdir-external-provisioner-ssd nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=192.168.137.253 --set nfs.path=/storage-ssd --set storageClass.name=nfs-ssd
helm upgrade --atomic --install --namespace kube-system nfs-subdir-external-provisioner-hdd nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=192.168.137.253 --set nfs.path=/storage-hdd --set storageClass.name=nfs-hdd

And bitnami-labs/sealed-secrets so we can safely store secrets in this repo:

Usage is something like this:

kubectl -n mynamespace create secret generic mysecret --dry-run=client --from-literal='key=value' -o yaml | kubeseal --controller-name=sealed-secrets --controller-namespace=kube-system -o yaml > mysealedsecret.yaml
helm upgrade --atomic --install --namespace kube-system sealed-secrets sealed-secrets/sealed-secrets --version 2.16.1

SSL certificate vending

cert-manager is a fantastic way to do SSL for free with Let's Encrypt; to ship that:

cd _cluster

# came from curl -L https://github.com/cert-manager/cert-manager/releases/download/v1.14.5/cert-manager.yaml
kubectl apply -f 1-cert-manager.yaml
kubectl apply -f 2-clusterissuer.yaml

Continuous Deployment

Argo isn't perfect but it seems to be the best out there; to ship that:

cd _cluster

helm repo add argo https://argoproj.github.io/argo-helm

# came from helm show values argo/argo-cd > 3-argocd-values.yaml (which needed some edits)
helm upgrade --atomic --install --namespace argo-cd --create-namespace argo-cd argo/argo-cd --version 7.0.0 --values 3-argocd-values.yaml

# dump out the secret
kubectl -n argo-cd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

At this point I needed to log in to the Argo UI and add this repo, and also add the cluster folder in this repo as the catalyst for the IaC.

Continuous Integration

TODO