Traefik

Traefik is used as a reverse proxy, certificate manager and load balancer for sn06 and sn07.
It is deployed with an Ansible playbook that uses usegalaxy-eu's Traefik role. The role initializes a swarm cluster on the target host and creates the secrets, the specified network and the Docker swarm services.
Docker swarm and Traefik were chosen mainly for three reasons:

  1. Secret handling: Secrets are not saved inside env files, but are encrypted on disk and only available to the container they are attached to.
  2. Scalability and future failover safe deployments: By using Docker swarm you can quite easily add a second proxy node and use e.g. keepalived as ingress.
  3. Features: Traefik comes with many options for middlewares and Plugins. If you would like, for instance, to block certain IP ranges, this could be easily done using a plugin.

Basics

Default user rocky
IP 132.230.103.37
Host traefik.galaxyproject.eu
Traefik's logs /var/log/traefik

For pretty output, use hl:

sudo ./hl /var/log/traefik/traefik.log --follow

The machine is an ESXi VM. The University provides this fancy dashboard. Log in with your university handle and password.

Docker swarm basics

docker service ls shows the "services", which are similar to the ones in Kubernetes.
docker service rm deletes the service and all its containers, in case you would like to start from a clean slate.
docker service logs does not work with Traefik, because it writes directly to /var/log/traefik.
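
Several steps further down refer to `<Traefik container name>`. A minimal sketch for looking that name up, assuming the container name contains the string "traefik" (an assumption, check with plain `docker ps` if nothing is found). The function names are hypothetical helpers, not part of the role:

```shell
# Print the names of running containers whose name contains the given pattern.
# Assumption: the Traefik container has "traefik" somewhere in its name.
container_names() {
    docker ps --filter "name=$1" --format '{{.Names}}'
}

# Print only the first match, e.g. for use with `docker restart`.
first_container() {
    container_names "$1" | head -n 1
}
```

For example, `docker restart "$(first_container traefik)"` would restart the first matching container.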

New subdomains

To add a new subdomain, add a line to the file files/traefik/rules/template-subdomains.yml in the infrastructure-playbook repo. In case you wonder about the syntax: the file is a Go template for a YAML file, which might look similar to Ansible at first.
The line should follow the same scheme as the existing ones:

{{template "subdomain" "<your-subdomain-name>"}}

Be careful: the word subdomain in the second column needs to stay literally "subdomain". Only the third column changes to the new subdomain name, without the .usegalaxy.eu suffix.
Once this is deployed, Traefik automatically creates a router for it and fetches certificates for the subdomain as well as a wildcard certificate for its interactive tools (ITs).
If you did everything correctly, the new router appears on Traefik's dashboard.
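
To avoid typos in the template line, it can be generated rather than typed. The following is a sketch only: `subdomain_line` is a hypothetical helper, and the rule that a name consists of lowercase letters, digits and hyphens is an assumption about valid subdomain labels, not something the playbook enforces.

```shell
# Print the template line for a new subdomain.
# Rejects empty names, names with a leading or trailing hyphen, and names
# containing anything other than a-z, 0-9 and "-". This also rejects dots,
# so a name that still carries ".usegalaxy.eu" is refused.
subdomain_line() {
    name=$1
    case $name in
        ''|-*|*-|*[!a-z0-9-]*)
            echo "invalid subdomain name: '$name'" >&2
            return 1
            ;;
    esac
    printf '{{template "subdomain" "%s"}}\n' "$name"
}
```

Calling `subdomain_line example` prints a line that can be pasted next to the existing entries in template-subdomains.yml, while `subdomain_line example.usegalaxy.eu` fails with an error message.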

How to debug

🚑 Galaxy not reachable

  • Go to AWS and sign in using the credentials in the vault (aws.yml)
  • Navigate to Route 53
  • Click on hosted zones
  • Then on usegalaxy.eu
  • Change the A record for usegalaxy.eu and point it to sn06's public IP address (132.230.223.239)
  • The record usually has a TTL of 7200 s, which means that after 2 h all requests should reach sn06's IP instead of Traefik's.
  • To bridge that time, you can install nginx on the Traefik host and proxy_pass all requests directly to one headnode.
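
To see which address a resolver currently hands out and how long it will keep doing so, the A record can be queried with `dig`. This is a sketch that assumes `dig` is installed on the machine you run it from. The helper names are hypothetical.

```shell
# Read `dig +noall +answer` output on stdin and print "<ttl> <ip>" per A record.
# Answer lines have the form: name TTL class type address
parse_a_records() {
    awk '$4 == "A" { print $2, $5 }'
}

# Query the A record of a name, optionally against a specific resolver.
# Usage: a_record usegalaxy.eu [resolver-ip]
a_record() {
    if [ -n "$2" ]; then
        dig "@$2" "$1" A +noall +answer | parse_a_records
    else
        dig "$1" A +noall +answer | parse_a_records
    fi
}
```

When asked through a caching resolver, the TTL shown is the remaining cache lifetime, so it counts down towards zero before the new address is picked up.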

usegalaxy.eu /subdomain is showing a plain 404 not found

Most likely something happened to the router.

  • Check the dashboard via tailscale
  • If all routers look fine, check if something happened to the rule files in files/traefik/rules/: usegalaxy-eu-router.yml for usegalaxy.eu and template-subdomains.yml for subdomains. Take a close look at the HostRegexp rule.
  • Less likely: check that the servers in usegalaxy-eu-service.yml are correct and reachable.
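
Before opening the dashboard, it can help to confirm which status code is actually returned. The sketch below uses `curl`. The mapping from status codes to causes follows the headings of this page and is an assumption about how these errors usually surface (in particular that "no available server" comes with a 503), not an exhaustive rule. Both function names are hypothetical.

```shell
# Print only the HTTP status code for a URL ("000" if no response arrived).
http_status() {
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Map a status code to the most likely cause described on this page.
triage() {
    case $1 in
        404) echo "router missing or rule not matching" ;;
        502) echo "bad gateway: headnode, certificates or service config" ;;
        503) echo "no available server: headnodes down or health check failing" ;;
        000) echo "no HTTP response: DNS or network problem" ;;
        *)   echo "status $1: not covered here" ;;
    esac
}
```

A typical call would be `triage "$(http_status https://usegalaxy.eu)"`.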

Bad Gateway error

  • One headnode may be unhealthy. You can test a headnode directly, e.g. sn07.galaxyproject.eu. Traefik should automatically skip unhealthy hosts; see the dashboard and docs.
  • The certificates may be faulty. Once you have checked everything else, make a backup of /etc/traefik/acme.json and delete all its contents (not the file itself), then restart Traefik using docker restart <Traefik container name>
  • Service configuration might be faulty: check usegalaxy-eu-service.yml
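
The backup-and-empty step for acme.json can be sketched as follows. `reset_acme` is a hypothetical helper. The backup file name with a timestamp suffix is an arbitrary choice for illustration.

```shell
# Back up an ACME storage file, then empty it without deleting it.
# Prints the path of the backup on success.
reset_acme() {
    file=$1
    backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$file" "$backup" || return 1   # -p keeps mode and timestamps
    : > "$file" || return 1               # truncate in place; the file itself stays
    echo "$backup"
}
```

Run as root, `reset_acme /etc/traefik/acme.json` leaves an empty acme.json with its original permissions. Afterwards restart Traefik with docker restart <Traefik container name>.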

no available server

This could mean that either

  • both headnodes are down
  • Traefik was unable to get a response when doing the healthCheck

In the latter case, check whether DNS works inside the container. If it fails, tailscale might be the issue:

sudo tailscale up --accept-dns=false --advertise-tags=tag:critical
docker restart <Traefik container name>

This should not be necessary, because it is set in group_vars/all.yml.
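
Checking DNS from inside the container could look like the sketch below. It assumes the Traefik image ships `nslookup` (BusyBox provides it on Alpine-based images). If it does not, another resolver tool available in the image has to be used. The helper names are hypothetical.

```shell
# Succeeds if the hostname resolves from inside the given container.
# Assumption: `nslookup` exists in the container image.
container_dns_ok() {
    docker exec "$1" nslookup "$2" >/dev/null 2>&1
}

# Report which of the given hosts resolve from inside the container.
# Usage: check_hosts <container> <host> [<host> ...]
check_hosts() {
    container=$1
    shift
    for host in "$@"; do
        if container_dns_ok "$container" "$host"; then
            echo "ok   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

For example, `check_hosts <Traefik container name> sn07.galaxyproject.eu` prints one line per host. A FAIL for a headnode that resolves fine on the host itself points at DNS inside the container.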

Self signed certificate warning

This can only appear when many certificates have to be fetched at the same time.
AWS Route 53 has a strict rate limit of 5 req/s. If Traefik tries to create and check the TXT records for all subdomains during the DNS-01 challenge, this can result in more than 100 req/s. Fetching then takes some time, and Traefik obtains the certificates gradually. If you still see error messages after 1 h, you can try restarting Traefik.

Letsencrypt i/o error

Probably an egress issue with Docker networks. Recreating the bridge helped:

# systemctl stop docker
# iptables -t nat -F
# ip link set docker0 down
# brctl delbr docker0
# systemctl start docker