Traefik is used as a reverse proxy, certificate manager and loadbalancer for sn06
and sn07
.
It is deployed using an ansible-playbook which uses usegalaxy-eu's Traefik role. This role internally initializes a swarm cluster on the target host, creates secrets, the specified network and docker swarm services.
Docker swarm was used for mainly two reasons:
- Secret handling: Secrets are not saved inside env files, but are encrypted on disk and only available to the container they are attached to.
- Scalability and future failover safe deployments: By using Docker swarm you can quite easily add a second proxy node and use e.g. keepalived as ingress.
- Features: Traefik comes with many options for middlewares and Plugins. If you would like, for instance, to block certain IP ranges, this could be easily done using a plugin.
Default user | rocky |
IP | 132.230.103.37 |
Host | traefik.galaxyproject.eu |
Traefiks logs | /var/log/traefik |
To have a pretty output use hl
:
sudo ./hl /var/log/traefik/traefik./glog --follow
The machine is a ESXi VM. The University provides this fancy dashboard. Log in with your university handle and password.
docker service ls
shows the "services" which are similar to the ones in kubernetes.
docker service rm
deletes the service and all its respective containers, in case you would like to start from clean slate.
docker service logs
would not work with Traefik, because it writes directly to /var/log/traefik
To add new subdomains, add a new line to the file files/traefik/rules/template-subdomains.yml
in the infrastructure-playbook repo. The language there is a go
template for a yaml
file, which might look similar to ansible at first. (In case you wonder about the syntax).
The line should look like the ones above, like this scheme:
{{template "subdomain" "<your-subdomain-name>"}}
Be careful, the word subdomain
in the second colum needs to stay literaly "subdomain", only the 3rd column is changed to the new subdomain, but without any .usegalaxy.eu
.
Once this is deployed, Traefik will automatically create a router for it and fetch certificates for the subdomain as well as a wildcard certificate for ITs.
If you did everything correctly, the new router appears on Traefik's dashboard.
- Go to aws and sign in using the credentials in the vault (
aws.yml
) - Navigate to
route53
- click on
hosted zones
- then on
usegalaxy.eu
- change the
A Record
forusegalaxy.eu
and point it to sn06's public IP address (132.230.223.239
) - The record usually has a TTL of 7200s, which means, that after 2h all requests should get sn06's IP instead of Traefik's.
- In order to bridge that time, you can install nginx on Traefik and
proxy_pass
all requests to one headnode directly.
Most likely something happened to the router.
- Check the dashboard via tailscale
- If all routers look fine, check if something happened to the rulefile in
files/traefik/rules/
usegalaxy-eu-router.yml
for usegalaxy.eu andtemplate-subdomains.yml
for subdomains. Take a close look at theHostRegexp
rule. - Less likely: check that the
servers
inusegalaxy-eu-service.yml
are correct and reachable.
- One headnode unhealthy? You can test with e.g.
sn07.galaxyproject.eu
directly. Traefik should automatically skip unhealth hosts, see the dashboard and docs - Can be faulty certificates. When checked everything else, make a backup of
/etc/traefik/acme.json
and delete all its contents (not the file itself), then restart Traefik usingdocker restart <Traefik container name>
- Service configuration might be faulty: check
usegalaxy-eu-service.yml
This could mean that either
- both headnodes are down
- Traefik was unable to get a response when doing the healthCheck
- in the latter case, check if the DNS inside the container works. If it fails, tailscale might be the issue.
sudo tailscale up --accept-dns=false --advertise-tags=tag:critical
docker restart <Traefik container name>
This should not be necessary, because it is set in group_vars/all.yml
Could only appear when many certs have to be fetched newly at the same time.
AWS route53
has a harsh rate limit of 5 req/s, if Traefik tries to create and check the TXT records
during DNS-01 challenge
for all subdomains, this could result in >100 req/s. It will take some time and Traefik will get more and more certs. If you see error messages after 1h, you can try to restart Traefik.
Probably a egress issue with Docker networks. Recreating the bridge helped:
# systemctl stop docker
# iptables -t net -F
# ip link set docker0 down
# brctl delbr br100
# systemctl start docker