Connect to your appliance via SSH:
ssh -p 122 [email protected]
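You can also run one-off commands over SSH without an interactive session, using the same pattern as the support bundle example near the end of this document:
ssh -p 122 [email protected] -- 'tail -n 5 /var/log/github/exceptions.log'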
The exceptions log is JSON, and can be hard on the eyes. jq can help you make more sense of it.
Readable exception messages with times. The -c flag compacts the output so each entry fits on one line.
jq -c '{created_at, message}' /var/log/github/exceptions.log
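To narrow that down to a time window, filter on created_at; a sketch, assuming created_at is an ISO 8601 string (so string comparison sorts chronologically) and using a hypothetical cutoff:
jq -c 'select(.created_at >= "2017-06-01T15:00:00") | {created_at, message}' /var/log/github/exceptions.log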
Count exceptions that aren't slow requests or queries and find the most common ones.
grep -vi slow /var/log/github/exceptions.log | jq .message | sort | uniq -c | sort -nr | head
Print a readable backtrace for an exception:
grep -i -m1 "timeout" /var/log/github/exceptions.log | jq -r .backtrace
Print everything except the backtrace:
grep -i -m1 "timeout" /var/log/github/exceptions.log | jq 'del(.backtrace)'
Top repos with slow requests:
grep SlowRequest /var/log/github/exceptions.log | jq '.repo' | sed -e 's/.*\///' | sort | uniq -c | sort -rn
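To keep the owner in the output rather than just the repository name, drop the sed step:
grep SlowRequest /var/log/github/exceptions.log | jq -r '.repo' | sort | uniq -c | sort -rn | head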
All Git requests go through babeld, so it's a useful place to look for Git traffic. The babeld log format is a series of key=value pairs. Run the following commands on the appliance for some initial insight into its Git operations.
Request counts in 10 minute intervals, which can be useful for spotting sudden increases in traffic; such spikes can indicate misconfigured or poorly written scripts, for example.
cut -c 1-15 /var/log/babeld/babeld.log | uniq -c
The same counts at 1 minute resolution (one more character of the timestamp):
cut -c 1-16 /var/log/babeld/babeld.log | uniq -c
Top IP addresses; often one IP or subnet stands out as misbehaving:
grep -o 'ip=[^ ]*' /var/log/babeld/babeld.log | sort | uniq -c | sort -nr | head
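If you suspect a subnet rather than a single address, collapse the last octet before counting; a sketch, assuming IPv4 addresses:
grep -o 'ip=[^ ]*' /var/log/babeld/babeld.log | sed -e 's/\.[0-9]*$/.0\/24/' | sort | uniq -c | sort -nr | head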
Filter by a specific IP address to see when its traffic spiked:
grep -F '172.18.1.201' /var/log/babeld/babeld.log | cut -c 1-13 | uniq -c
Top requested repositories:
grep -o 'repo=[^ ]*' /var/log/babeld/babeld.log | sort | uniq -c | sort -nr | head
Top requested organizations or users:
grep -o 'repo=[^/]*' /var/log/babeld/babeld.log | sort | uniq -c | sort -nr | head
You can also download a diagnostics file to your local machine (to the root of D:\ in the example below) by running:
ssh -p 122 [email protected] ghe-diagnostics > /d/ghe-diagnostics_$(date +%Y%m%d).txt
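Many of the commands below use relative paths like babeld-logs/babeld.log and github-logs/auth.log; those assume you are working inside an extracted support bundle (see the ghe-support-bundle section at the end) rather than on the appliance itself. A minimal sketch, assuming the bundle was saved as support-bundle.tgz:
mkdir bundle && tar -xzf support-bundle.tgz -C bundle && cd bundle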
Only look at a specific protocol.
grep proto=http babeld-logs/babeld.log | grep -o 'repo=[^/]*' | sort | uniq -c | sort -nr | head
Extract the duration_ms value and sort to find how long the longest running operations take:
grep 'my/repo' babeld-logs/babeld.log | sed -e "s/.*duration_ms=\([^ ]*\).*/\1/" | sort -nr | head -n 20
Feed the duration_ms high scores back into grep to inspect the full babeld events:
grep 'my/repo' babeld-logs/babeld.log | sed -e "s/.*duration_ms=\([^ ]*\).*/\1/" | sort -nr | head -n 5 | grep -f - babeld-logs/babeld.log
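Because the raw numbers can also match other fields in unrelated lines, prefixing them with duration_ms= before the final grep makes the lookup exact:
grep 'my/repo' babeld-logs/babeld.log | sed -e "s/.*duration_ms=\([^ ]*\).*/\1/" | sort -nr | head -n 5 | sed -e 's/^/duration_ms=/' | grep -F -f - babeld-logs/babeld.log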
Number of pushes in 10 minute intervals:
grep receive-pack babeld-logs/babeld.log | cut -c 1-15 | uniq -c
Number of clones in 10 minute intervals:
grep upload-pack babeld-logs/babeld.log | cut -c 1-15 | uniq -c
Top 10 repos with pushes in a specific 10 minute interval:
grep -F 'Wed Jun 1 15:5' babeld-logs/babeld.log | grep -F receive-pack | grep -oP 'repo=[^ ]*' | sort | uniq -c | sort -rn | head
Users failing authentication - helpful in spotting misconfigured systems that may be tying up auth workers.
grep -F 'at=failure' github-logs/auth.log | grep -o 'login=[^ ]*'| sort | uniq -c | sort -nr | head
Authentication failure reason for a specific user.
grep -F 'at=failure' github-logs/auth.log | grep 'login=username' | grep -o 'failure_type=[^ ]*' | sort | uniq -c
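To run that breakdown for several accounts at once, loop over the top failing logins from the earlier command; a sketch:
grep -F 'at=failure' github-logs/auth.log | grep -o 'login=[^ ]*' | sort | uniq -c | sort -nr | head -n 5 | awk '{print $2}' | while read login; do echo "== $login"; grep -F 'at=failure' github-logs/auth.log | grep -F "$login" | grep -o 'failure_type=[^ ]*' | sort | uniq -c; done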
Authentications failing because of invalid LDAP usernames.
grep "Invalid ldap username" github-logs/auth.log | grep -o "login=[^ ]* " | sort | uniq -c | sort -nr | head
LDAP connection issues or request saturation through polling. Reported when a gitauth worker process attempting LDAP authentication takes too long to respond. This can contribute to load, as other requests may be forced to wait during the timeout period.
grep "unexpected return code from _gitauth" babeld-logs/babeld.log | cut -c 1-15 | uniq -c
Gitauth workers terminated due to timeout. This could indicate a failing or slow-to-respond LDAP server. It is also possible to adjust the LDAP authentication timeout settings.
grep "unicorn worker killed for taking too long" github-logs/exceptions.log | grep ldap | jq .created_at | cut -c 1-15 | uniq -c
Top web or API requests by user agent:
awk '{print $12}' web-logs/github.log | sort | uniq -c | sort -nr | head
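Note that $12 assumes the user agent is a single token in the twelfth whitespace-separated field; if it is a quoted string containing spaces, only its first word gets counted. An alternative sketch, assuming the user agent is the last double-quoted field on each line:
grep -o '"[^"]*"$' web-logs/github.log | sort | uniq -c | sort -nr | head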
API requests from haproxy.log in 10 minute intervals:
grep -F 'api/v3' system-logs/haproxy.log | cut -c 1-11 | uniq -c
Top API endpoints:
grep -F 'api/v3' system-logs/haproxy.log | grep -o "} \".*" | sort | uniq -c | sort -nr | head -n 10
Top API endpoints in a specific 10 minute interval:
grep -F 'Jun 1 16:0' system-logs/haproxy.log | grep -F 'api/v3' | grep -o "} \".*" | sort | uniq -c | sort -nr | head -n 10
Top 10 API IPs in a specific 10 minute interval:
grep -F 'Jun 1 16:0' system-logs/haproxy.log | grep -F 'api/v3' | grep -oP "]: [^:]*" | sort | uniq -c | sort -nr | head -n 10
Request counts for a specific IP in 10 minute intervals (substitute the address for [IP]):
grep '[IP]' system-logs/haproxy.log.1 | cut -c 1-11 | uniq -c
Check for slow disk IO:
grep "disk is busy" redis-logs/redis.log
Strip syslog noise to look for abnormalities:
grep -v -e UFW -e CRON -e syslog-ng system-logs/syslog | less
Hypervisor CPU stealing:
grep "soft lockup" system-logs/systemd/journalctl-k.log
Tear down the previous replication pair by running:
ghe-repl-stop && ghe-repl-teardown
on the previous replica before setting up a new one. Running the old and new replicas in parallel is not supported.
Run this on the replica instance:
ghe-repl-status -vv
and this on the primary:
ghe-spokes status
We can address the repositories with bad checksums. To gain more insight into the problems there, please send support the output of the following from the primary instance. For organization repos:
ghe-spokes diagnose ORG/Repo, e.g. ghe-spokes diagnose GR/HBS
For user repos:
ghe-spokes diagnose user/repo, e.g. ghe-spokes diagnose markri/personal1
In some cases an old replica remains. Use the command below to get the UUIDs of the valid primary and replica servers:
cat /data/user/common/uuid
Then remove the old entry by destroying it with the following command:
ghe-spokes server destroy git-server-590345ea-1453-11e7-a18e-0050569d7f1a
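Afterwards, running ghe-spokes status again should no longer list the destroyed git-server entry:
ghe-spokes status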
The gpgverify service on the appliance might be having trouble starting correctly, likely due to a known issue which manifests when the instance is not cleanly restarted. The issue is fixed in the 2.10.4 release, but please run the following on the primary to resolve it locally:
sudo systemctl stop gpgverify
sudo rm /data/gpgverify/current/tmp/sockets/gpgverify.sock
sudo systemctl start gpgverify
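Then confirm the service came back up cleanly:
sudo systemctl status gpgverify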
ghe-support-bundle errors
Sometimes when you try to generate logs you get the following:
ssh -p 122 [email protected] -- 'ghe-support-bundle -x -o' > /d/syslog/support/support-bundle.tgz
mkdir: cannot create directory '/var/log/haproxy.log.1': File exists
ssh -p 122 [email protected] -- 'ghe-support-bundle -o' > /d/syslog/support/support-bundle.tgz
mkdir: cannot create directory '/var/log/haproxy.log.1': File exists
This is fixed in 2.10.4, but the workaround is:
sudo sed -i 's#sanitize_logs /var/log/haproxy.log\*#sanitize_logs "/var/log/haproxy.log*"#' /usr/local/bin/ghe-support-bundle
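To confirm the edit took effect before re-running the bundle:
grep sanitize_logs /usr/local/bin/ghe-support-bundle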