Incident - High number of builds on GESIS server on 25 January #2905
Thank you for reporting :)
Binder Pod

Some builds failed with the following error, but this happens all the time. There was also a callback error, which also happens all the time. What caught my attention was this:
For a couple of minutes, the only entries in the log are health checks. Later, the health check starts failing:
But two minutes later, the health check starts working again.
Later, the Hub API stops working:
@arnim I believe that something on the GESIS network might have caused this. I'm ruling out the Kubernetes network because we didn't change anything there.

Build Pod

I didn't find anything relevant.
This is old and I can no longer look at the logs.
Around 5:00 am UTC+1 on 25 January 2025, the GESIS server started to accumulate build pods.
The number of build pods kept increasing from 5:00 am UTC+1 until 7:30 am UTC+1, when it dropped abruptly.
At the same time, the number of user pods on the GESIS server decreased.
The GESIS server was NOT cordoned during this period.
CPU load on the GESIS server was average during this period.
The popular repositories before the incident were:
The GESIS server looks normal now. cc @arnim