Scheduler tasks can hang on dropped HTTP connection #91
Currently seeing a worker count of 0 on SV. A scheduler restart does not appear to have solved the issue.
Hmm, how long did you wait after the restart? I believe the worker count cache code only runs every 10 minutes. If that's the case then perhaps I misdiagnosed...
So to clarify: I'm not 100% sure how long I waited; I'd guesstimate ~15 mins. It's quite possible that whatever is causing the issue happened soon after I restarted, although I did make several attempts. I have restarted the scheduler several times and had it fix the bug. The bug is also occurring (pretty frequently) on SimpleDoge at the moment as well, so it's not related to Geo.
Good to know; I'll look into it more once multi is out.
It may actually be a bug in the new code. I'm not sure exactly when it first appeared, but it's only been on Doge + Vert, and only just recently.
We haven't deployed code to them in over 3 weeks, so it seems unlikely to be a code change. OVH's network has definitely seemed flakier lately, so I'm betting it's related to that.
I was thinking it could be an issue introduced by the new Powerpool code. Possibly it's set up to handle HTTP requests differently or something? I don't know; it doesn't seem terribly likely, but the timing is suspicious.
We're not running new powerpools on Doge yet, so probably not.
Yeah, that's right, doh. Well, it's not a network issue: Doge is having the problem, and it doesn't use Geos.
Tasks that make remote requests can hang indefinitely if a socket connection is silently dropped. Since APScheduler will not run two instances of the same task at once, the situation never resolves itself without a restart of the scheduler. A simple fix is to set:
socket._GLOBAL_DEFAULT_TIMEOUT = 60
to cause all socket connections to eventually time out. This is likely also the cause of Celery hanging on the simplecrypto/pool_list.
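For reference, here is a minimal sketch of how that could be wired in, assuming Python 2 and only the standard library. It uses the public `socket.setdefaulttimeout()` call (the supported way to give every subsequently created socket a default timeout) rather than poking the private `_GLOBAL_DEFAULT_TIMEOUT` attribute; the `fetch_pool_data` helper and its URL are hypothetical stand-ins for whatever the real task fetches:

```python
import socket
import urllib2  # assuming Python 2, which this stack appears to target

# Apply a process-wide default at scheduler startup so that any socket whose
# peer silently disappears raises socket.timeout instead of blocking forever.
socket.setdefaulttimeout(60)

def fetch_pool_data(url):
    """Hypothetical task body; a hung read now fails after 60 seconds."""
    try:
        return urllib2.urlopen(url).read()
    except (socket.timeout, urllib2.URLError):
        # Returning instead of hanging lets APScheduler start the next
        # scheduled run of this task normally.
        return None
```

Note that `setdefaulttimeout()` affects every socket created in the process afterwards, including ones that may legitimately block, so the 60-second value above is only a guess.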
This manifested as the worker count going to 0 on SimpleVert.com when a connection to a geo stratum was dropped.
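If the worker-count poll goes through an HTTP client such as requests (an assumption; the endpoint and helper below are hypothetical), a per-call timeout is a more targeted alternative to the process-wide default:

```python
import requests

WORKER_COUNT_URL = "http://geo-stratum.example:9000/workers"  # hypothetical endpoint

def poll_worker_count():
    try:
        # A silently dropped connection now surfaces as a timeout after
        # 30 seconds instead of hanging this task (and every future run of it).
        resp = requests.get(WORKER_COUNT_URL, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None  # report failure; the next scheduled run proceeds on time
```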