Deadlock (?) between garbage collection and RcvQueue worker thread termination #83
Comments
I suspect that the problem is related to the one described here: https://sourceforge.net/p/udt/discussion/393036/thread/d95e119f/?limit=25#1c43
In my case, however, the file descriptor is not reused by UDT but by another part of the application which opens a completely unrelated TCP socket with the same file descriptor. This new socket is perfectly fine and will happily block in the |
…chart#83 Otherwise, the socket is closed and the file descriptor is freed prematurely. Due to a race condition, the queues might then still try to use the old file descriptor, which by now may already refer to another, unrelated socket. This can lead to "stolen" data, i.e. data that the queue worker of the already closed socket reads by accident. Or it can lead to a complete deadlock: if the file descriptor now refers to a blocking socket, `delete m->second.m_pRcvQueue` never returns, because it joins the worker thread, which blocks indefinitely on the wrong socket. After some time this deadlock brings UDT to a complete halt, because the above code holds the m_ControlLock, and every other thread that needs this lock eventually blocks on it.
…chart#83 (cherry picked from commit ccb843e)
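The descriptor reuse described in the commit message can be reproduced without UDT. The following is only a minimal sketch, assuming a POSIX system and a process in which no other thread opens descriptors at the same time; `descriptor_is_reused` is a made-up helper name, not part of UDT. It relies on the POSIX rule that a new descriptor gets the lowest free number, so a socket created right after a `close` usually receives the number that was just released.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not UDT code): returns true if a freshly created TCP
// socket gets the same descriptor number that a just-closed UDP socket had.
bool descriptor_is_reused() {
    int udp_fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (udp_fd < 0) return false;

    // A worker thread that cached this number would keep using it after close().
    const int stale_fd = udp_fd;
    ::close(udp_fd);

    // An unrelated part of the application opens a TCP socket. The kernel hands
    // out the lowest free descriptor number, which is the one just released.
    int tcp_fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (tcp_fd < 0) return false;

    const bool reused = (tcp_fd == stale_fd);
    ::close(tcp_fd);
    return reused;
}
```

If a queue worker still calls `recvmsg` on `stale_fd` at this point, it operates on the TCP socket, which is the situation the commit message describes.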
We observe a situation where UDT hangs completely, with many threads stuck waiting for the `m_ControlLock`.
At this point the lock is held by the garbage collection thread (in `checkBrokenSockets`), which is waiting for a rcv queue worker thread to terminate. The worker thread seems to be stuck in `recvmsg`.
This doesn't seem to be a classical deadlock; maybe it's more a problem with the blocking `recvmsg` call.
Does anyone have an idea how this could happen?
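To illustrate why a blocking receive matters for the join, here is a rough sketch of a worker that waits with a bounded `poll` timeout and checks a stop flag, so that joining it cannot block forever. This is not how UDT's rcv queue is implemented; `PollingReceiver`, its members, the 100 ms timeout and `stop_returns_on_idle_socket` are all assumptions made up for this example.

```cpp
#include <atomic>
#include <cerrno>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <thread>
#include <unistd.h>

// Hypothetical worker (not UDT's rcv queue): it never blocks longer than one
// poll timeout, so stop() always returns, even if nothing ever arrives.
class PollingReceiver {
public:
    explicit PollingReceiver(int fd)
        : fd_(fd), stop_(false), packets_(0), worker_([this] { run(); }) {}
    ~PollingReceiver() { stop(); }

    // Ask the worker to finish and wait for it (at most about one timeout).
    void stop() {
        stop_.store(true);
        if (worker_.joinable()) worker_.join();
    }
    int packets() const { return packets_.load(); }

private:
    void run() {
        char buf[1500];
        while (!stop_.load()) {
            pollfd p{fd_, POLLIN, 0};
            int ready = ::poll(&p, 1, 100);  // wait at most 100 ms
            if (ready < 0 && errno != EINTR) break;
            if (ready <= 0) continue;        // timeout or EINTR: re-check the flag
            if (p.revents & (POLLERR | POLLNVAL)) break;
            if (::recv(fd_, buf, sizeof buf, MSG_DONTWAIT) > 0) ++packets_;
        }
    }

    int fd_;
    std::atomic<bool> stop_;
    std::atomic<int> packets_;
    std::thread worker_;  // declared last so the other members exist before run()
};

// Hypothetical check: stopping a worker on an idle UDP socket must return.
bool stop_returns_on_idle_socket() {
    int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;
    {
        PollingReceiver rx(fd);
        rx.stop();  // would hang here if the worker sat in a blocking recvmsg
    }
    ::close(fd);
    return true;
}
```

A bounded wait only limits how long the join can take. It does not fix the underlying race: if the descriptor has been closed and reused, the worker can still read from the wrong socket.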