ray.init() unexpectedly hangs when -host-uds=all and -overlay2=none #11202
Comments
This is caused by a bug in the gofer filesystem client. One process creates a Unix domain socket file:
This creates a gofer.directfsDentry that retains the socket/unix/transport.HostBoundEndpoint representing the bound socket's host file descriptor. However, no reference is held on the dentry, so it can be evicted from the dentry cache, causing the endpoint to be lost. Another process connects to the socket before it is evicted:
After vigorous filesystem activity, the socket file's dentry is evicted, so when a later process connects to the same socket it gets a different endpoint:
(The gofer filesystem client synthesizes a gofer.endpoint in this case for historical reasons, cl/146172912 internally.) I don't think the HostBoundEndpoint can reasonably be recovered if the dentry is dropped, so I think the correct fix is to make dentries with an endpoint (or a pipe) unevictable, e.g. by holding an extra reference that is dropped by UnlinkAt(), as we do for synthetic dentries. Simply leaking a reference is sufficient to fix the repro, but is obviously not a complete fix:
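To illustrate the mechanism described in this comment, here is a toy model in Go. It is a sketch under stated assumptions, not gVisor code and not the actual patch: the `dentry`, `cache`, `bind`, `connect`, `evict`, and `unlink` names are hypothetical, and the endpoint is reduced to a string. It only shows why an unreferenced dentry loses its endpoint on eviction, and how an extra reference held until unlink would keep it.

```go
package main

import "fmt"

// dentry is a toy stand-in for a cached filesystem entry. It is NOT
// gVisor's gofer dentry type; every name in this file is hypothetical.
type dentry struct {
	refs     int    // references keeping the dentry alive
	endpoint string // stand-in for the bound socket's host endpoint
}

// cache models a dentry cache that may evict unreferenced entries.
type cache struct {
	entries      map[string]*dentry
	pinEndpoints bool // if true, dentries with an endpoint hold an extra ref
}

func newCache(pinEndpoints bool) *cache {
	return &cache{entries: make(map[string]*dentry), pinEndpoints: pinEndpoints}
}

// bind models creating a socket file whose dentry retains the endpoint.
func (c *cache) bind(path, endpoint string) {
	d := &dentry{endpoint: endpoint}
	if c.pinEndpoints {
		d.refs++ // extra reference, dropped only by unlink
	}
	c.entries[path] = d
}

// connect returns the retained endpoint, or a freshly synthesized one if
// the dentry is no longer cached.
func (c *cache) connect(path string) string {
	if d, ok := c.entries[path]; ok {
		return d.endpoint
	}
	return "synthesized:" + path
}

// evict models cache pressure: every dentry with no references is dropped.
func (c *cache) evict() {
	for path, d := range c.entries {
		if d.refs == 0 {
			delete(c.entries, path)
		}
	}
}

// unlink drops the pin (if any) and removes the entry from the cache.
func (c *cache) unlink(path string) {
	if d, ok := c.entries[path]; ok {
		if d.refs > 0 {
			d.refs--
		}
		delete(c.entries, path)
	}
}

func main() {
	for _, pin := range []bool{false, true} {
		c := newCache(pin)
		c.bind("/tmp/sock", "host-bound")
		before := c.connect("/tmp/sock")
		c.evict() // stands in for "vigorous filesystem activity"
		after := c.connect("/tmp/sock")
		fmt.Printf("pin=%v before=%s after=%s\n", pin, before, after)
	}
}
```

Without pinning, the second connect sees a different endpoint than the first, which matches the hang described in this issue: the peer is talking to an endpoint that the listener never bound. With pinning, both connects reach the same endpoint until the file is unlinked.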
Fixes #11202 PiperOrigin-RevId: 699263300
Description
I was trying to run the ray Python library in a container with container UDS create permissions (i.e., -host-uds=create or -host-uds=all) and -overlay2=none. I expect the initialization command ray.init() to return quickly, but the command instead seems to hang. I observe the correct behavior of returning quickly when I use:
- -host-uds that has no container create permissions, i.e., no flag or -host-uds=open, along with -overlay2=none
- -host-uds with no overlay flag
Running the same command with runc produces the correct behavior of returning quickly.
Running gdb in the container while it was running produced output indicating that the process was hanging while receiving a message over a socket. Inspecting the fd reveals that this is a Unix socket.
cc @thundergolfer @pawalt
Steps to reproduce
Dockerfile
Configure docker json
Configure the flags -host-uds=all and -overlay2=none in /etc/docker/daemon.json:
I also observe the bug when -host-uds=create.
Run with runsc
The expected behavior is that running the script should exit quickly. When running with runsc, I observe the following output before the process hangs:
When running with runc, I observe the same output, but the process exits quickly.
runsc version
docker version (if using docker)
uname
Linux ip-10-1-13-177.ec2.internal 5.15.0-301.163.5.2.el9uek.x86_64 #2 SMP Wed Oct 16 18:55:42 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
N/A
runsc debug logs (if available)