Dispatcher deadlock #10482

Closed
v-lopez opened this issue May 9, 2022 · 13 comments

Comments

@v-lopez

v-lopez commented May 9, 2022


Required Info
Camera Model D455
Firmware Version 05.13.00.50
Operating System & Version Linux
Kernel Version (Linux Only) 5.13.0-40-generic
Platform PC
SDK Version v2.50.0
Language C++
Segment Robot

Issue Description

I have a deadlock when starting my realsense node.

I have 5 cameras connected, but I can reproduce this with just one camera connected.

When I launch it with initial_reset:=true, the device enumeration phase gets stuck and never completes:

[ INFO] [1652093883.641541163]: Initializing nodelet with 8 worker threads.
[ INFO] [1652093884.313300323]: RealSense ROS v2.3.2
[ INFO] [1652093884.313323562]: Built with LibRealSense v2.50.0
[ INFO] [1652093884.313337444]: Running with LibRealSense v2.50.0
[ INFO] [1652093884.351036837]:  
[ INFO] [1652093884.364277927]: Device with serial number 141322251881 was found.

[ INFO] [1652093884.364313104]: Device with physical ID /sys/devices/pci0000:00/0000:00:0d.0/usb2/2-3/2-3:1.0/video4linux/video6 was found.
[ INFO] [1652093884.364331768]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093884.364670292]: Device with port number 2-3 was found.
[ INFO] [1652093884.476617147]: Device with serial number 141322251965 was found.

[ INFO] [1652093884.476652851]: Device with physical ID /sys/devices/pci0000:00/0000:00:0d.0/usb2/2-4/2-4:1.0/video4linux/video18 was found.
[ INFO] [1652093884.476670643]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093884.476874191]: Device with port number 2-4 was found.
[ INFO] [1652093884.574994608]: Device with serial number 141322251883 was found.

[ INFO] [1652093884.575037884]: Device with physical ID /sys/devices/pci0000:00/0000:00:14.0/usb4/4-1/4-1:1.0/video4linux/video0 was found.
[ INFO] [1652093884.575053133]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093884.575364006]: Device with port number 4-1 was found.
[ INFO] [1652093884.665959868]: Device with serial number 141322250914 was found.

[ INFO] [1652093884.665985571]: Device with physical ID /sys/devices/pci0000:00/0000:00:14.0/usb4/4-3/4-3:1.0/video4linux/video12 was found.
[ INFO] [1652093884.665995978]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093884.666168420]: Device with port number 4-3 was found.
[ INFO] [1652093884.776302875]: Device with serial number 141322252685 was found.

[ INFO] [1652093884.776331694]: Device with physical ID /sys/devices/pci0000:00/0000:00:14.0/usb4/4-4/4-4:1.0/video4linux/video24 was found.
[ INFO] [1652093884.776345106]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093884.776582970]: Device with port number 4-4 was found.
[ INFO] [1652093884.776621881]: Device USB type: 3.2
[ INFO] [1652093884.776650379]: Resetting device...
[ INFO] [1652093891.478201267]:  
[ INFO] [1652093891.490861140]: Device with serial number 141322251881 was found.

[ INFO] [1652093891.490880066]: Device with physical ID /sys/devices/pci0000:00/0000:00:0d.0/usb2/2-3/2-3:1.0/video4linux/video6 was found.
[ INFO] [1652093891.490885771]: Device with name Intel RealSense D455 was found.
[ INFO] [1652093891.491050484]: Device with port number 2-3 was found.

After attaching GDB, I can see that thread 17 is a dispatcher thread stuck on this line, having acquired _dispatch_mutex. Following the call stack, it is blocked on this line, waiting for _devices_changed_callbacks_mtx.

Meanwhile, thread 20 is destroying the same realsense device here. That calls unregister_internal_device_callback, which acquires _devices_changed_callbacks_mtx and then calls _device_watcher->stop(); here. stop() requires _dispatch_mutex, but thread 17 holds it, which causes the deadlock.
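The pattern described above is a lock-order inversion: one thread takes the first mutex and then wants the second, while another thread takes them in the opposite order. Below is a minimal standalone sketch of that pattern, not the librealsense code. The names `lock_pair`, `arrive_and_wait` and `reproduces_lock_order_inversion` are hypothetical, and the two mutexes only stand in for `_dispatch_mutex` and `_devices_changed_callbacks_mtx`. It uses `std::timed_mutex` with a timeout so that the inversion is reported instead of hanging the process.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two mutexes named in the report.
struct lock_pair
{
    std::timed_mutex dispatch_mutex;   // plays the role of _dispatch_mutex
    std::timed_mutex callbacks_mutex;  // plays the role of _devices_changed_callbacks_mtx
};

// Tiny rendezvous helper: returns once n threads have arrived.
inline void arrive_and_wait(std::atomic<int>& counter, int n)
{
    assert(n > 0);
    ++counter;
    while (counter.load() < n)
        std::this_thread::yield();
}

// Returns true when each thread held its first mutex and could not
// obtain the second one, i.e. when the real code would have deadlocked.
bool reproduces_lock_order_inversion()
{
    lock_pair locks;
    std::atomic<int> holding_first{0};
    std::atomic<int> tried_second{0};
    bool dispatcher_got_second = false;
    bool destructor_got_second = false;
    const auto timeout = std::chrono::milliseconds(50);

    // "Thread 17": holds the dispatch mutex, then wants the callbacks mutex.
    std::thread dispatcher([&] {
        std::lock_guard<std::timed_mutex> first(locks.dispatch_mutex);
        arrive_and_wait(holding_first, 2);
        dispatcher_got_second = locks.callbacks_mutex.try_lock_for(timeout);
        if (dispatcher_got_second)
            locks.callbacks_mutex.unlock();
        arrive_and_wait(tried_second, 2);  // keep `first` held until both have tried
    });

    // "Thread 20": holds the callbacks mutex, then wants the dispatch mutex.
    std::thread destructor([&] {
        std::lock_guard<std::timed_mutex> first(locks.callbacks_mutex);
        arrive_and_wait(holding_first, 2);
        destructor_got_second = locks.dispatch_mutex.try_lock_for(timeout);
        if (destructor_got_second)
            locks.dispatch_mutex.unlock();
        arrive_and_wait(tried_second, 2);
    });

    dispatcher.join();
    destructor.join();
    return !dispatcher_got_second && !destructor_got_second;
}
```

Calling `reproduces_lock_order_inversion()` from a test or a `main` returns `true` after roughly the timeout. With plain `std::mutex::lock()` in place of `try_lock_for`, both threads would block forever, which matches the hang seen in GDB.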

The full call stacks are below:
thread_20.txt
thread_17.txt

This has happened on every attempt I have made. I have restarted all realsense applications but have not rebooted my computer yet, which may clear the issue, but I am worried about this happening in the future at a customer facility.

@MartyG-RealSense
Collaborator

Hi @v-lopez, may I first ask whether the problem still occurs if you do not include initial_reset:=true in the launch instruction? initial_reset is optional and can be useful if problems occur during or after launch, but it is not a requirement to set it to true.

@v-lopez
Author

v-lopez commented May 9, 2022

It doesn't occur without initial_reset, but I had to add initial_reset due to other issues.

@MartyG-RealSense
Collaborator

I also note that you are using kernel 5.13.0-40-generic. The librealsense SDK does not officially support kernel 5.13, and although the SDK can work with unsupported kernels, doing so can have unforeseen consequences for stability. The most recent supported kernels are 5.4 when building the SDK from Debian packages, and 5.8 and 5.11 when building it from source code.

The kernel dependency can, however, be bypassed by building the SDK from source code with the RSUSB backend installation method, which does not depend on Linux or kernel versions and does not require patching.

@v-lopez
Author

v-lopez commented May 9, 2022

I am running it with RSUSB_BACKEND off, as I had trouble running 5 cameras with it.
However, my impression is that this failure may happen regardless of the backend and kernel.

@MartyG-RealSense
Collaborator

MartyG-RealSense commented May 9, 2022

What other issues have you been experiencing when initial_reset is not true, please?

@v-lopez
Author

v-lopez commented May 9, 2022

I don't have the logs anymore, but some of the streams such as Infra1 were not starting.

@MartyG-RealSense
Collaborator

The reference to _devices_changed_callbacks_mtx gives me the impression that when a camera is reset during the launch, the computer cannot find it again after it has disconnected when reset (devices_changed_callback handles events related to listening for camera connection and disconnection).

Are you using the official 1 meter USB cables supplied with the camera or your own choice of USB cable?

@v-lopez
Author

v-lopez commented May 9, 2022

No, we use a different 0.5 m USB 3.2 cable that we found was able to deliver what we needed.

I wrote a workaround in the library that stops the dispatcher thread at the beginning of the device destructor. This avoids the deadlock and the system works as expected, but I don't know the code well enough to determine whether it makes sense.
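To illustrate the idea behind the workaround (this is not the actual patch, which is not shown in this thread), here is a minimal standalone sketch. All names in it (`mini_dispatcher`, `mini_context`, `run_and_destroy`) are hypothetical. The point is the ordering in the destructor: the worker thread is stopped and joined first, while no mutex is held, and only then is the callbacks mutex taken. Taking the mutex first and then joining a worker that may be blocked on that same mutex is the analogue of the hang reported above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature dispatcher: a worker thread that keeps invoking a task.
class mini_dispatcher
{
public:
    explicit mini_dispatcher(std::function<void()> task)
        : _task(std::move(task)), _worker([this] { run(); }) {}

    ~mini_dispatcher() { stop(); }

    // Ask the worker to exit and wait for it. The caller must not hold
    // any mutex that the task itself locks, or the join can hang.
    void stop()
    {
        _running = false;
        if (_worker.joinable())
            _worker.join();
    }

private:
    void run()
    {
        while (_running)
        {
            _task();
            std::this_thread::yield();
        }
    }

    std::function<void()> _task;
    std::atomic<bool> _running{true};
    std::thread _worker;  // declared last so it starts after the members above
};

// Hypothetical owner of the callback list, loosely modelled on the report.
class mini_context
{
public:
    mini_context() : _watcher([this] { notify(); }) {}

    ~mini_context()
    {
        // Workaround order: stop the worker BEFORE taking the callbacks mutex.
        _watcher.stop();
        std::lock_guard<std::mutex> lock(_callbacks_mutex);
        _callbacks.clear();
    }

    void add_callback(std::function<void()> cb)
    {
        std::lock_guard<std::mutex> lock(_callbacks_mutex);
        _callbacks.push_back(std::move(cb));
    }

private:
    // Runs on the worker thread and needs the callbacks mutex.
    void notify()
    {
        std::lock_guard<std::mutex> lock(_callbacks_mutex);
        for (auto& cb : _callbacks)
            cb();
    }

    std::mutex _callbacks_mutex;
    std::vector<std::function<void()>> _callbacks;
    mini_dispatcher _watcher;  // declared last: its thread uses the members above
};

// Creates a context, waits for at least one callback, then destroys it.
int run_and_destroy()
{
    std::atomic<int> calls{0};
    {
        mini_context ctx;
        ctx.add_callback([&calls] { ++calls; });
        while (calls.load() == 0)
            std::this_thread::yield();
    }  // the destructor runs here and returns instead of hanging
    assert(calls.load() > 0);
    return calls.load();
}
```

`run_and_destroy()` returns once the context has been torn down cleanly. Whether the same ordering is safe inside librealsense depends on what else the real destructor and dispatcher do, so this only shows why stopping the worker first removes this particular cycle.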

@MartyG-RealSense
Collaborator

MartyG-RealSense commented May 9, 2022

Unfortunately, I do not know of any information about the effects of using a USB cable shorter than 1 meter with RealSense cameras, as USB cable references usually relate to lengths of 1 m or greater. Cable quality does matter with RealSense cameras, though, because of the high volume of data that the cameras can transmit, and the cable should be designed for data transfer rather than just device charging. Having said that, if your 0.5 m cable works well with your workaround, that indicates your cable choice is fine.

In regard to whether your workaround is 'correct', I am not involved in SDK development as I am a Support Engineer and so do not have advice that I can offer on your method. I would recommend performing long-run tests with the 5 cameras for periods such as 12 hours or longer to confirm whether your changes are stable. If they are stable after repeated successful long-run tests then the workaround is likely fine.

@MartyG-RealSense
Collaborator

Hi @v-lopez Do you require further assistance with this case, please? Thanks!

@v-lopez
Author

v-lopez commented May 16, 2022

No. If you want me to submit my patch as a PR, let me know.

@MartyG-RealSense
Collaborator

A PR would be a useful reference for other RealSense users with a similar problem and you are very welcome to submit one. It is of course completely optional though whether you do so or not. Thanks again!

@maloel
Collaborator

maloel commented Oct 25, 2023

This deadlock is now fixed, as far as we know, on the development branch. Please see #12275.
Note that this PR cannot be taken alone. It is based on many other changes (development is still in the middle of some refactoring) and it was hard to isolate so, unfortunately, it cannot be simply applied to released code.
If anyone feels like trying our latest code, I'd be interested to hear if you still see any sort of problematic deadlock.
Cheers!

meyerj pushed a commit to Intermodalics/librealsense that referenced this issue Mar 11, 2024