One or more controllers fail to start up when starting all of them simultaneously #1934
Comments
Hello @bijoua29! We will need more debug information or logs for this particular case. It is very hard to pinpoint the issue you are mentioning. Why don't you spawn all the above controllers together in a single spawner? This is the exact reason PR #1805 was implemented. You can pass multiple controllers and multiple param files to the spawner and it should be able to handle everything. You can also use this part of the code: ros2_control/controller_manager/controller_manager/launch_utils.py, lines 103 to 139 (at bb087e2).
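For illustration, a minimal sketch of such a launch file with one spawner handling several controllers; the controller names, controller_manager path, and YAML path are placeholders, not from your setup:

```python
# Sketch: one spawner process loads and activates several controllers.
# Controller names and the parameter file path are placeholders.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    single_spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=[
            # All controllers handled by one spawner process.
            "joint_state_broadcaster",
            "arm_controller",
            "gripper_controller",
            "--controller-manager", "/controller_manager",
            # One YAML holding the parameters for all of the controllers;
            # recent spawners also accept the flag multiple times.
            "--param-file", "/path/to/controllers.yaml",
        ],
        output="screen",
    )
    return LaunchDescription([single_spawner])
```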
@saikishor Yes, I understand it is hard to pinpoint this issue. I am willing to generate more debug or log data, but I will need to know exactly what. Meanwhile I will try the single spawner. Do you have an example of its usage in a launch file that I can look at? One question for my curiosity: does the single spawner generate a single service call to the controller manager, or still multiple service calls? If multiple, then I fear I might still see the issue.
We use OpaqueFunction for our LaunchDescription:

```python
return LaunchDescription(
    [SetEnvironmentVariable("ROS_LOG_DIR", launch_config.log_dir)]
    + declared_arguments
    + [OpaqueFunction(function=launch_setup)]
)
```

so I don't think I can use the launch utilities in launch_utils.py, but I can do the same thing as done in generate_controllers_spawner_launch_description_from_dict() in launch_setup.
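Roughly, what I have in mind is something like this sketch (controller names and the param file path are placeholders); launch_setup just builds the same single spawner node and returns it as a list of actions:

```python
# Sketch: create the single spawner inside launch_setup so it fits an
# OpaqueFunction-based launch file. Names and paths are placeholders.
from launch import LaunchDescription
from launch.actions import OpaqueFunction
from launch_ros.actions import Node


def launch_setup(context, *args, **kwargs):
    spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=["arm_controller", "gripper_controller",
                   "--controller-manager", "/controller_manager",
                   "--param-file", "/path/to/controllers.yaml"],
        output="screen",
    )
    # OpaqueFunction expects a list of actions to add to the launch description.
    return [spawner]


def generate_launch_description():
    return LaunchDescription([OpaqueFunction(function=launch_setup)])
```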
Ok, I did take a look at the spawner code and it does generate a separate service call for each controller. However, since it is a single spawner node, I think it will be more robust, as there is only DDS discovery required for a single node. Whereas with multiple spawners, each one had to have its service client's response reader discovered by the controller manager, which led to problems under load. So I am hoping this will lead to a more robust startup for us.
Awesome. Glad to hear that the single spawner does help. If you want things to be done in a single service call, pass `--activate-as-group`. Please let us know if this works for you!
@saikishor Thanks for the response. That's nice that you can do it in a single service call. The spawner has changed (for the better) quite a bit since I last looked at it. There are a few things for me to try. Since I have to test this in the target environment, and make changes manually there, the testing will take quite some time. I will let you know my results. I'm really hopeful this will work for me.
@bijoua29 We thought of these kinds of use cases in the first place and then started improving it. It would be really great to see that it helps someone, haha.
@saikishor So some of our controllers are meant to be loaded and started as inactive while the rest should be started as active. How would I start some of the controllers as inactive when using a single spawner?
You need to have separate spawners, one for each group of controllers with the same desired initial state.
That's what I thought. I started doing it that way.
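Roughly, the split looks like this sketch (controller names and the param file path are placeholders; `--activate-as-group` assumes a spawner version that supports it):

```python
# Sketch: two spawners, one for controllers that should start active and one
# for controllers that should only be loaded and configured (inactive).
# All controller names and the YAML path are placeholders.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    active_spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=[
            "joint_state_broadcaster",
            "arm_controller",
            "--controller-manager", "/controller_manager",
            "--param-file", "/path/to/controllers.yaml",
            "--activate-as-group",  # activate these in a single service call
        ],
        output="screen",
    )
    inactive_spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=[
            "gripper_controller",
            "recovery_controller",
            "--controller-manager", "/controller_manager",
            "--param-file", "/path/to/controllers.yaml",
            "--inactive",  # load and configure only, do not activate
        ],
        output="screen",
    )
    return LaunchDescription([active_spawner, inactive_spawner])
```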
So I converted the controller startup to two spawners, one for the controllers to start active and another for those to start inactive. Unfortunately, I still see failures starting up. The success rate is either the same as before or maybe even slightly worse. There was one benefit from the change to a single spawner: the startup time was slightly shorter. There seem to be different failure modes. Here are examples of them:
At this point, I am looking for guidance on how to further instrument the controller startup to determine where the problem is.
@bijoua29 The logs you are providing are not of much help; we need more information.
This means something has happened and the main information is before it.
my_controller doesn't exist in your spawner args, so we cannot help you much in this case. In my opinion, the dictionary approach is much more robust than the earlier one. If you find a solution, feel free to open a PR. Thank you!
Yeah, this is just me trying to scrub any of our user-specific information from the logs
I'm actually following the dictionary approach, but explicitly in our launch file, since we use OpaqueFunction rather than LaunchDescription directly and so can't use the utility function in launch_utils.py. I guess you have no more guidance on how to get better log information. I have tried enabling the service introspection feature, but that didn't yield any additional information. I enabled the warning log level on the spawner and I get a "Failed to get result" message, indicating the service response is not getting back to the spawner within the 10 s timeout (I have increased this to 20 s with no improvement). My assumption based on this error message is that the service server in the controller manager is not discovering the spawner's service client reader for the service response, so it can't send the response. I will try to see if I can enable any rmw-based logging to validate this assumption, although that might be hard to do. I also have an idea for decreasing the probability of this issue with a change to the spawner, so I might submit a PR if that works.
@bijoua29 Can you share an example with the OpaqueFunction?
@saikishor I'm not sure there would be a clean way to extend it, as you would have to provide two functions: one for the launch description and another for the matching launch setup. Anyway, here are a couple of links on using OpaqueFunction: https://answers.ros.org/question/396345/ros2-launch-file-how-to-convert-launchargument-to-string/
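The gist of the linked answer, as a minimal sketch (the launch argument name here is a placeholder): inside launch_setup you can resolve a launch argument to a plain Python string with perform().

```python
# Minimal OpaqueFunction example in the spirit of the linked answer:
# resolve a launch argument to a plain string via perform(context).
# The argument name "controller_params_file" is a placeholder.
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument, LogInfo, OpaqueFunction
from launch.substitutions import LaunchConfiguration


def launch_setup(context, *args, **kwargs):
    # perform(context) evaluates the substitution into a concrete string,
    # which can then be used in plain Python logic (dicts, f-strings, ...).
    params_file = LaunchConfiguration("controller_params_file").perform(context)
    return [LogInfo(msg=f"Using controller params: {params_file}")]


def generate_launch_description():
    return LaunchDescription(
        [
            DeclareLaunchArgument("controller_params_file", default_value=""),
            OpaqueFunction(function=launch_setup),
        ]
    )
```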
@saikishor I submitted #1947, as it reduces the failure rate of controller startup significantly. It is still not foolproof, but it helps. The cumulative effect of using a single spawner, activating as a group, and the PR I submitted improved the success rate significantly. It also reduced the startup time significantly. Since it still isn't foolproof, I'm still pursuing the rmw angle.
Describe the bug
When starting several controllers simultaneously using the spawner, one or more of the controllers fail to start up
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All controllers come up without error
Screenshots
N/A
Environment (please complete the following information):
Additional context
The number of controllers that don't load is random. Out of our 16 controllers, we have 7 instances of a particular type of custom controller. Anecdotally, it is usually one or more of these controllers that don't load.
I can't really produce a minimal example. In fact, we only see this on our target hardware and believe it is related to CPU load.
I can't repro this on my laptop as it is much more powerful than the target hardware. Additionally, I have tried to preload my laptop with additional load using 'stress', but I still couldn't reproduce the issue, so the trigger may be something else, e.g., disk I/O.
I don't understand the error message as it indicates the controller is already loaded but it isn't.
Note: we religiously update our software to the latest Rolling sync every month.
Anecdotally, there was some improvement when we went from ros2_control version 4.18 (~10% success rate) to 4.20 (~40-50% success rate).
This issue seems to have cropped up only in the last couple of months; a few months ago, failures were fairly rare. The above error message is also fairly new to us, as previously the occasional failures just showed the controller "process has died" message. I realize there has been a fair amount of work in the spawner area recently.
This is a very serious issue for us as it takes several tries for our SQA to start our software stack when they are testing.