One or more controllers fail to start up when starting all of them simultaneously #1934

Closed
bijoua29 opened this issue Dec 12, 2024 · 16 comments · Fixed by #1949
Labels
bug

@bijoua29
Contributor

bijoua29 commented Dec 12, 2024

Describe the bug
When starting several controllers simultaneously using the spawner, one or more of the controllers fail to start up

To Reproduce
Steps to reproduce the behavior:

  1. Start our launch file with several controllers (in our case, we start 16 controllers)
  2. See the following error:
[ros2_control_node-2] [ERROR] [1733859175.169085071] [controller_manager]: A controller named 'my_controller_1' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.491164152] [controller_manager]: A controller named 'my_controller_2' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.491519034] [controller_manager]: A controller named 'my_controller_3' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.492405566] [controller_manager]: A controller named 'my_controller_1' was already loaded inside the controller manager
[my_controller_1_spawner-13] [FATAL] [1733859175.539516796] [spawner_my_controller_1]: Failed loading controller my_controller_1
[my_controller_2_spawner-9] [FATAL] [1733859175.577969123] [spawner_my_controller_2]: Failed loading controller my_controller_2
[my_controller_3_spawner-11] [FATAL] [1733859175.578179280] [spawner_my_controller_3]: Failed loading controller my_controller_3
[ERROR] [my_controller_3_spawner-11]: process has died [pid 3598950, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_3 --param-file /tmp/my_controller_3.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_3_spawner'].
[ERROR] [my_controller_2_spawner-9]: process has died [pid 3598944, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_2 --param-file /tmp/my_controller_2.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_2_spawner'].
[ERROR] [my_controller_1_spawner-13]: process has died [pid 3598954, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_1 --param-file /tmp/my_controller_1.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_1_spawner'].

Expected behavior
All controllers come up without error

Screenshots
N/A

Environment (please complete the following information):

  • OS: Ubuntu 24.04
  • Version: rolling sync from 2024-11-26
  • ros2_control version: 4.20.0

Additional context
The number of controllers that don't load is random. Out of our 16 controllers, we have 7 instances of a particular type of custom controller. Anecdotally, it is usually one or more of these controllers that don't load.

I can't really produce a minimal example. In fact, we only see this on our target hardware and believe it is related to CPU load.

I can't repro this on my laptop as it is much more powerful than the target hardware. Additionally, I have tried to preload my laptop with additional load using 'stress', but I still couldn't repro the issue, so it may be something else, e.g. disk I/O.

I don't understand the error message as it indicates the controller is already loaded but it isn't.

Note: we religiously update our software to the latest rolling sync every month.

Anecdotally, it seems there was some improvement when we went from ros2_control version 4.18 (~10% success rate) to 4.20 (~40-50% success rate).

It seems this issue has cropped up only in the last couple of months. A few months ago, failures were fairly rare. The above error message is also fairly new to us; previously, the occasional failures just had the controller "process has died" message. I realize there has been a fair amount of work in the spawner area recently.

This is a very serious issue for us as it takes several tries for our SQA to start our software stack when they are testing.

@bijoua29 bijoua29 added the bug label Dec 12, 2024
@saikishor
Member

saikishor commented Dec 12, 2024

Hello @bijoua29!

We will need more debug information or logs for this particular case. It is very hard to pinpoint the issue you are mentioning.

Why don't you spawn all the above controllers together in a single spawner? This is the exact reason PR #1805 was implemented. You can pass multiple controllers and multiple param files to the spawner, and it should be able to handle everything.
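For example, a single spawner call could look something like this (the controller names and param-file paths here are just the placeholders from your logs):

ros2 run controller_manager spawner my_controller_1 my_controller_2 my_controller_3 \
  --param-file /tmp/my_controller_1.yaml \
  --param-file /tmp/my_controller_2.yaml \
  --param-file /tmp/my_controller_3.yaml \
  --controller-manager-timeout 40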

You can also use this part of the code:

def generate_controllers_spawner_launch_description_from_dict(
    controller_info_dict: dict, extra_spawner_args=[]
):
    """
    Generate launch description for loading a controller using spawner.

    controller_info_dict: dict
        A dictionary with the following info:
        - controller_name: str
            The name of the controller to load as the key
        - controller_params_file: str or list or None
            The path to the controller parameter file or a list of paths to multiple
            parameter files, or None if no parameter file is needed, as the value of the key.
            If a list is passed, the controller parameters will be overloaded in same order
    extra_spawner_args: list
        A list of extra arguments to pass to the controller spawner
    """
    if not type(controller_info_dict) is dict:
        raise ValueError(f"Invalid controller_info_dict type parsed {controller_info_dict}")

    controller_names = controller_info_dict.keys()
    controller_params_files = []
    for controller_name in controller_names:
        controller_params_file = controller_info_dict[controller_name]
        if controller_params_file:
            if type(controller_params_file) is list:
                controller_params_files.extend(controller_params_file)
            elif type(controller_params_file) is str:
                controller_params_files.append(controller_params_file)
            else:
                raise ValueError(
                    f"Invalid controller_params_file type parsed in the dict {controller_params_file}"
                )

    return generate_controllers_spawner_launch_description(
        controller_names=controller_names,
        controller_params_files=controller_params_files,
        extra_spawner_args=extra_spawner_args,
    )
Just pass a dictionary of elements; it should work better and be more robust, in my opinion. Take a look and let us know.
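A rough sketch of how this could be used from a launch file (the controller names and param-file paths are placeholders, and the import path is from memory, so please double-check it):

from controller_manager.launch_utils import (
    generate_controllers_spawner_launch_description_from_dict,
)


def generate_launch_description():
    # Placeholder controller names / param files; a single spawner loads all of them.
    controller_info = {
        "my_controller_1": "/tmp/my_controller_1.yaml",
        "my_controller_2": "/tmp/my_controller_2.yaml",
        "my_controller_3": None,  # no parameter file needed for this one
    }
    return generate_controllers_spawner_launch_description_from_dict(
        controller_info,
        extra_spawner_args=["--controller-manager-timeout", "40"],
    )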

@bijoua29
Contributor Author

bijoua29 commented Dec 12, 2024

@saikishor Yes, I understand it is hard to pinpoint this issue. I am willing to generate more debug or log data but I will need to know exactly what. Meanwhile I will try the single spawner. Do you have an example of its usage in a launch file that I can look at?

One question for my curiosity: does the single spawner generate a single service call to the controller manager, or still multiple service calls? If multiple, then I fear I might still see the issue.

@bijoua29
Contributor Author

bijoua29 commented Dec 12, 2024

We use OpaqueFunction for our LaunchDescription:

return LaunchDescription(
    [SetEnvironmentVariable("ROS_LOG_DIR", launch_config.log_dir)]
    + declared_arguments
    + [OpaqueFunction(function=launch_setup)]
)

so I don't think I can use the launch utilities in launch_utils.py, but I can do the same thing as done in generate_controllers_spawner_launch_description_from_dict() inside launch_setup.
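Roughly, something like this inside launch_setup (a sketch with placeholder controller names and param-file paths; it just rebuilds the single spawner Node the same way the utility function does):

from launch_ros.actions import Node


def launch_setup(context, *args, **kwargs):
    # Placeholder controller names / param files (same dict idea as in launch_utils.py).
    controller_info = {
        "my_controller_1": "/tmp/my_controller_1.yaml",
        "my_controller_2": "/tmp/my_controller_2.yaml",
    }
    spawner_args = list(controller_info.keys())
    for params_file in controller_info.values():
        if params_file:
            spawner_args += ["--param-file", params_file]
    spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=spawner_args + ["--controller-manager-timeout", "40"],
        output="screen",
    )
    return [spawner]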

@bijoua29
Contributor Author

Ok, I did take a look at the spawner code and it does generate a separate service call for each controller. However, since it is a single spawner node, I think it will be more robust, as DDS discovery is only required for a single node. Whereas with multiple spawners, each one had to have its service client's subscriber discovered by the controller manager, which led to problems under load. So I am hoping this will lead to a more robust startup for us.

@saikishor
Member

Ok, I did take a look at the spawner code and it does generate a separate service call for each controller. However, since it is a single spawner node, I think it will be more robust, as DDS discovery is only required for a single node. Whereas with multiple spawners, each one had to have its service client's subscriber discovered by the controller manager, which led to problems under load. So I am hoping this will lead to a more robust startup for us.

Awesome. Glad to hear that the single spawner does help. If you want things to be done in a single service call, pass the --activate-as-group parameter in the extra args.
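For example, with the dictionary helper from above, that would just be another entry in the extra spawner args (placeholder dict as before):

generate_controllers_spawner_launch_description_from_dict(
    controller_info,
    extra_spawner_args=["--activate-as-group"],
)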

Please let us know if this works for you.

@bijoua29
Contributor Author

@saikishor Thanks for the response. That's nice you can do it in a single service call. Spawner has changed (for the better) quite a bit since I last looked at it. There are a few things for me to try. Since I have to test this in the target environment, and make changes manually there, the testing will take quite some time. I will let you know my results. I'm really hopeful this will work for me.

@saikishor
Member

saikishor commented Dec 12, 2024

@bijoua29 We thought of these kinds of use cases in the first place and then started improving it. It would be really great to see that it helps someone hahhaa

@bijoua29
Contributor Author

@saikishor So some of our controllers are meant to be loaded and started as inactive while the rest should be started as active. How would I start some of the controllers as inactive when using a single spawner?

@saikishor
Member

@saikishor So some of our controllers are meant to be loaded and started as inactive while the rest should be started as active. How would I start some of the controllers as inactive when using a single spawner?

You need to have separate single spawners for each group of controllers with the same desired initial state.
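For example, two spawner Nodes in the launch file, one per desired initial state (all names here are placeholders; --inactive is the flag your logs already show):

from launch_ros.actions import Node

# Spawner for the controllers that should end up active.
active_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=[
        "active_controller_1", "active_controller_2",
        "--param-file", "/tmp/active_controllers.yaml",
        "--activate-as-group",
    ],
)

# Spawner for the controllers that should only be loaded and configured (inactive).
inactive_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=[
        "inactive_controller_1", "inactive_controller_2",
        "--param-file", "/tmp/inactive_controllers.yaml",
        "--inactive",
    ],
)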

@bijoua29
Contributor Author

That's what I thought. I started doing it that way.

@bijoua29
Contributor Author

bijoua29 commented Dec 13, 2024

So I converted the controller startup to two spawners: one for the controllers to start active and another for those to start inactive. Unfortunately, I still see failures starting up. The success rate is either the same as before or maybe even slightly worse. There was one benefit from the change to single spawners: the startup time was slightly shorter.

There seem to be different failure modes. Here are examples of them:

Sometimes it fails silently: the spawner doesn't finish and there are no error messages. That is, the following message is not seen for the spawner:

[INFO] [spawner-7]: process has finished cleanly [pid 8031]

In this scenario, none of the controllers that were started by the spawner are loaded. In other cases, there are explicit errors:
[ros2_control_node-2] [ERROR] [1734049001.249673796] [controller_manager]: A controller named 'yyy_controller' was already loaded inside the controller manager
[spawner-7] [FATAL] [1734049001.298668402] [spawner_xxx_controller]: Failed loading controller my_controller
[ERROR] [spawner-7]: process has died [pid 11553, exit code 1, cmd '/home/user/workspace/ws/install/controller_manager/lib/controller_manager/spawner xxx_controller yyy_controller zzz_controller aaa_controller bbb_controller --controller-manager controller_manager --param-file /tmp/params_controller.yaml --controller-manager-timeout 40 --inactive --ros-args --log-level ERROR --log-file-name active_controllers_spawner --ros-args'].
[ros2_control_node-2] [ERROR] [1734049383.909486794] [controller_manager]: Aborting, no controller is switched! (::STRICT switch)
[spawner-6] [ERROR] [1734049384.001362490] [spawner_joint_state_broadcaster]: Failed to activate controller
[ERROR] [spawner-6]: process has died [pid 22629, exit code 1, cmd '/home/user/workspace/ws/install/controller_manager/lib/controller_manager/spawner joint_state_broadcaster controller_1 controller_2 controller_3 controller_4 controller_5 controller_6 controller_7 controller_a controller_b controller_c --controller-manager controller_manager --param-file params_1.yaml --param-file params_2yaml --param-file params_3.yaml --param-file params_4.yaml --param-file params_5.yaml --param-file params_6.yaml --param-file params_7.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name active_controllers_spawner --ros-args'].

At this point, I am looking for guidance on how to further instrument the controller startup to determine where the problem is.

@saikishor
Member

@bijoua29 The logs you are providing are not of much help; we need more information.

[ros2_control_node-2] [ERROR] [1734049383.909486794] [controller_manager]: Aborting, no controller is switched! (::STRICT switch)

This means something has happened earlier, and the main information is before it.

[spawner-7] [FATAL] [1734049001.298668402] [spawner_xxx_controller]: Failed loading controller my_controller

my_controller doesn't exist in your spawner args.

We cannot help you much in this case. In my opinion, the dictionary approach is much more robust than the earlier one. If you find a solution, feel free to open a PR.

Thank you!

@bijoua29
Contributor Author

my_controller doesn't exist in your spawner args.

Yeah, this is just me trying to scrub any of our user-specific information from the logs.

We cannot help you much in this case. In my opinion, the dictionary approach is much robust than the earlier one. If you find a solution, feel free to open a PR.

I'm actually following the dictionary approach, but explicitly in our launch file, as we are using OpaqueFunction and not LaunchDescription directly and so can't use the utility function in launch_utils.py.

I guess you have no more guidance on how to get better log information. I have tried enabling the service introspection feature but that didn't yield any additional information.

I enabled warning log level on the spawner and I get a "Failed to get result" message, indicating the service response is not getting back to the spawner within the timeout of 10s (I have increased this to 20s with no improvement). My assumption based on this error message is that the service server in the controller manager is not discovering the spawner's service client reader for the service response, and so it can't send the response. I will try to see if I can enable any rmw-based logging to validate this assumption, although that might be hard to do.

I also have an idea of decreasing the probability of this issue with a change to the spawner and so I might submit a PR if that works.

@saikishor
Member

@bijoua29 Can you share an example with the OpaqueFunction? I can check if I can extend the functionality to also work with that.

@bijoua29
Contributor Author

@saikishor I'm not sure there would be a clean way to extend it, as you would have to provide two functions: one for the launch description and another for the matching launch setup. Anyway, here are a couple of links for the OpaqueFunction pattern:

https://answers.ros.org/question/397123/how-to-access-the-runtime-value-of-a-launchconfiguration-instance-within-custom-launch-code-injected-via-an-opaquefunction-in-ros2/

https://answers.ros.org/question/396345/ros2-launch-file-how-to-convert-launchargument-to-string/

@bijoua29
Contributor Author

@saikishor I submitted #1947, as it reduces controller startup failures significantly. It is still not foolproof, but it helps. The cumulative effect of using a single spawner, activating as a group, and the PR I submitted improved the success rate significantly. It also reduced the startup time significantly. As I said, it still isn't foolproof, so I'm still pursuing the rmw angle.
