One or more controllers fail to start up when starting all of them simultaneously #1934

Closed
bijoua29 opened this issue Dec 12, 2024 · 16 comments · Fixed by #1949
Labels
bug

@bijoua29
Contributor

bijoua29 commented Dec 12, 2024

Describe the bug
When starting several controllers simultaneously using the spawner, one or more of the controllers fail to start up

To Reproduce
Steps to reproduce the behavior:

  1. Start our launch file with several controllers (in our case, we start 16 controllers)
  2. See the following error:
[ros2_control_node-2] [ERROR] [1733859175.169085071] [controller_manager]: A controller named 'my_controller_1' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.491164152] [controller_manager]: A controller named 'my_controller_2' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.491519034] [controller_manager]: A controller named 'my_controller_3' was already loaded inside the controller manager
[ros2_control_node-2] [ERROR] [1733859175.492405566] [controller_manager]: A controller named 'my_controller_1' was already loaded inside the controller manager
[my_controller_1_spawner-13] [FATAL] [1733859175.539516796] [spawner_my_controller_1]: Failed loading controller my_controller_1
[my_controller_2_spawner-9] [FATAL] [1733859175.577969123] [spawner_my_controller_2]: Failed loading controller my_controller_2
[my_controller_3_spawner-11] [FATAL] [1733859175.578179280] [spawner_my_controller_3]: Failed loading controller my_controller_3
[ERROR] [my_controller_3_spawner-11]: process has died [pid 3598950, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_3 --param-file /tmp/my_controller_3.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_3_spawner'].
[ERROR] [my_controller_2_spawner-9]: process has died [pid 3598944, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_2 --param-file /tmp/my_controller_2.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_2_spawner'].
[ERROR] [my_controller_1_spawner-13]: process has died [pid 3598954, exit code 1, cmd 'ros2 run controller_manager spawner my_controller_1 --param-file /tmp/my_controller_1.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name my_controller_1_spawner'].

Expected behavior
All controllers come up without error

Screenshots
N/A

Environment (please complete the following information):

  • OS: Ubuntu 24.04
  • Version: rolling sync from 2024-11-26
  • ros2_control version: 4.20.0

Additional context
The number of controllers that don't load is random. Out of our 16 controllers, we have 7 instances of a particular type of custom controller. Anecdotally, it is usually one or more of these controllers that don't load.

I can't really produce a minimal example. In fact, we only see this on our target hardware and believe it is related to CPU load.

I can't repro this on my laptop as it is much more powerful than the target hardware. Additionally, I have tried to preload my laptop with additional load using 'stress', but I still couldn't repro the issue, so it may be something else, e.g. disk I/O.

I don't understand the error message as it indicates the controller is already loaded but it isn't.

Note: we religiously update our software to the latest rolling sync every month.

Anecdotally, it seems there was some improvement when we went from ros2_control version 4.18 (~10% success rate) to 4.20 (~40-50% success rate).

It seems this issue has cropped up only in the last couple of months. A few months ago, failures were fairly rare. The above error message is also fairly new to us; previously, the occasional failures just had the controller "process has died" message. I realize there has been a fair amount of work in the spawner area recently.

This is a very serious issue for us as it takes several tries for our SQA to start our software stack when they are testing.

@bijoua29 bijoua29 added the bug label Dec 12, 2024
@saikishor
Member

saikishor commented Dec 12, 2024

Hello @bijoua29!

We will need more debug information or logs for this particular case. It is very hard to pinpoint the issue you are mentioning.

Why don't you spawn all the above controllers together in a single spawner? This is the exact reason PR #1805 was implemented. You can pass multiple controllers and multiple param files to the spawner, and it should be able to handle everything.
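For example, a single spawner call could look something like this (the controller names and param-file paths here are just the placeholders from your logs):

ros2 run controller_manager spawner my_controller_1 my_controller_2 my_controller_3 \
  --param-file /tmp/my_controller_1.yaml \
  --param-file /tmp/my_controller_2.yaml \
  --param-file /tmp/my_controller_3.yaml \
  --controller-manager-timeout 40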

You can also use this part of the code:

def generate_controllers_spawner_launch_description_from_dict(
    controller_info_dict: dict, extra_spawner_args=[]
):
    """
    Generate launch description for loading a controller using spawner.

    controller_info_dict: dict
        A dictionary with the following info:
        - controller_name: str
            The name of the controller to load as the key
        - controller_params_file: str or list or None
            The path to the controller parameter file or a list of paths to multiple
            parameter files, or None if no parameter file is needed, as the value of the key.
            If a list is passed, the controller parameters will be overloaded in same order
    extra_spawner_args: list
        A list of extra arguments to pass to the controller spawner
    """
    if not type(controller_info_dict) is dict:
        raise ValueError(f"Invalid controller_info_dict type parsed {controller_info_dict}")

    controller_names = controller_info_dict.keys()
    controller_params_files = []
    for controller_name in controller_names:
        controller_params_file = controller_info_dict[controller_name]
        if controller_params_file:
            if type(controller_params_file) is list:
                controller_params_files.extend(controller_params_file)
            elif type(controller_params_file) is str:
                controller_params_files.append(controller_params_file)
            else:
                raise ValueError(
                    f"Invalid controller_params_file type parsed in the dict {controller_params_file}"
                )

    return generate_controllers_spawner_launch_description(
        controller_names=controller_names,
        controller_params_files=controller_params_files,
        extra_spawner_args=extra_spawner_args,
    )
Just pass a dictionary of elements; it should work better and be more robust, in my opinion. Take a look and let us know.
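A rough sketch of how this could be used from a launch file (the controller names and param-file paths are placeholders, and the import path is from memory, so please double-check it):

from controller_manager.launch_utils import (
    generate_controllers_spawner_launch_description_from_dict,
)


def generate_launch_description():
    # Placeholder controller names / param files; a single spawner loads all of them.
    controller_info = {
        "my_controller_1": "/tmp/my_controller_1.yaml",
        "my_controller_2": "/tmp/my_controller_2.yaml",
        "my_controller_3": None,  # no parameter file needed for this one
    }
    return generate_controllers_spawner_launch_description_from_dict(
        controller_info,
        extra_spawner_args=["--controller-manager-timeout", "40"],
    )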

@bijoua29
Contributor Author

bijoua29 commented Dec 12, 2024

@saikishor Yes, I understand it is hard to pinpoint this issue. I am willing to generate more debug or log data but I will need to know exactly what. Meanwhile I will try the single spawner. Do you have an example of its usage in a launch file that I can look at?

One question for my curiosity: does the single spawner generate a single service call to the controller manager, or still multiple service calls? If multiple, then I fear I might still see the issue.

@bijoua29
Contributor Author

bijoua29 commented Dec 12, 2024

We use OpaqueFunction for our LaunchDescription:

return LaunchDescription(
    [SetEnvironmentVariable("ROS_LOG_DIR", launch_config.log_dir)]
    + declared_arguments
    + [OpaqueFunction(function=launch_setup)]
)

so I don't think I can use the launch utilities in launch_utils.py, but I can do the same thing as done in generate_controllers_spawner_launch_description_from_dict() inside launch_setup.
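Roughly, something like this inside launch_setup (a sketch with placeholder controller names and param-file paths; it just rebuilds the single spawner Node the same way the utility function does):

from launch_ros.actions import Node


def launch_setup(context, *args, **kwargs):
    # Placeholder controller names / param files (same dict idea as in launch_utils.py).
    controller_info = {
        "my_controller_1": "/tmp/my_controller_1.yaml",
        "my_controller_2": "/tmp/my_controller_2.yaml",
    }
    spawner_args = list(controller_info.keys())
    for params_file in controller_info.values():
        if params_file:
            spawner_args += ["--param-file", params_file]
    spawner = Node(
        package="controller_manager",
        executable="spawner",
        arguments=spawner_args + ["--controller-manager-timeout", "40"],
        output="screen",
    )
    return [spawner]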

@bijoua29
Contributor Author

Ok, I did take a look at the spawner code and it does generate a separate service call for each controller. However, since it is a single spawner node, I think it will be more robust, as DDS discovery is only required for a single node. Whereas with multiple spawners, each one had to have its service client's subscriber discovered by the controller manager, which led to problems under load. So I am hoping this will lead to a more robust startup for us.

@saikishor
Member

Ok, I did take a look at the spawner code and it does generate a separate service call for each controller. However, since it is a single spawner node, I think it will be more robust, as DDS discovery is only required for a single node. Whereas with multiple spawners, each one had to have its service client's subscriber discovered by the controller manager, which led to problems under load. So I am hoping this will lead to a more robust startup for us.

Awesome. Glad to hear that the single spawner does help. If you want things to be done in a single service call, pass the --activate-as-group parameter in the extra args.
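For example, with the dictionary helper from above, that would just be another entry in the extra spawner args (placeholder dict as before):

generate_controllers_spawner_launch_description_from_dict(
    controller_info,
    extra_spawner_args=["--activate-as-group"],
)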

Please let us know if this works for you.

@bijoua29
Contributor Author

@saikishor Thanks for the response. That's nice you can do it in a single service call. Spawner has changed (for the better) quite a bit since I last looked at it. There are a few things for me to try. Since I have to test this in the target environment, and make changes manually there, the testing will take quite some time. I will let you know my results. I'm really hopeful this will work for me.

@saikishor
Member

saikishor commented Dec 12, 2024

@bijoua29 We thought of these kinds of use cases in the first place and then started improving it. It would be really great to see that it helps someone hahhaa

@bijoua29
Contributor Author

@saikishor So some of our controllers are meant to be loaded and started as inactive while the rest should be started as active. How would I start some of the controllers as inactive when using a single spawner?

@saikishor
Member

@saikishor So some of our controllers are meant to be loaded and started as inactive while the rest should be started as active. How would I start some of the controllers as inactive when using a single spawner?

You need to have separate single spawners for each group of controllers with the same desired initial state.
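For example, two spawner Nodes in the launch file, one per desired initial state (all names here are placeholders; --inactive is the flag your logs already show):

from launch_ros.actions import Node

# Spawner for the controllers that should end up active.
active_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=[
        "active_controller_1", "active_controller_2",
        "--param-file", "/tmp/active_controllers.yaml",
        "--activate-as-group",
    ],
)

# Spawner for the controllers that should only be loaded and configured (inactive).
inactive_spawner = Node(
    package="controller_manager",
    executable="spawner",
    arguments=[
        "inactive_controller_1", "inactive_controller_2",
        "--param-file", "/tmp/inactive_controllers.yaml",
        "--inactive",
    ],
)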

@bijoua29
Contributor Author

That's what I thought. I started doing it that way.

@bijoua29
Contributor Author

bijoua29 commented Dec 13, 2024

So I converted the controller startup to two spawners: one for the controllers to start active and another for those to start inactive. Unfortunately, I still see failures starting up. The success rate is either the same as before or maybe even slightly worse. There was one benefit from the change to single spawners: the startup time was slightly shorter.

There seem to be different failure modes. Here are examples of them:

Sometimes it fails silently: the spawner doesn't finish and there are no error messages. That is, the following message is not seen for the spawner:

[INFO] [spawner-7]: process has finished cleanly [pid 8031]

In this scenario, none of the controllers that were started by the spawner are loaded. In other cases, there are explicit errors:
[ros2_control_node-2] [ERROR] [1734049001.249673796] [controller_manager]: A controller named 'yyy_controller' was already loaded inside the controller manager
[spawner-7] [FATAL] [1734049001.298668402] [spawner_xxx_controller]: Failed loading controller my_controller
[ERROR] [spawner-7]: process has died [pid 11553, exit code 1, cmd '/home/user/workspace/ws/install/controller_manager/lib/controller_manager/spawner xxx_controller yyy_controller zzz_controller aaa_controller bbb_controller --controller-manager controller_manager --param-file /tmp/params_controller.yaml --controller-manager-timeout 40 --inactive --ros-args --log-level ERROR --log-file-name active_controllers_spawner --ros-args'].
[ros2_control_node-2] [ERROR] [1734049383.909486794] [controller_manager]: Aborting, no controller is switched! (::STRICT switch)
[spawner-6] [ERROR] [1734049384.001362490] [spawner_joint_state_broadcaster]: Failed to activate controller
[ERROR] [spawner-6]: process has died [pid 22629, exit code 1, cmd '/home/user/workspace/ws/install/controller_manager/lib/controller_manager/spawner joint_state_broadcaster controller_1 controller_2 controller_3 controller_4 controller_5 controller_6 controller_7 controller_a controller_b controller_c --controller-manager controller_manager --param-file params_1.yaml --param-file params_2yaml --param-file params_3.yaml --param-file params_4.yaml --param-file params_5.yaml --param-file params_6.yaml --param-file params_7.yaml --controller-manager-timeout 40 --ros-args --log-level ERROR --log-file-name active_controllers_spawner --ros-args'].

At this point, I am looking for guidance on how to further instrument the controller startup to determine where the problem is.

@saikishor
Member

@bijoua29 The logs you are providing are not of much help; we need more information.

[ros2_control_node-2] [ERROR] [1734049383.909486794] [controller_manager]: Aborting, no controller is switched! (::STRICT switch)

This means something has happened earlier, and the main information is before it.

[spawner-7] [FATAL] [1734049001.298668402] [spawner_xxx_controller]: Failed loading controller my_controller

my_controller doesn't exist in your spawner args.

We cannot help you much in this case. In my opinion, the dictionary approach is much more robust than the earlier one. If you find a solution, feel free to open a PR.

Thank you!

@bijoua29
Contributor Author

my_controller doesn't exist in your spawner args.

Yeah, this is just me trying to scrub any of our user-specific information from the logs.

We cannot help you much in this case. In my opinion, the dictionary approach is much robust than the earlier one. If you find a solution, feel free to open a PR.

I'm actually following the dictionary approach, but explicitly in our launch file, as we are using OpaqueFunction and not LaunchDescription directly and so can't use the utility function in launch_utils.py.

I guess you have no more guidance on how to get better log information. I have tried enabling the service introspection feature but that didn't yield any additional information.

I enabled warning log level on the spawner and I get a "Failed to get result" message, indicating the service response is not getting back to the spawner within the timeout of 10s (I have increased this to 20s with no improvement). My assumption based on this error message is that the service server in the controller manager is not discovering the spawner's service client reader for the service response, and so it can't send the response. I will try to see if I can enable any rmw-based logging to validate this assumption, although that might be hard to do.

I also have an idea of decreasing the probability of this issue with a change to the spawner and so I might submit a PR if that works.

@saikishor
Member

@bijoua29 Can you share an example with the OpaqueFunction? I can check if I can extend the functionality to also work with that.

@bijoua29
Contributor Author

@saikishor I'm not sure there would be a clean way to extend it, as you would have to provide two functions: one for the launch description and another for the matching launch setup. Anyway, here are a couple of links for the OpaqueFunction pattern:

https://answers.ros.org/question/397123/how-to-access-the-runtime-value-of-a-launchconfiguration-instance-within-custom-launch-code-injected-via-an-opaquefunction-in-ros2/

https://answers.ros.org/question/396345/ros2-launch-file-how-to-convert-launchargument-to-string/

@bijoua29
Contributor Author

@saikishor I submitted #1947, as it reduces controller startup failures significantly. It is still not foolproof, but it helps. The cumulative effect of using a single spawner, activating as a group, and the PR I submitted improved the success rate significantly. It also reduced the startup time significantly. As I said, it still isn't foolproof, so I'm still pursuing the rmw angle.
