Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTActionNode does not abort #1652

Closed
maxlein opened this issue Apr 24, 2020 · 27 comments
Closed

BTActionNode does not abort #1652

maxlein opened this issue Apr 24, 2020 · 27 comments

Comments

@maxlein
Copy link
Contributor

maxlein commented Apr 24, 2020

I am wondering about these lines here:
https://github.com/ros-planning/navigation2/blob/b01810bcf5da97620afa53ec047f54f34c80c3e1/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L221-L223

If the goal is aborted, the result is veiled as it would have never reached the client and thus abort() won't get called:

https://github.com/ros-planning/navigation2/blob/b01810bcf5da97620afa53ec047f54f34c80c3e1/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L160-L165

Maybe someone can explain the logic behind that, I don't get it.
It took me a while to see that, because I was thinking the action service itself is somehow misbehaving on network layer.

If I remove the if clause, everything works as expected when for example the controller is aborting.

@SteveMacenski
Copy link
Member

That is a valid point. When I designed this, I was thinking that the action servers only abort when preempted. This check is added because when a server is preempted (like when a controller gets a new path from the global planner at 1hz) the return code is abort.

I think this is a good bug to consider. How do you propose dealing with preemption to also differ from aborted because the server itself decided to fail?

@maxlein
Copy link
Contributor Author

maxlein commented Apr 27, 2020

Well I think the server has to tell why it aborted, otherwise it's just guessing and assuming on the client side.
I would just replace the action result std_msgs/Empty result with something like an error code.

@SteveMacenski
Copy link
Member

PR and discussions for implementation are welcome.

@maxlein
Copy link
Contributor Author

maxlein commented Apr 28, 2020

Looking a bit into this and it could be a bigger change than expected.
The templated SimpleActionServer is used everywhere in nav2 with different actions and so different result msgs.
The most minimalistic version I can think of would be to adapt the accept_pending_goal function.
Specifically the abort call here:
https://github.com/ros-planning/navigation2/blob/bc6a41d08dbf1c36c0c4e32b0a32393194d12ffd/nav2_util/include/nav2_util/simple_action_server.hpp#L246-L249

So by customizing this call, one could at least distinguish between a preempted goal and an erroneous server.

Another possibility would be to pass a specific result to the terminate functions.
Then there is no change on the SimpleActionServer required but the action servers all need to pass a specific error code as a result.
https://github.com/ros-planning/navigation2/blob/bc6a41d08dbf1c36c0c4e32b0a32393194d12ffd/nav2_util/include/nav2_util/simple_action_server.hpp#L363-L366

The client has to assume a preemption by checking for an empty result msg though.

@SteveMacenski
Copy link
Member

SteveMacenski commented Apr 28, 2020

Another possibility would be to pass a specific result to the terminate functions.

What would that solution look like? That actually sounds very reasonable and I've long felt that was necessary. We didn't do it before because we didn't have action server return types before. This PR #1586 adds feedback to all the action servers (which I've been lazy have haven't closed up, I need to do that soon). Having a PR to implement result types is totally reasonable and should be I'd argue should be done. Does that resolve the preemption vs failure conditions for returning the abort type to the Bt nodes? I suppose the status codes could be the same between all messages/action servers. Maybe make a nav2 Status message type that each has their return type of, and the template on the BT node should be able to handle that. Ah, actually, I don't think they would have to all be the same, since the BT action node implementations have the on_success method where they can look at the result without having to push that check in the underlying bt_action_node.

For argument's sake, if we did make them all use the Status and that was relied on, we would just need to document on the website & in the migration guide that all uses of the nav2 behavior tree actions need to make use of a Status result type in the action message definition (it may have more, but must include this).

I'm not sure I understand the first suggestion you mentioned. Can you clarify if you feel if its a better solution than the terminate one? The best that I can tell, this suggests making the accept_pending_goal method virtual for each implementation to deal with, which is probably not a good solution. I'm not sure though exactly what you propose here. Or do you mean "customize this call" by returning not the empty_result() but an actual result? Is that much different than the terminate suggestion? They both would involve including a non-empty result message of the action and having somewhere in there a status specify the reason for returning (actual failure, timeout, preempted, etc). Should we do both?

@SteveMacenski
Copy link
Member

@maxlein any update?

@QQting
Copy link
Contributor

QQting commented May 5, 2020

Hi all,

I met the same issue in #1683, and I found the ResultCode is ABORTED when not only the controller_server receives a new goal but also the planner_server fails to create a new path.

https://github.com/ros-planning/navigation2/blob/b01810bcf5da97620afa53ec047f54f34c80c3e1/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L221-L223

@maxlein
Copy link
Contributor Author

maxlein commented May 5, 2020

Sorry, lot of work atm ...

Or do you mean "customize this call" by returning not the empty_result() but an actual result?

That's what I meant. Just that we only have to adapt at one place and let all the clients as is. It's less work to do and something like a quick fix but without error codes doesn't make sense looking at the nav stack in a more general way.

They both would involve including a non-empty result message of the action and having somewhere in there a status specify the reason for returning (actual failure, timeout, preempted, etc). Should we do both?

We have two options I think:

  1. Having one status msg for all nodes
  • the default return value of the status msg could be set to "preempted" so that accept_pending_goal stays as it is or check it for the preempted state if not set to preempted by default
  1. Having customized status msgs:
  • need to make the accept_pending_goal method virtual for each implementation or have a constraint for status msgs that the default value has to be preempted

Personally I don't like to have any assumptions or constraints one has to look for but depends if it could save a lot of refactoring.

As an example, if we have just one status msg, we would need to change only:

https://github.com/ros-planning/navigation2/blob/6fb43d4a0543897d29d38b3e4a749b6bea6108e1/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_node.hpp#L221

to check for the preempted state.
And the nodes which want to handle error states can implement on_aborted().

In my test I have a NavigationResult.msg which looks like that for the moment:

int8 NO_ERROR = 0  or preempted
int8 GENERIC_ERROR = 1
int8 BLOCKED_BY_OBSTACLE_ERROR = 2

int8 status

The question is also if there is a need for customized result states, or if we just have a bunch of states in there? And if someone needs an extra state, add it and make a PR or build nav2_msgs from source.

@SteveMacenski
Copy link
Member

SteveMacenski commented May 5, 2020

I'm a little unplugged from this issue for the moment. Forgive me if below is obvious.

The ending take a result (e.g. terminate_current, terminate_all, succeeded_current, accept_pending_goal), except accept pending goal, with default parameters that no one overrides. All each server would need to do is send their own result and fill it in with whatever context is required to notify about "why" it ended. The BT action node could have an is_valid_result() method that takes the message and returns a bool about "yes, this is an ending that should be processed" or "no, this is a preemption, don't exit". That would replace the ABORTED check.

That would be general and minimal changes required

  • Accept pending goal takes a result type for preemption
  • Each server's terminal conditions need to be given a result
  • Each BT node needs a is_valid_result() override to parse the specific status messages (functionally for our nodes, we could use a nav2_msgs/Status for reducing duplication, but we've made no requirement for it)

I imagine this status just says something like: FAILED, PREEMPTED, SUCCEEDED, CANCELLED (pulling from ROS1 http://docs.ros.org/kinetic/api/actionlib_msgs/html/msg/GoalStatus.html). different from your suggested GENERIC_ERROR, etc. I think these convey the message well.

It's not immediately clear to me why it appears to work for the controller, though. I see in the controller server when the algorithm fails it does a terminate_current and you can easily show in simulation that when the controller fails, it successfully exits and causes recoveries to trigger. @QQting reported that the planner doesn't do that but also terminate_current on failing to compute. @maxlein your custom node (?) also shows that to cause issues. I agree that is should cause issues, so why is the controller working?

That makes me wonder why there is no rclcpp_action preemption distinction, I filed a ticket for that in rclcpp and we'll see what they say (ros2/rclcpp#1104). That's the "best" solution in my mind, because then we just switch the ABORTED to PREEMPTED in the lambda and we're done. All of the other result codes are done with this wrapped result analog to ROS1 actionlib except the preemption code, which is what we really want here. We don't want to have a nav2_msgs/Status that essentially just mirrors that ResultCode if we can help it.

Edit: On your comment about removing the if (result.code != rclcpp_action::ResultCode::ABORTED), is there anything not working by removing that? I recall adding that because the controller would stop without that check. I added that so the goal_result_available_ = true; wasn't called when the controller was still running because of a preemption, causing the controller Bt node to return.

@maxlein
Copy link
Contributor Author

maxlein commented May 6, 2020

That makes me wonder why there is no rclcpp_action preemption distinction, I filed a ticket for that in rclcpp and we'll see what they say (ros2/rclcpp#1104). That's the "best" solution in my mind, because then we just switch the ABORTED to PREEMPTED in the lambda and we're done.

Good point, didn't look that deep. That's the way to go.
But what to do now? As I understand this, ideally we would have the action server "layer" with its action status msg (Canceled, Aborted, Preempted, ..) and on top of that the app "layer" with it's result msg (e.g. couldn't plan, couldn't move my forklift to N inches, etc).
So to also support error codes right now, we would need to have both msgs in the result msg:

int8 PREEMPTED = 0  
int8 ABORTED = 1
int8 SUCCESS = 2
...  

int8 NO_ERROR = 0
int8 COULD_NOT_PLAN = 1  
int8 OBSTACLE_DETECTED = 2
...

int8 action_status
int8 error_code

Edit: On your comment about removing the if (result.code != rclcpp_action::ResultCode::ABORTED), is there anything not working by removing that?

Yes, preemptions are not working ;). But in my case it can't happen so this is not an issue for my "custom" controller right now.

@SteveMacenski
Copy link
Member

SteveMacenski commented May 6, 2020

I think the path forward is

  • Someone interested in this (hoping @maxlein or @QQting or someone from the summer sprint. Me, maybe if necessary) would be willing to implement the proposed changes to rclcpp_actions and submit a PR. Things are much more likely to get traction with a PR.
  • Update our action server wrapper logic to use this preemption method and the BT action node to look for that result code to ignore only.
  • Consider what you mention here additional error codes in a nav2_msgs/Status to give more contextual information about why an action failed. This seems action specific so maybe a standard Status message doesn't make sense and its just baked into the action definition.

The alternative to modifying rclcpp_actions is to:

  • Create the nav2_msgs/Status message with the same information as rclcpp_action ResultCode + preempted.
  • Add to each server action result
  • In each server, when exiting, give it that result object with the right code
  • Add BT action node is_result_valid() virtual function and have that replace the check for ABORTED in the lambda
  • Override the is_result_valid() for each specific BT action now (only a few) to check the code if its preempted, return false, else return true.
  • Consider what you mention here additional error codes in a nav2_msgs/Status to give more contextual information about why an action failed. This seems action specific so maybe a standard Status message doesn't make sense and its just baked into the action definition.

Either way, your mentioning of having the result message tell the caller why it failed is valuable. If you're open to submitting PR to implement that for the planner / controller, that would be valuable. On its own, the BT navigator would ignore them, but at least they're there for use later or if called by other clients.

The alternative is obviously less ideal since we have redundant information in the ResultCode from the action server and the result statuses. Also adds more complexity for the BT designers. Also isn't a generic fix. But the +1 is that its all in our repos we control and can push through.

Thoughts?

@QQting
Copy link
Contributor

QQting commented May 7, 2020

Hi all,

Here #1683 (comment) is what I've tested.

Currently, I don't have any ideas about how to improve this because I'm not familiar enough about rclcpp_action.

By the way,

Edit: On your comment about removing the if (result.code != rclcpp_action::ResultCode::ABORTED), is there anything not working by removing that?

Yes, preemptions are not working ;). But in my case it can't happen so this is not an issue for my "custom" controller right now.

Is it the reason my global plan cannot avoid a dynamic obstacle because preemptions are not working?

@maxlein
Copy link
Contributor Author

maxlein commented May 7, 2020

I may have time to look into rclcpp_actions next week.
This would be my preferred way.

@QQting I don‘t think that has something to do with your problem.
I have this fix as a test only in my fork.

@SteveMacenski
Copy link
Member

@QQting I don't think you actually need to know very much about rclcpp_actions. I've taken some looks at it in the past, but really all you're doing is adding an extra option to the API for preempt(). So really what you can do is track down how abort() works and literally just copy it and rename things to preempt() and add a PREEMPTED to the result codes enum.

I think that's actually a really good intro-to-rclcpp task.

Is it the reason my global plan cannot avoid a dynamic obstacle because preemptions are not working?

I'm not exactly sure what you mean by that. Can you give us a gif or something in your ticket?

@QQting
Copy link
Contributor

QQting commented May 8, 2020

Is it the reason my global plan cannot avoid a dynamic obstacle because preemptions are not working?

I'm not exactly sure what you mean by that. Can you give us a gif or something in your ticket?

@SteveMacenski
After I remove If clause, the global plan seems cannot be preempted, and the result is like below pictures. Global path is the green one, and local path is the red one.
not_avoid_obj_1
not_avoid_obj_2

@SteveMacenski
Copy link
Member

@QQting I can't tell for sure without more analysis but what that looks like to me is that your planner is preempting your controller and since you removed the abort, the controller will exit and you're just driving with the last residual command (unsafely). Your problem would be solved by the solutions in this discussion.

@maxlein
Copy link
Contributor Author

maxlein commented May 11, 2020

Looking at rclcpp it may not be only "adding" the preempt function like the abort is done.
From what I can see, we have dependencies to rcl and if we change rcl, then rclpy will also want these features. So multiple pr's for each repo. So the change itself may not be that much but all the boilerplate around costs much time I don't have right now. Maybe someone can convince the owners of these repos to be something like a missing requirement for actions.

Otherwise we need to take the alternative to modifying rclcpp...

@SteveMacenski
Copy link
Member

Don't expect OR to do this for us. We will likely need to take the initiative and do it. It really shouldn't take all that much time.

@naiveHobo
Copy link
Contributor

ros2/rclcpp#1117 implements the suggested preempt API

@QQting
Copy link
Contributor

QQting commented May 25, 2020

@maxlein @SteveMacenski
Are you going to solve this case by using the new preempt API from rclcpp? I'm on a business trip these days, so I can't help this until this Friday.

@maxlein
Copy link
Contributor Author

maxlein commented May 25, 2020

I think this will take some time. We have to wait for the PR's in rclcpp and rcl_interfaces to go through which will be after the foxy release I think.

@SteveMacenski
Copy link
Member

It is going to take a bit of time. We need to:

  • Add the PREEMPTING message to the interfaces (super fast)
  • Work with OR to update the state machine diagram in the design.ros.org page to include it
  • Implement the bit of state machine logic required for that
  • Review / merge PR for C++ support.

@QQting @maxlein the design part as described here (ros2/rclcpp#1104 (comment)) is another place to help out. If you can help with that, then @naiveHobo's job is much easier.

@Jconn
Copy link
Contributor

Jconn commented Jun 5, 2020

I am running into this issue now on eloquent and foxy, I reproduce like this:
Set a navigation goal through rviz, wait for the goal to fail due to Failed to make progress and abort

Set another navigation goal through rviz, and confirm there is no log output from the controller server.

Sample log output, abort is caused due to a tf timeout:

[bt_navigator-14] [INFO] [1591369067.853691380] [bt_navigator]: Begin navigating from current location to (2.24, -0.34)
[controller_server-11] [INFO] [1591369067.858775641] [controller_server]: Received a goal, begin computing control effort.
[controller_server-11] [INFO] [1591369067.859614374] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369067.909798072] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369068.909685192] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369069.959686889] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369070.959683377] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369072.009687667] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369073.059711623] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369074.059691576] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369075.109682494] [controller_server]: Passing new path to controller.
[controller_server-11] [INFO] [1591369076.109683507] [controller_server]: Passing new path to controller.
[controller_server-11] [ERROR] [1591369076.159813487] [tf_help]: Transform data too old when converting from odom to map
[controller_server-11] [ERROR] [1591369076.159853557] [tf_help]: Data time: 1591369076s 146044447ns, Transform time: 1591369075s 722273255ns
[controller_server-11] [ERROR] [1591369076.159997115] [controller_server]: Unable to transform robot pose into global plan's frame
[controller_server-11] [WARN] [1591369076.160043059] [controller_server_rclcpp_node]: [follow_path] [ActionServer] Aborting handle.
[bt_navigator-14] [INFO] [1591369088.545809046] [bt_navigator]: Received goal preemption request
[bt_navigator-14] [INFO] [1591369088.545923338] [bt_navigator]: Begin navigating from current location to (1.93, 0.86)
 [bt_navigator-14] [INFO] [1591369208.345823751] [bt_navigator]: Received goal preemption request
[bt_navigator-14] [INFO] [1591369208.345940658] [bt_navigator]: Begin navigating from current location to (1.66, -0.57)

daisukes added a commit to CMU-cabot/navigation2 that referenced this issue Jun 16, 2020
daisukes added a commit to CMU-cabot/navigation2 that referenced this issue Jun 16, 2020
@daisukes
Copy link
Contributor

daisukes commented Jun 16, 2020

I'm running into this issue and have been playing with this.

How about this patch while waiting for the interface change?

Set the result anyway in the callback, and then in the next tick(), check goal_status if it is aborted or not to guess the abortion is caused whether by a new goal or not. If we give a new goal, goal_handle is already replaced with a new one when we get the callback and it must not return ABORTED status, otherwise, we keep the old one which must return ABORTED. This means we can guess the reason for abortion.

master...CMU-cabot:bt-action-node-fix

(updated)

@SteveMacenski
Copy link
Member

What is the goal status when it was given a new goal? This patch looks if the goal handle is aborted and therefore not updated (because if it were updated, it would be not in that state). Is that correct?

Another option would be to compare the goal IDs with get_goal_id to directly check if its a new goal rather than assuming something about the state machine of the action server (which is subject to change or async issues), if that's really what you're looking for - is this a different goal handle.

Not a great solution, but I think it probably gets us temporarily past our issues that I can live with.

daisukes added a commit to CMU-cabot/navigation2 that referenced this issue Jun 18, 2020
@daisukes
Copy link
Contributor

@SteveMacenski
Thank you for your suggestion. It is the simplest solution I think!
I will make a PR with this patch.

daisukes added a commit to CMU-cabot/navigation2 that referenced this issue Jun 18, 2020
daisukes added a commit to CMU-cabot/navigation2 that referenced this issue Jun 18, 2020
SteveMacenski pushed a commit that referenced this issue Jun 18, 2020
@SteveMacenski
Copy link
Member

This is technically fixed in a way that I don't see any medium term shortcomings. I'm going to close this ticket but that's not to say we shouldn't also work on actually implementing preemption in rclcpp_actions. We have several tickets open for that already and once complete we can update our state machine to use it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants