MPPI ARM Binaries Issue in RPi4 #4380

Open
avanmalleghem opened this issue May 30, 2024 · 20 comments


@avanmalleghem

  • Operating System: Yocto based - on a Jetson Nano
  • ROS2 Version: Humble (yocto recipes)
  • Version or commit hash: 1.1.14
  • DDS implementation: CycloneDDS

Steps to reproduce issue

I use the MPPI controller to navigate with my real robot and observe really strange behavior. To isolate the issue, I removed the obstacle layer and the velocity smoother, and I send goals that require only linear velocity. It is a differential-drive robot.

  • When I start the robot and stay in the initial orientation, sending a goal works like a charm. Observe, on the following image, the map and odom axes, the plan and the robot moving in the correct direction.
    image
  • If I manually turn the robot by 90° and then send a goal, it still works (observe the axes, plan and direction):
    image
    image
  • BUT, if I manually turn the robot by 180° and send a goal, it goes in the wrong direction. For example, in the following image, the robot goes backward. You can even see the trajectories in blue behind the robot. The opposite is true as well: if I send a goal behind the robot, it goes forward.
    image

Here is my nav2 configuration for controller_server :

controller_server:
  ros__parameters:
    odom_topic: odometry/filtered_odom
    min_x_velocity_threshold: 0.001
    min_y_velocity_threshold: 0.5
    min_theta_velocity_threshold: 0.001
    debug_trajectory_details: true
    failure_tolerance: 0.3
    progress_checker_plugin: "progress_checker"
    goal_checker_plugins: ["goal_checker"]
    controller_plugins: ["FollowPath"]
    progress_checker:
      plugin: "nav2_controller::SimpleProgressChecker"
      required_movement_radius: 0.5
      movement_time_allowance: 10.0
    goal_checker:
      plugin: "nav2_controller::SimpleGoalChecker"
      xy_goal_tolerance: 1.5
      yaw_goal_tolerance: 6.28
      stateful: True
    FollowPath:
      plugin: "nav2_mppi_controller::MPPIController"
      time_steps: 28
      model_dt: 0.05
      batch_size: 500
      vx_std: 0.2
      vy_std: 0.0
      wz_std: 0.4
      vx_max: 2.0
      vx_min: -0.5
      vy_max: 0.0
      wz_max: 3.0
      iteration_count: 1
      prune_distance: 5.0
      transform_tolerance: 0.1
      temperature: 0.3
      gamma: 0.015
      motion_model: "DiffDrive"
      visualize: true
      reset_period: 1.0 # (only in Humble)
      regenerate_noises: false
      TrajectoryVisualizer:
        trajectory_step: 5
        time_step: 3
      critics: ["ConstraintCritic", "PathAlignCritic", "PathFollowCritic"]
      ConstraintCritic:
        enabled: true
        cost_power: 1
        cost_weight: 4.0
      PathAlignCritic:
        enabled: true
        cost_power: 1
        cost_weight: 5.0
        max_path_occupancy_ratio: 0.05
        trajectory_point_step: 4
        threshold_to_consider: 0.0
        offset_from_furthest: 7
        use_path_orientations: false
      PathFollowCritic:
        enabled: true
        cost_power: 1
        cost_weight: 10.0
        offset_from_furthest: 7
        threshold_to_consider: 0.0
@SteveMacenski
Member

SteveMacenski commented May 30, 2024

I think some videos here would be more illustrative. I'm not entirely sure I understand what you're describing 😥

How is the path created? Can you reproduce this on the nav2_bringup robot setup? What happens if you add in the full suite of critics?

on a Jetson Nano

Not that I think this is the issue, but woof, I'd love to hear how well this actually works on a Jetson Nano. That's got to be eating your CPU alive.

 vx_max: 2.0; time_steps: 28; batch_size: 500

Neither here nor there for the ticket, but I'd be concerned about moving that fast with these settings.
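One way to read the concern about these settings is to look at how far ahead the controller predicts. The sketch below is only back-of-the-envelope arithmetic on the values quoted from the configuration above; it assumes the prediction horizon is simply time_steps × model_dt, and the variable names are illustrative rather than anything from the Nav2 API.

```python
# Values copied from the FollowPath block of the configuration in this issue.
time_steps = 28      # number of steps in each sampled trajectory
model_dt = 0.05      # seconds between steps
batch_size = 500     # trajectories sampled per iteration
vx_max = 2.0         # maximum forward velocity in m/s

# Assumption: the prediction horizon is time_steps * model_dt.
horizon_s = time_steps * model_dt

# Distance a trajectory can cover if it drives at vx_max for the whole horizon.
max_reach_m = vx_max * horizon_s

# Number of trajectory points evaluated per controller iteration.
points_per_iteration = time_steps * batch_size

print(f"prediction horizon: {horizon_s:.2f} s")
print(f"reach at vx_max:    {max_reach_m:.2f} m")
print(f"points/iteration:   {points_per_iteration}")
```

With these numbers the controller looks roughly 1.4 s ahead, which at 2 m/s is under 3 m of travel; whether that is enough depends on the robot and environment.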

@avanmalleghem
Author

avanmalleghem commented May 31, 2024

Thanks for your answer. After a day of troubleshooting, I found something really strange that I would like to share with you. I tried several configurations in Gazebo, and the result depends on which FollowPath plugin I use and on whether I run the navigation nodes on my laptop (Ubuntu 22.04) or on the Jetson Nano (Yocto-based, using kirkstone).

To be more precise:

| Plugin | Laptop | Jetson |
|--------|--------|--------|
| MPPI   | OK     | NOK    |
| DWB    | OK     | OK     |

Steps to reproduce the NOK :

  • Start navigation using the MPPI plugin on a Yocto-based Jetson Nano
  • Manually turn your robot by 180° (yaw axis)
  • Send a command 3m forward without any obstacles
  • Observe that it goes backward instead of forward
movie.mp4

For your information,

  • I also tried to install the MPPI dependencies from source on my laptop to use the same versions (xtl 0.7.7, xtensor 0.24.7, xsimd 11.2.0) as on Yocto, and it still works, so I guess version compatibility isn't the issue...
    EDIT : not sure on the above point because of the final conclusion. Should be tested again to be confirmed. If it works it means the version compatibility issue is on ARM only.
  • I also tried to use it on a Raspberry Pi 4 (64-bit ARM, Ubuntu Server 22.04, packages from binaries), and navigation can't start when using MPPI (but it starts successfully when using DWB). I'm going to explore this a bit more in the near future; it may be another issue not related to this one.
    image
  • I also tried to use it on a Jetson Orin Nano (Jetpack 6, packages from binaries) and navigation works like a charm using MPPI.

Based on all these observations, any idea where to explore ?

Not that I think this is the issue, but woof, I'd love to hear how well this actually works on a Jetson Nano. That's got to be eating your CPU alive.
Here nor there for the ticket, but I'm be concerned with these settings moving that fast

To be honest, we are first trying to make it work; then we will assess performance, CPU usage, how far we need to scale the settings down, and how it behaves in a production environment. I can come back to you with our conclusions after that assessment.

How is the path created?

nav2_navfn_planner/NavfnPlanner output

What happens if you add in the full suite of critics?

Same behavior

@SteveMacenski
Member

Did you try compiling MPPI from source and still have the same crash on the RPi?

Getting a backtrace on the crash would be helpful to see what's failing: https://docs.nav2.org/tutorials/docs/get_backtrace.html We had an issue long ago where the binaries crashed because the build farm's computers used build flags incompatible with normal x86 machines (#3767), and I'm curious whether the same is happening now for ARM and we need to find which instructions might not exist. Read through that thread in detail for the information and troubleshooting methods we evaluated there. Giving me your lscpu output would also help.

Wrt the 180 deg issue, @pepisg was troubleshooting some Jetson MPPI issue and I don't think he ever sent me his final report or how we could address it. It might be worth putting your heads together to check whether this is the same issue he was looking into.

What version of Nav2 are you using when compiling from source? How are you getting binaries and what version are those?

Same behavior

Well the crash vs the '180' issue are two very different things, so be specific.

@pepisg
Contributor

pepisg commented May 31, 2024

Hi!

I found a similar problem a while ago while building nav2 from source on iron / ARM: the trajectories generated by the controller looked odd and did not seem to try to follow the path, even w/o obstacles and with only the PathFollow critic active. Also, the optimal trajectory did not seem to be sampled from the generated trajectories. I think it's the same problem reported here.

I started progressively rolling back changes from #4174 and was able to trace the bug down to the integrateStateVelocities function in optimizer.cpp, particularly to these changes.

I ended up rolling back the PR until having more time to dig deeper.

@avanmalleghem
Author

avanmalleghem commented Jun 3, 2024

Raspberry issue

Did you try compiling MPPI from source and still have the same crash on the RPi?

Compiling from source solves the issue on the RPi (I use branch 1.1.14, the version of the latest binaries for Humble).

Getting a backtrace on the crash would be helpful to see what's failing

Here it is. I guess I can't have line numbers because it is based on binary installation... Don't hesitate if you have an idea on how to provide additional information.

[INFO] [1717412279.994492029] [controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x0000ffffec13585c in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_controller.so
(gdb) backtrace
#0  0x0000ffffec13585c in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_controller.so
#1  0x0000fffff7c3b2c0 in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) ()
   from /opt/ros/humble/lib/libcontroller_server_core.so
#2  0x0000fffff7ef5208 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#3  0x0000fffff7f01160 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#4  0x0000fffff7eed018 in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#5  0x0000fffff7eee978 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#6  0x0000fffff7efea24 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#7  0x0000fffff7dab724 in ?? () from /opt/ros/humble/lib/librclcpp.so
#8  0x0000fffff7da91e0 in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) () from /opt/ros/humble/lib/librclcpp.so
#9  0x0000fffff7da9594 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () from /opt/ros/humble/lib/librclcpp.so
#10 0x0000fffff7db159c in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/humble/lib/librclcpp.so
#11 0x0000fffff7db17b4 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/humble/lib/librclcpp.so
#12 0x0000aaaaaaaa18d0 in ?? ()
#13 0x0000fffff77b73fc in __libc_start_call_main (main=main@entry=0xaaaaaaaa17c0, argc=argc@entry=4, argv=argv@entry=0xffffffffea28)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#14 0x0000fffff77b74cc in __libc_start_main_impl (main=0xaaaaaaaa17c0, argc=4, argv=0xffffffffea28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:392
#15 0x0000aaaaaaaa1b30 in ?? ()

Read through that thread in detail for some information and troubleshooting methods that we evaluated during it that is helpful.

  • I tried to reinstall -> doesn't work
  • I tried to remove critics -> doesn't work

I don't know if any other test is relevant ? The issue seems to be in the "configure" method. Any idea ?

Giving me your lscpu is also good.

Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A72
    Model:               3
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r0p3
    CPU max MHz:         1800.0000
    CPU min MHz:         600.0000
    BogoMIPS:            108.00
    Flags:               fp asimd evtstrm crc32 cpuid
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   192 KiB (4 instances)
  L2:                    1 MiB (1 instance)
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Vulnerable
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Jetson issue

What version of Nav2 are you using when compiling from source? How are you getting binaries and what version are those?

1.1.14. I use meta-ros and consequently this recipe. To be more accurate, I resolve libomp-dev by using libgomp (I don't know if it can be an issue). And so reading the recipe :

I tried to build nav2_mppi_controller directly on the Jetson Nano, without Yocto, to check whether the issue is Yocto-related, and ran into an error you can find here: mppi-build-jeton-error.txt (the error is huge, so I don't know how to share it another way).

@SteveMacenski
Member

SteveMacenski commented Jun 4, 2024

RPi

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.

That looks like the issue from the previous ticket I linked to. Any important flags look missing between your CPU and the build farm's? https://build.ros2.org/job/Hbin_ujv8_uJv8__nav2_mppi_controller__ubuntu_jammy_arm64__binary/43/consoleFull#console-section-2

Seems like a flag in the build farm is being used that isn't valid for the RPi just like we were having with AVX before with AMD64. We can remove that build flag and re-release and that should be that hopefully.

Jetson

I'm not going to dig into custom setups with meta-ros / non-standard rosdep installs of dependencies. There are too many things that can go wrong specific to your situation. @pepisg are you on a Jetson for your issues or are you on another ARM-based SOM?

It would be worth looking into the diff that Pedro sent though and see if changing those lines back fixes your problem. That would tell us that this is the same instantiation of the previous issue vs something specific to your Yocto setup. That's something we can dig into more together.

@pepisg
Contributor

pepisg commented Jun 4, 2024

@SteveMacenski yeah I'm on a jetson AGX

@avanmalleghem
Author

avanmalleghem commented Jun 4, 2024

RPi

Here are the flags present on the build farm CPU but not on the RPi (every flag on the RPi CPU is also on the build farm CPU): aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs. To be honest, I don't know how to use this information. I tried to change the compile options as you suggested here: #3767 (comment), switching from
add_compile_options(-O3 -finline-limit=10000000 -ffp-contract=fast -ffast-math -mtune=generic)
to
add_compile_options(-O3 -finline-limit=10000000 -ffp-contract=fast -ffast-math -mtune=generic -maes -mpmull -msha1 -msha2 -matomics -mfphp -masimdhp -masimdrdm -mlrcpc -mdcpop -masimddp -mssbs)
and tried to build locally, but that is obviously not the right approach (every flag I added is an unrecognized command-line option):

--- stderr: nav2_mppi_controller                                      
c++: error: unrecognized command-line option ‘-maes’
c++: error: unrecognized command-line option ‘-mpmull’; did you mean ‘-mmusl’?
c++: error: unrecognized command-line option ‘-msha1’
c++: error: unrecognized command-line option ‘-msha2’
c++: error: unrecognized command-line option ‘-matomics’
c++: error: unrecognized command-line option ‘-mfphp’
c++: error: unrecognized command-line option ‘-masimdhp’
c++: error: unrecognized command-line option ‘-masimdrdm’
c++: error: unrecognized command-line option ‘-mlrcpc’
c++: error: unrecognized command-line option ‘-mdcpop’
c++: error: unrecognized command-line option ‘-masimddp’
c++: error: unrecognized command-line option ‘-mssbs’
gmake[2]: *** [CMakeFiles/mppi_controller.dir/build.make:76: CMakeFiles/mppi_controller.dir/src/controller.cpp.o] Error 1
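For anyone repeating the comparison on another board, the flag list above is just a set difference between two lscpu "Flags" lines. The sketch below shows that step. The RPi flags are copied from the lscpu output earlier in this thread; the build farm set is reconstructed here as the RPi flags plus the reported difference, which is an assumption, since the build farm's full lscpu output is not quoted in this issue. The helper name `parse_flags` is hypothetical.

```python
def parse_flags(flags_line):
    """Split the value of an lscpu 'Flags:' line into a set of feature names."""
    return set(flags_line.split())


# Copied from the RPi4 lscpu output posted earlier in this thread.
rpi4 = parse_flags("fp asimd evtstrm crc32 cpuid")

# ASSUMPTION: reconstructed as "RPi flags + reported difference"; the real
# build farm lscpu line is not quoted in this issue.
build_farm = parse_flags(
    "fp asimd evtstrm crc32 cpuid "
    "aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs"
)

# Features the binaries may rely on that the RPi4 CPU does not report.
missing_on_rpi = sorted(build_farm - rpi4)

# Features only the RPi4 reports (expected to be empty per the comment above).
only_on_rpi = sorted(rpi4 - build_farm)

print("missing on RPi4:", " ".join(missing_on_rpi))
print("only on RPi4:   ", " ".join(only_on_rpi) or "(none)")
```

To use it on real hardware, paste the two "Flags:" values from `lscpu` on each machine into the two `parse_flags` calls.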

Jetson

I tried to use 1.1.12 instead of 1.1.14 (so a version before PR #4174) and still ran into the same issue (the video above).

BUT I solved the issue by removing nav2-mppi-controller recipe from Yocto (and consequently its dependencies, xtl, xtensor and xsimd) and installing everything directly on the generated distro from sources using the right versions for xtl (0.7.2), xsimd (7.6.0) and xtensor (0.23.10).

I suspect an issue related to versions used by Yocto (xtl 0.7.7, xtensor 0.24.7 and xsimd 11.2.0). I will try to use older versions in Yocto and see if it solves the issue (if so, I will create a PR on meta-ros directly).

SteveMacenski changed the title from "MPPI plugin linear velocity direction issue" to "MPPI ARM Binaries Issue" on Jun 4, 2024
@SteveMacenski
Member

SteveMacenski commented Jun 4, 2024

RPi

@nuclearsandwich I don't suppose you are aware already of any RPi-build-farm specific problematic interactions in compiler settings?

@avanmalleghem It's worth looking over that list (aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs) and seeing which could plausibly be an issue. We can try to remove them and run a release to narrow down the list and disable the one causing a problem - assuming it doesn't result in some unacceptable perf hits. I think in ARM-world, there's enough variation that some boards are naturally going to have problems (but RPi seems important to support).

Jetson

OK, then it seems this is not a problem that we can resolve on our side, and you have your answer regarding the versions and whatnot to solve that part!

@SteveMacenski
Member

SteveMacenski commented Jul 3, 2024

@avanmalleghem any update on the build flags and issues?

@avanmalleghem
Author

To be honest, I don't know how to proceed. I can do so but I need some guidelines/links to follow.

@SteveMacenski
Member

SteveMacenski commented Jul 4, 2024

Some pattern that might help:

  • The Jetson binaries work, yeah? In that case, it has whatever flags you require that are missing on the RPi
  • So, you can 1:1 disable the flags in your list in the software, compile it on the Jetson, and then try to execute it on the RPi
  • You should be able to find which one(s) you need to disable which make it work on the RPi and we can disable that in the build farm job for binaries on ARM

A way to speed that up: if you compile on the Jetson with debug flags and transfer the result to the RPi, you can get the exact instruction that is failing with GDB and look up where it comes from. The Nav2 tutorial will get you the first mile with compiling for GDB and getting a backtrace (https://docs.nav2.org/tutorials/docs/get_backtrace.html), and other documentation can show you how to display the instruction that triggered an illegal instruction fault (SIGILL) in GDB.

I'll say that the flags that imply or specifically mention "simd" in their names make me suspicious. Does the RPi support SIMD? If not in general, that could point to a potential it'll-never-work issue. What flags does the RPi4 have in general (any of those AVX/SIMD)?
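A rough sketch of how the one-at-a-time bisection described above could be organised. On AArch64, GCC does not accept per-feature switches such as `-maes` (the errors earlier in this thread show that); as far as I know, features are instead toggled with `+feature` / `+nofeature` modifiers on `-march`. The mapping from lscpu names to GCC modifier names below is my own assumption and should be checked against the GCC AArch64 options documentation, and the `armv8.2-a` base is a hypothetical starting point, not something confirmed for the Orin or the build farm. None of these flags have been tried against the MPPI build.

```python
# ASSUMPTION: mapping from kernel feature names (lscpu "Flags") to GCC AArch64
# -march feature modifiers. Verify against the GCC documentation before use.
HWCAP_TO_MARCH_MODIFIER = {
    "aes": "aes",
    "pmull": "aes",
    "sha1": "sha2",
    "sha2": "sha2",
    "atomics": "lse",
    "fphp": "fp16",
    "asimdhp": "fp16",
    "asimdrdm": "rdma",
    "lrcpc": "rcpc",
    "asimddp": "dotprod",
    "ssbs": "ssbs",
    # "dcpop" is intentionally absent: no modifier is known to me for it.
}

# Flags reported in this thread as present on the build farm but not the RPi4.
MISSING_ON_RPI = (
    "aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs"
).split()


def candidate_march_flags(missing, base="armv8.2-a"):
    """Return one -march string per modifier, each disabling a single feature."""
    modifiers = sorted(
        {HWCAP_TO_MARCH_MODIFIER[f] for f in missing if f in HWCAP_TO_MARCH_MODIFIER}
    )
    return [f"-march={base}+no{m}" for m in modifiers]


def unmapped(missing):
    """Return the flags for which this sketch has no modifier."""
    return [f for f in missing if f not in HWCAP_TO_MARCH_MODIFIER]


candidates = candidate_march_flags(MISSING_ON_RPI)
for flag in candidates:
    print(flag)
print("no known modifier for:", " ".join(unmapped(MISSING_ON_RPI)))
```

Each printed string would be tried in turn in place of the default architecture flags when building on the Jetson, then the resulting library tested on the RPi4. Since the Cortex-A72 reported by lscpu is an Armv8-A core, building once with plain `-march=armv8-a` may be a quicker first check than walking the whole list.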

SteveMacenski changed the title from "MPPI ARM Binaries Issue" to "MPPI ARM Binaries Issue in RPi4" on Jul 4, 2024
@avanmalleghem
Author

avanmalleghem commented Jul 11, 2024

Here is the full stacktrace when building on Orin Nano and deploying on RPI :
image

I tried to disable the "simd"-like flags (asimdhp, asimdrdm and asimddp) but I don't know how to do so... Using, for example, add_compile_options(-mno-asimdhp) doesn't seem to be an option. How can I disable the flags in the software?

At the same time, I collected the compilation flags used (by adding -frecord-gcc-switches to the compiler flags):

  • On the Orin Nano, readelf -p .GCC.command.line install/nav2_mppi_controller/lib/libmppi_controller.so returns :
    image

  • on the RPI4, readelf -p .GCC.command.line install/nav2_mppi_controller/lib/libmppi_controller.so returns :
    image

@nuclearsandwich
Contributor

@nuclearsandwich I don't suppose you are aware already of any RPi-build-farm specific problematic interactions in compiler settings?

Negative. Our ARM builds run on AWS Graviton instances, though, so we don't have any RPi hardware on the official build farm.

@SteveMacenski
Member

SteveMacenski commented Jul 15, 2024

Two things:

  • First, I see that this references C++11, but the minimum required version is C++17 for Humble https://www.ros.org/reps/rep-2000.html#humble-hawksbill-may-2022-may-2027. Perhaps you need to have that on the RPi?
  • I see that the path is atomicity, which is one of the build flags you showed that the RPi didn't have. That may be the culprit, or some feature used by C++17 on the Orin isn't available with C++11 on the RPi. It's worth compiling without that on the Orin and seeing if it now works (or if you can even compile it without that; it may be a required feature).

Thanks Steven, I didn't think you had RPis in the farm, but didn't know if you had some other RPi-specific reported issues with binaries before that rhymed with this.

@tomTingle

Hi,
not sure if it helps, but we can also reproduce the issue on the x86 architecture.

I have the following laptop spec:

image

We run everything in a nix shell, but build the whole nav2 stack locally from source (with CXX 17). I can reproduce it with the versions 1.1.13, 1.1.14 and 1.1.15.

(xsimd-11.1.0, xtensor-0.24.7, xtl-0.7.5) with version 1.1.14

@aatb-ch

aatb-ch commented Aug 6, 2024

Just chiming in, as I also see this issue on a Pi 4 running 22.04 + Humble: nav2 fails to launch with the same error when using MPPI, and controller_server dies on launch. After compiling from source, everything runs fine. Seems like the same issue as #4380 also.

@SteveMacenski
Member

@aatb-ch happy to have the help - if you look up this thread I lay out the items needed to debug where this is coming from to potentially resolve for the binaries. I don't have an RPi4 to reproduce, so someone with one that needs to use it will need to help here if we want to make any progress :-)

@avanmalleghem
Author

I'm not using RPi4 but I have one so I can try to continue troubleshooting this issue.

First, I see that this references C++11 but the minimum required version are C++17 for Humble https://www.ros.org/reps/rep-2000.html#humble-hawksbill-may-2022-may-2027. Perhaps you need to have that in the RPi?

To be sure I compile with C++17, I built MPPI on the Orin Nano using -std=c++17. The GCC and G++ versions are the same on the RPi and on the Orin Nano (11.4).

  • I see that the path is atomicity which is one of the build flags you showed that the RPi didn't have. [...] Its worth compiling without that on the Orin and seeing if it now works (or if you can even compile it without that, it may be a required feature)

How can I compile without atomicity? I tried, but I can't find the right way to do so. I guess it takes some compile flags, but which ones?

@SteveMacenski
Member

This should permanently go away with #4621 once merged, as it completely removes xtensor in favor of Eigen
