feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454

knzo25 · 2024-11-25T08:31:06Z

Description

This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.

List of PRs:

feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454 (pointcloud preprocessing)
feat(autoware_pointcloud_preprocessor): cuda-accelerated pointcloud concatenation #9455 (concatenation)
feat(autoware_lidar_centerpoint): added the cuda_blackboard to centerpoint #9453 (centerpoint)
transfusion (TODO)
feat(autoware_probabilistic_occupancy_grid_map): cuda accelerated implementation #9542 (OGM - the first implementation will be independent of the blackboard to ease the transition)
feat: acceleration and transport layer tier4/aip_launcher#348 (aip_launcher)
feat: acceleration and transport layer sample_sensor_kit_launch#111 (sample_sensor_kit_launch)

To use these branches, the following additions to the autoware.repos are necessary:

  vendor/cuda_blackboard:
    type: git
    url: [email protected]:knzo25/cuda_blackboard.git
    version: main
  vendor/negotiated:
    type: git
    url: https://github.com/osrf/negotiated.git
    version: master

Depending on your machine and how many nodes are in a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a bug in ROS where, if too many service requests are sent at once, some of them are lost, and ros_launch cannot handle that.

How was this PR tested?

The sensing/perception pipeline was tested up to centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed on a laptop equipped with an RTX 4060 (laptop) GPU and an Intel(R) Core(TM) Ultra 7 165H (22 cores).

| Node / processing time [ms] | Current | PR |
| --- | --- | --- |
| /sensing/lidar/top/crop_box_filter_self/debug/processing_time_ms | 5.81 | N/A |
| /sensing/lidar/top/crop_box_filter_mirror/debug/processing_time_ms | 4.59 | N/A |
| /sensing/lidar/top/distortion_corrector/debug/processing_time_ms | 10.96 | N/A |
| /sensing/lidar/top/ring_outlier_filter/debug/processing_time_ms | 10.69 | N/A |
| /sensing/lidar/top/cuda_organized_pointcloud_adapter/debug/processing_time_ms | N/A | 3.75 |
| /sensing/lidar/top/cuda_pointcloud_preprocessor/debug/processing_time_ms | N/A | 1.00 |
| /sensing/lidar/concatenate_data_synchronizer/debug/processing_time_ms | 7.83 | 0.70 |
| Total | 39.88 | 5.45 |

Notes for reviewers

The main branch that I used for development is feat/cuda_acceleration_and_transport_layer.
However, the changes were too big, so I split them into separate PRs. That said, any further development will still happen on that branch (and then be cherry-picked to the respective PRs), and changes made during review will be cherry-picked back into the development branch.

Interface changes

An additional topic is added to perform type negotiation:
Example: input/pointcloud -> input/pointcloud and input/pointcloud/cuda

Effects on system behavior

Enabling this preprocessing in the launchers should substantially reduce latency and CPU usage (at the cost of higher GPU usage).


github-actions · 2024-11-25T08:31:26Z

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

You've checked our contribution guidelines.
Your PR follows our pull request guidelines.
All required CI checks pass before marking the PR ready for review.

mojomex

Thank you for the amazing PR, these performance improvements are desperately needed.

I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.

The main points I'd like to address are:

memory safety and idiomatic C++ (there is currently a lot of raw-pointer code, which should be avoided whenever possible)
modularity: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and makes it hard to maintain individual modules in the pipeline.

Thank you for your time!

mojomex · 2024-11-26T06:14:52Z

sensing/autoware_cuda_pointcloud_preprocessor/README.md

+
+The pointcloud preprocessing implemented in `autoware_pointcloud_preprocessor` has been thoroughly tested in autoware. However, the latency it introduces does not scale well with modern LiDAR devices due to the high number of points they introduce.
+
+To alleviate this issue, this package reimplements most of the pipeline presented in `autoware_pointcloud_preprocessor` leveraging the use of GPGPUs. In particular, this package makes use of CUDA to provide accelerated versions of the already establishes implementations, while also maintaining compatibility with normal ROS nodes/topics. <!-- cSpell: ignore GPGPUs >


Suggested change

To alleviate this issue, this package reimplements most of the pipeline presented in `autoware_pointcloud_preprocessor` leveraging the use of GPGPUs. In particular, this package makes use of CUDA to provide accelerated versions of the already establishes implementations, while also maintaining compatibility with normal ROS nodes/topics. <!-- cSpell: ignore GPGPUs >

To alleviate this issue, this package reimplements most of the pipeline presented in `autoware_pointcloud_preprocessor` leveraging the use of GPGPUs. In particular, this package makes use of CUDA to provide accelerated versions of the already established implementations, while also maintaining compatibility with normal ROS nodes/topics. <!-- cSpell: ignore GPGPUs >

Also, arguably, GPGPUs could be added to the dictionary since it is quite a commonly used term.

mojomex · 2024-11-26T06:18:39Z

sensing/autoware_cuda_pointcloud_preprocessor/config/cuda_pointcloud_preprocessor.param.yaml

+    self_crop.min_x: 1.0
+    self_crop.min_y: 1.0
+    self_crop.min_z: 1.0
+    self_crop.max_x: -1.0
+    self_crop.max_y: -1.0
+    self_crop.max_z: -1.0
+    mirror_crop.min_x: 1.0
+    mirror_crop.min_y: 1.0
+    mirror_crop.min_z: 1.0
+    mirror_crop.max_x: -1.0
+    mirror_crop.max_y: -1.0
+    mirror_crop.max_z: -1.0


Instead of having two hard-coded crop-box filters here, a list would be more easily extensible, and it should be quite straightforward to change the implementation.
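
As a rough illustration of the list-based idea, here is a minimal host-side C++ sketch. The `CropBox`, `Point`, `isInside`, and `isCropped` names are hypothetical and not taken from this PR, and the inclusive bounds check is an assumption about the intended semantics.

```cpp
#include <vector>

// Hypothetical axis-aligned crop box. The fields mirror the min_x ... max_z
// parameters in the YAML above; the struct itself is an assumption.
struct CropBox
{
  float min_x, min_y, min_z;
  float max_x, max_y, max_z;
};

// Hypothetical minimal point type (the real point type has more fields).
struct Point
{
  float x, y, z;
};

// Returns true when the point lies inside the box (bounds inclusive).
inline bool isInside(const Point & p, const CropBox & box)
{
  return p.x >= box.min_x && p.x <= box.max_x &&
         p.y >= box.min_y && p.y <= box.max_y &&
         p.z >= box.min_z && p.z <= box.max_z;
}

// Returns true when the point falls inside any box of the list,
// i.e. when it should be removed from the cloud.
inline bool isCropped(const Point & p, const std::vector<CropBox> & boxes)
{
  for (const auto & box : boxes) {
    if (isInside(p, box)) {
      return true;
    }
  }
  return false;
}
```

With this check, a box whose minimum exceeds its maximum (such as the 1.0 / -1.0 defaults above) contains no points, so it would act as a disabled filter. Adding a third crop region then only means appending one more entry to the list.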

mojomex · 2024-11-26T06:19:43Z

sensing/autoware_cuda_pointcloud_preprocessor/README.md

+| Filter Name                       | Description                                                                                                                                  | Detail                                            |
+| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------- |
+| cuda_organized_pointcloud_adapter | Organizes a pointcloud per ring/channel, so that the memory layout allows parallel processing in cuda                                        | [link](docs/cuda-organized-pointcloud-adapter.md) |
+| cuda_pointcloud_preprocessor      | Implements the cropping, distortion correction, and outlier filtering (ring-based) of the `autoware_pointcloud_preprocessor`'s cpu versions. | [link](docs/cuda-pointcloud-preprocessor.md)      |


Suggested change

| cuda_pointcloud_preprocessor | Implements the cropping, distortion correction, and outlier filtering (ring-based) of the `autoware_pointcloud_preprocessor`'s cpu versions. | [link](docs/cuda-pointcloud-preprocessor.md) |

| cuda_pointcloud_preprocessor | Implements the cropping, distortion correction, and outlier filtering (ring-based) of the `autoware_pointcloud_preprocessor`'s CPU versions. | [link](docs/cuda-pointcloud-preprocessor.md) |

mojomex · 2024-11-26T06:23:24Z

sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md

+This node implements all standard pointcloud preprocessing algorithms applied to a single LiDAR's pointcloud in CUDA.
+In particular, this node implements:
+
+- crop boxing (ego-vehicle and ego-vehicle's mirrors)


Suggested change

- crop boxing (ego-vehicle and ego-vehicle's mirrors)

- box cropping (ego-vehicle and ego-vehicle's mirrors)

mojomex · 2024-11-26T06:25:36Z

sensing/autoware_cuda_pointcloud_preprocessor/docs/cuda-pointcloud-preprocessor.md

+
+## Assumptions / Known limits
+
+- The CUDA implementations, while following the original CPU ones, will not offer the same numerical results, and small approximations were needed to maximize the GPU use.


Suggested change

- The CUDA implementations, while following the original CPU ones, will not offer the same numerical results, and small approximations were needed to maximize the GPU use.

- The CUDA implementations, while following the original CPU ones, will not offer the same numerical results, and small approximations were needed to maximize GPU usage.

mojomex · 2024-12-24T05:41:22Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+
+  std::size_t max_ring = 0;
+
+  for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;


Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.

mojomex · 2024-12-24T05:41:27Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+  num_rings_ = std::max(num_rings_, static_cast<std::size_t>(16));
+  std::vector<std::size_t> ring_points(num_rings_, 0);
+
+  for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;


Iteration without explicit bounds checking of the underlying array is not memory-safe. Thus, I would suggest using the abovementioned PointCloud2Iterators here.

mojomex · 2024-12-24T05:42:34Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+    max_ring = std::max(max_ring, ring);
+  }
+
+  // Set max rings to the next power of two


Admittedly kind of a niche problem, but not all sensors (e.g. Pandar40P) have 2^n rings.

Although auto-detecting the number of rings is nice, there is no hard guarantee that it is accurate (e.g. the sensor is under a cover when turned on, so there are no points in the cloud).

Does cuda_pointcloud_preprocessor support changing dimensions of input pointclouds across iterations (e.g. it starts with 0 rings in cloud 1, then 64 rings with 2000 points, then 64 rings with 5000 points each)?
If not, I'd suggest making n_rings and max_points_per_ring parameters so that we can guarantee correct behavior at runtime.
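
To make the concern concrete, here is a small C++ sketch. It is not the PR's code: `nextPowerOfTwo`, `OrganizedLayout`, and `fits` are hypothetical names, and 0-based ring ids are assumed. It shows how rounding up treats a 40-ring sensor, and what a parameter-based layout check could look like.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Smallest power of two that is >= n (returns 1 for n == 0).
inline std::size_t nextPowerOfTwo(std::size_t n)
{
  // Guard against overflow of the shift below.
  assert(n <= (std::numeric_limits<std::size_t>::max() >> 1) + 1);
  std::size_t p = 1;
  while (p < n) {
    p <<= 1;
  }
  return p;
}

// Hypothetical fixed layout taken from parameters instead of auto-detection.
struct OrganizedLayout
{
  std::size_t num_rings;
  std::size_t max_points_per_ring;
};

// Returns true when a point with the given ring id and position within
// its ring fits into the preallocated organized buffer.
inline bool fits(const OrganizedLayout & layout, std::size_t ring, std::size_t index_in_ring)
{
  return ring < layout.num_rings && index_in_ring < layout.max_points_per_ring;
}
```

For a 40-ring sensor, `nextPowerOfTwo(40)` yields 64, so 24 rows of the organized buffer would never be filled. With explicit parameters, a point that does not satisfy `fits` can be dropped or reported instead of being written out of bounds.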

mojomex · 2024-12-24T05:47:44Z

...reprocessor/src/cuda_organized_pointcloud_adapter/cuda_organized_pointcloud_adapter_node.cpp

+bool CudaOrganizedPointcloudAdapterNode::orderPointcloud(
+  const sensor_msgs::msg::PointCloud2::ConstSharedPtr input_pointcloud_msg_ptr)
+{
+  const autoware::point_types::PointXYZIRCAEDT * input_buffer =


Same comment about bounds/type checking as above 🙇
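
For reference, a minimal sketch of the kind of check meant here, written against stand-in types so that it compiles without ROS. `CloudMsg` only mimics the relevant fields of `sensor_msgs::msg::PointCloud2`, `PointXYZC` is a hypothetical reduced point layout (the real `PointXYZIRCAEDT` has more fields), and `validatedPointCount` / `maxChannel` are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Stand-in for the subset of sensor_msgs::msg::PointCloud2 used here.
struct CloudMsg
{
  std::uint32_t width{0};
  std::uint32_t height{0};
  std::uint32_t point_step{0};  // bytes per point
  std::vector<std::uint8_t> data;
};

// Hypothetical reduced point layout (16 bytes).
struct PointXYZC
{
  float x, y, z;
  std::uint16_t channel;
  std::uint16_t padding;
};

// Returns the point count if the buffer is consistent with PointXYZC,
// or std::nullopt if the layout or the buffer size does not match.
inline std::optional<std::size_t> validatedPointCount(const CloudMsg & msg)
{
  if (msg.point_step != sizeof(PointXYZC)) {
    return std::nullopt;  // type check: unexpected point layout
  }
  const std::size_t num_points = static_cast<std::size_t>(msg.width) * msg.height;
  if (msg.data.size() < num_points * msg.point_step) {
    return std::nullopt;  // bounds check: buffer shorter than advertised
  }
  return num_points;
}

// Finds the largest channel id, touching the buffer only after validation.
inline std::optional<std::uint16_t> maxChannel(const CloudMsg & msg)
{
  const auto num_points = validatedPointCount(msg);
  if (!num_points) {
    return std::nullopt;
  }
  std::uint16_t max_channel = 0;
  for (std::size_t i = 0; i < *num_points; ++i) {
    PointXYZC p;
    // memcpy avoids alignment and aliasing problems of a reinterpret_cast.
    std::memcpy(&p, msg.data.data() + i * msg.point_step, sizeof(p));
    max_channel = std::max(max_channel, p.channel);
  }
  return max_channel;
}
```

The point is that `width`, `height`, and `point_step` come from the message and are not trusted until they have been compared against the actual buffer size and the expected struct size.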

mojomex · 2024-12-24T05:57:53Z

...uda_pointcloud_preprocessor/src/cuda_pointcloud_preprocessor/cuda_pointcloud_preprocessor.cu

+  if (idx < num_points && masks[idx] == 1) {
+    output_points[indices[idx] - 1] = input_points[idx];
+  }
+}


These two functions are identical except for their argument types. Consider making one templated function instead.
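
A host-side C++ sketch of the templated version follows. It is not the CUDA kernel itself: `extractPoints` is an illustrative name, and treating `indices` as an inclusive prefix sum of `masks` is an inference from the `indices[idx] - 1` write in the snippet above. The same template parameter could be applied to the `__global__` function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Mask-based compaction, templated on the point type so that a single
// implementation serves every point layout. Masks are expected to be 0 or 1.
template <typename PointT>
std::vector<PointT> extractPoints(
  const std::vector<PointT> & input_points, const std::vector<std::uint32_t> & masks)
{
  assert(input_points.size() == masks.size());

  // indices[i] = number of kept points in [0, i] (inclusive prefix sum).
  std::vector<std::uint32_t> indices(masks.size());
  std::partial_sum(masks.begin(), masks.end(), indices.begin());

  const std::size_t num_output = indices.empty() ? 0 : indices.back();
  std::vector<PointT> output_points(num_output);

  // Same body as the kernel above, with the thread index replaced by a loop.
  for (std::size_t idx = 0; idx < input_points.size(); ++idx) {
    if (masks[idx] == 1) {
      output_points[indices[idx] - 1] = input_points[idx];
    }
  }
  return output_points;
}
```

Because kept points are written to `indices[idx] - 1`, the output preserves the input order, and no two iterations write to the same slot.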

feat: moved the cuda pointcloud preprocessor and organized from a per…

af2e884

…sonal repository Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) tag:require-cuda-build-and-test labels Nov 25, 2024

knzo25 self-assigned this Nov 25, 2024

knzo25 added 4 commits November 25, 2024 18:06

chore: fixed incorrect links

774e099

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

chore: fixed dead links pt2

be04f76

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

chore: fixed spelling errors

db02ec7

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

chore: json schema fixes

5218a4a

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

This was referenced Nov 25, 2024

feat(autoware_lidar_centerpoint): added the cuda_blackboard to centerpoint #9453

Open

feat(autoware_pointcloud_preprocessor): cuda-accelerated pointcloud concatenation #9455

Draft

feat: acceleration and transport layer tier4/aip_launcher#348

Draft

knzo25 marked this pull request as ready for review November 25, 2024 09:47

yukkysaito requested review from YoshiRi and drwnz November 26, 2024 01:22

knzo25 added 2 commits November 26, 2024 13:40

chore: removed comments and filled the fields

4a9daaf

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

fix: fixed the adapter for the case when the number of points in the …

84a4b9f

…pointcloud changes after the first iteration Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

knzo25 mentioned this pull request Nov 26, 2024

feat: acceleration and transport layer autowarefoundation/sample_sensor_kit_launch#111

Draft

knzo25 requested review from manato, mojomex and amadeuszsz November 26, 2024 05:38

knzo25 added 2 commits December 23, 2024 14:12

Merge remote-tracking branch 'awf/main' into feat/cuda_blackboard_poi…

1e27534

…ntcloud_preprocessing

feat: used the cuda host allocators for aster host to device copies

b3c1d72

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>

knzo25 mentioned this pull request Dec 23, 2024

Leverage cuda acceleration in the sensing perception pipeline #9722

Open

3 tasks

technolojin self-requested a review December 24, 2024 01:22

mojomex requested changes Dec 24, 2024
