Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) #4185

Merged

Conversation

mojomex
Copy link
Contributor

@mojomex mojomex commented Jul 6, 2023

Description

This is the completion of @VRichardJP's PR #3293 for a cache friendly implementation of ring_outlier_filter.
The implementation goes through the point cloud (which is in azimuth-major order) sequentially and keeps track of walks in the LiDAR rings concurrently. See #3269 for a description of the problem solved here.

Related links

Tests performed

See the posts below.

Notes for reviewers

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

  • The PR follows the pull request guidelines.
  • The PR has been properly tested.
  • The PR has been reviewed by the code owners.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • There are no open discussions or they are tracked via tickets.
  • The PR is ready for merge.

After all checkboxes are checked, anyone who has write access can merge the PR.

Summary by CodeRabbit

Release Notes:

  • New Feature: The ring_outlier_filter implementation has undergone changes to its core parameters. The max_points_num_per_ring parameter has been removed, and the description for max_rings_num is missing.
  • Refactor: The isCluster function has been removed from the RingOutlierFilterComponent class, indicating a change or relocation of the logic for determining clusters.
  • Documentation: The package manifest file (package.xml) has been updated to include a new dependency on range-v3.

"Code changes dance,
Parameters take a chance.
Clusters shift their stance,
With dependencies enhanced."

@github-actions github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) labels Jul 6, 2023
@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch 4 times, most recently from 3568529 to 8d0d080 Compare July 11, 2023 09:24
@mojomex
Copy link
Contributor Author

mojomex commented Jul 11, 2023

It took some time to set up valid performance benchmarks on my new laptop but here they are.

All benchmarks are done on real-world VLS128 data (ca. 60s long), on an isolated core of my machine, with turbo boost disabled and all cores pinned to the highest non-boost frequency (2.3 GHz).
All benchmarks have been repeated 3 times, the combined measurements are seen below.

My machine:
CPU: Intel Core i7-12700H
L1d: 544 KiB (14 instances)
L1i: 704 KiB (14 instances)
L2: 11.5 MiB (8 instances)
L3: 24 MiB (1 instance)
RAM: 32 GB

I have benchmarked the following implementations:

Label Implementation Results
before Original implementation slow
before w/ cos Original implementation, but isCluster is modified to use the law of cosines equally slow (within measurement tolerance)
rich @VRichardJP's implementation (I fixed the remaining errors) fast (~25% speedup)
rich w/o cos Richard's implementation, but with the euclidean isCluster from before (needs to store XYZ for each ring's current walk start and end point a little less fast (~15% speedup)

This shows that the trigonometry done in isCluster's law of cosines computation is well worth the extra speedup achieved by the emptier cache.

image

Richard's implementation becomes faster where the original becomes slower; to investigate this, I plotted the number of input/remaining output points over time. The original implementation has significant overhead for copying points into the output because it reorders the points in ring-major (instead of the incoming azimuth-major) order.

image

As shown in Richard's pull request, the number of cache misses significantly decreases (even with a big cache), in my case by around 35 %.

image

I will test this tomorrow on VLP16 data to confirm there is no (significant) performance regression, then I'm good to mark this as ready to merge.

@VRichardJP
Copy link
Contributor

VRichardJP commented Jul 11, 2023

@mojomex Thank you so much for taking over!

If I understand your second graph, the clouds are more dense and contain fewer isolated points from around 20s in the rosbag. The original implementation performance gets worse because cache miss is the bottleneck, and more points means increased cache miss count. However I don't really understand why my implementation would be much faster. Is it because with this implementation the bottleneck is not the number of points, but rather the number/length of walks?

For example, if we randomly move X% of the input points so they become noisy points to be removed, is there a threshold from which the original implementation will perform faster? I don't know how noisy lidar data may become under heavy condition (e.g. heavy rain, fog, etc), maybe like 7~8%?

@mojomex
Copy link
Contributor Author

mojomex commented Jul 12, 2023

Mystery solved, your intuition that the number of walks is the cause of this timing behavior was correct.

The upper plot shows the timing of each of the 5 phases of the function (stacked, in total they equal the whole processing time).
In black, the number of walks generated is overlaid.
I have highlighted 3 sections (startup, many walks, few walks). The 5 phases are shown as box plots for each of those sections.
Dotted black lines show the baseline mean (the last section, few walks).

image

It can be seen that in the "many walks" section, the build walks and check clusters phase are significantly slower than in the last section.

This makes sense when looking at the code:

point_walk_idx = ring.walk_idx;
} else {
// previous walk is finished, start a new one
walks.push_back(WalkInfo{1, curr_distance, curr_azimuth, curr_distance, curr_azimuth});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reserve max_rings_num_ walks in the walks vector, so it needs to be resized multiple times when appending on the order of 16000 walks, as is the case in the "many walks" section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I only allocated the strict minimum (1 walk per ring), but a bigger value might be wiser.

std::vector<bool> walk_is_cluster;
walk_is_cluster.reserve(walks.size());
for (const auto & walk_info : walks) {
walk_is_cluster.push_back(isCluster(walk_info));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section's runtime of course scales linearly with the number of walks.

@mojomex
Copy link
Contributor Author

mojomex commented Jul 12, 2023

I have made the walks array fixed-size now, storing only the first and previous walk of any given ring.
The points_walk_idx array is now the points_alk_id array, storing not an index into walks but unique walk ID, which is used as an index into the walks_cluster_status boolean array, containing the results of isCluster. Because walks are immediately discarded when they are finished, isCluster is now immediately called when that happens.

The points are then checked against the cluster status of their assigned walk ID, without needing the actual walk struct to exist.
Notably, the walks_cluster_status array can also blow up to at most input->width elements (as the walks array could before). But, since it only stores bools instead of structs with a few values in them, there is a slight speedup (see purple (new) vs. green (proposed by Richard)).

I have tried reserving a large number of elements for the walks_cluster_status array, it does not make a measurable difference to letting it grow.

This is admittedly a small improvement but imho it did not make the code less readable and it also improves memory usage.

image

As can be seen, it's not entirely possible to get rid of the higher computation times in the first third.

@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch from 8d0d080 to 74a48dc Compare July 12, 2023 08:13
@mojomex mojomex marked this pull request as ready for review July 12, 2023 08:13
@yukkysaito
Copy link
Contributor

@mojomex Thank you for this PR 👍
Is it possible to split the function and abstract it since it is a bit large?

@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch from 4544f70 to ba07e47 Compare July 13, 2023 04:15
@mojomex
Copy link
Contributor Author

mojomex commented Jul 13, 2023

@yukkysaito I have simplified the code (and by effect also further improved performance).

I have removed the functions for finding PointCloud2 field offsets and for accessing those fields, that Richard added.
Because the ring_outlier_filter docs specify the expected input pointcloud format, it's reasonable to go with the old approach (before Richard's PR). That approach expected all the fields to exist, thus making the safe accessors unnecesary.

I also removed the walks vector as we are storing a constant 2 walks per ring now. I am storing 2 WalkInfo structs directly in the RingWalkInfo struct now. This also results in a performance gain (see the red points in the chart below).

We are now at ~33% faster performance compared to the original implementation.

image

Copy link
Contributor

@yukkysaito yukkysaito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch from 1529526 to 574b17a Compare July 13, 2023 05:57
@codecov
Copy link

codecov bot commented Jul 13, 2023

Codecov Report

Attention: Patch coverage is 0% with 98 lines in your changes missing coverage. Please review.

Project coverage is 14.90%. Comparing base (0730888) to head (f9054d9).
Report is 3342 commits behind head on main.

Files with missing lines Patch % Lines
...src/outlier_filter/ring_outlier_filter_nodelet.cpp 0.00% 98 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4185      +/-   ##
==========================================
- Coverage   14.91%   14.90%   -0.01%     
==========================================
  Files        1520     1520              
  Lines      104738   104785      +47     
  Branches    31884    31884              
==========================================
  Hits        15619    15619              
- Misses      72067    72114      +47     
  Partials    17052    17052              
Flag Coverage Δ *Carryforward flag
differential 6.72% <0.00%> (?)
total 14.92% <ø> (+<0.01%) ⬆️ Carriedforward from 0730888

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@VRichardJP
Copy link
Contributor

VRichardJP commented Jul 13, 2023

I would advice against blindly assuming the data layout. This has caused this kind of issues in the past: #2618

I had opened #3318 to deal with that, but at the time ring_outlier_filter was still under rework.

For example, something like this would be nice:

{
  bool missing_field = false;
  missing_field |= !autoware_point_types::hasField<autoware_point_types::PointIndex::Ring>(input);
  missing_field |= !autoware_point_types::hasField<autoware_point_types::PointIndex::Azimuth>(input);
  //...

  if(missing_field) {
    // Print err, etc
  }

  // or as I suggested in the issue:
  if(!utils::is_data_layout_compatible_with_PointXYZI(*input)) {
  // Print err, etc
  }

Or anything that could make sure that the data we read is actually what we want

@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch 2 times, most recently from 18fda2a to 8bb958c Compare July 14, 2023 05:50
@mojomex
Copy link
Contributor Author

mojomex commented Jul 14, 2023

Thanks for the feedback!
I implemented a generic accessor function that looks up the field index and safely reads the offset.
Errors are logged and the function exits if any of the required fields is not there.
The field's data type is also checked against the expected one.

I would have liked to reuse utils::is_data_layout_compatible_with_PointXYZIRADRT but the ring_outlier_filter spec expects Point_XYZIRADT (without return type).
Let's hope this can be implemented more cleanly after discussion #3442.

@mojomex mojomex requested a review from yukkysaito July 14, 2023 06:27
@mojomex mojomex force-pushed the cache_friendly_ring_outlier_filter branch from 1c1be9a to 98e1950 Compare July 25, 2023 01:13
@mojomex
Copy link
Contributor Author

mojomex commented Aug 3, 2023

@mojomex Can I try to Enable review by ChatGPT on a trial basis? It may be a little noisy.

@yukkysaito Seems like it has no valid API key:

Warning: Skipped: failed to create summary bot, please check your openai_api_key: Error: Unable to initialize the OpenAI API, both 'OPENAI_API_KEY' environment variable are not available, backtrace: Error: Unable to initialize the OpenAI API, both 'OPENAI_API_KEY' environment variable are not available

@github-actions
Copy link

github-actions bot commented Aug 4, 2023

Image description CodeRabbit

Walkthrough

The changes include modifications to the ring_outlier_filter implementation and the package manifest file in the sensing/pointcloud_preprocessor directory. The core parameters of the ring_outlier_filter have been updated, with the removal of the max_points_num_per_ring parameter and a missing description for the max_rings_num parameter. Additionally, the logic for determining clusters has been changed or moved elsewhere, as the isCluster function has been removed. A new dependency on range-v3 has been added and included in the package manifest file.

Changes

File(s) Summary
sensing/pointcloud_preprocessor/docs/ring-outlier-filter.md The diff includes changes to the core parameters of the ring_outlier_filter implementation. The max_points_num_per_ring parameter has been removed from the list of parameters. The description for max_rings_num is missing.
sensing/pointcloud_preprocessor/include/pointcloud_preprocessor/outlier_filter/ring_outlier_filter_nodelet.hpp The diff removes the isCluster function from the RingOutlierFilterComponent class. This function was previously used to determine if a set of points formed a cluster based on certain thresholds. The removal of this function suggests that the logic for determining clusters has been changed or moved elsewhere.
sensing/pointcloud_preprocessor/package.xml The diff adds a new dependency on range-v3 and updates the package manifest file to include it.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image description CodeRabbit

Commits Files that changed from the base of the PR and between 0730888 and f9054d9 commits.
Files selected (4)
  • sensing/pointcloud_preprocessor/docs/ring-outlier-filter.md (1)
  • sensing/pointcloud_preprocessor/include/pointcloud_preprocessor/outlier_filter/ring_outlier_filter_nodelet.hpp (1)
  • sensing/pointcloud_preprocessor/package.xml (1)
  • sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp (4)
Files not summarized due to errors (1)
  • sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp (diff tokens exceeds limit)
Files skipped from review due to trivial changes (2)
  • sensing/pointcloud_preprocessor/docs/ring-outlier-filter.md
  • sensing/pointcloud_preprocessor/package.xml
Review comments generated (3)
  • Review: 3
  • LGTM: 0

Tips

Chat with Image description CodeRabbit Bot (@coderabbitai)

  • Reply on review comments left by this bot to ask follow-up questions. A review comment is a comment on a diff or a file.
  • Invite the bot into a review comment chain by tagging @coderabbitai in a reply.

Code suggestions

  • The bot may make code suggestions, but please review them carefully before committing since the line number ranges may be misaligned.
  • You can edit the comment made by the bot and manually tweak the suggestion if it is slightly off.

Pausing incremental reviews

  • Add @coderabbitai: ignore anywhere in the PR description to pause further reviews from the bot.

@yukkysaito
Copy link
Contributor

@mojomex

Seems like it has no valid API key:

#4515

It seemed that the secret key could not be referenced from the forked repository, so it was not working properly. This PR solved it.

I think it is working correctly now.

@mojomex
Copy link
Contributor Author

mojomex commented Aug 22, 2023

The CI tasks for this PR seem not to be picked up by the self-hosted runners, they have been queuing for almost 3 weeks.
Is there something wrong with the configuration in my fork or this PR's tags or is it just bad luck?
I would really appreciate if @yukkysaito or @miursh could take a look 🙇

@miursh miursh added run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) and removed run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) labels Aug 23, 2023
@miursh
Copy link
Contributor

miursh commented Aug 23, 2023

@mojomex You can merge this PR after the CI (build-and-test-differential) passes. https://github.com/autowarefoundation/autoware.universe/actions/runs/5945676273/job/16125118660?pr=4185

@mojomex
Copy link
Contributor Author

mojomex commented Aug 23, 2023

@mojomex You can merge this PR after the CI (build-and-test-differential) passes. https://github.com/autowarefoundation/autoware.universe/actions/runs/5945676273/job/16125118660?pr=4185

Thank you for re-adding the label! I don't have write access to this repo, would you mind merging it instead, @miursh?

@miursh miursh merged commit e45c1ce into autowarefoundation:main Aug 23, 2023
@mojomex mojomex deleted the cache_friendly_ring_outlier_filter branch August 23, 2023 05:22
kminoda pushed a commit to kminoda/autoware.universe that referenced this pull request Aug 23, 2023
…chardJP's work) (autowarefoundation#4185)

* perf(ring_outlier_filter): a cache friendly impl

Signed-off-by: Vincent Richard <[email protected]>

style(pre-commit): autofix

fix(ring_outlier_filter): cleanup code with ranges

Signed-off-by: Vincent Richard <[email protected]>

[breaking] fix autoware point type padding for faster memory access

can memcpy between input data buffer and PointXYZI*
make assumption on memory layout for faster data fetch
misc optimization (reordering, constexpr, etc)

Signed-off-by: Vincent Richard <[email protected]>

style(pre-commit): autofix

comment limitations

Signed-off-by: Vincent Richard <[email protected]>

feat(ring_outlier_filter): add accurate isCluster impl

Signed-off-by: Vincent Richard <[email protected]>

style(pre-commit): autofix

fix autowarefoundation#3218

Signed-off-by: Vincent Richard <[email protected]>

cleaning

Signed-off-by: Vincent Richard <[email protected]>

style(pre-commit): autofix

* style(pre-commit): autofix

* resize vector to data size

Signed-off-by: Vincent Richard <[email protected]>

* cleaning

Signed-off-by: Vincent Richard <[email protected]>

* cleaner utilities impl

Signed-off-by: Vincent Richard <[email protected]>

* Correct ring_outlier_filter algorithm

The implementation by VRichardJP was erroneously using indices into the
input->fields array as byte offsets for copying values from the input.
Added the missing step to access the offset of the field pointed at by
the respective index.

When combining the first and last walk, the num_points fields of the two
walks need to be summed up as well. isCluster checks for num_points
as well as distance.

Signed-off-by: Maximilian Schmeller <[email protected]>

* Optimize memory usage of walks array

The walks array in the previous implementation could in the worst case grow
to the number of input points (if every point is an outlier).
This increased computation time for gradually growing the walks vector.

The current implementation only saves the first and current walk for
every ring. The isCluster check is performed right when a walk is
completed, and a bool vector akin to the previous walk_is_cluster
vector is maintained.

Signed-off-by: Maximilian Schmeller <[email protected]>

* Remove debug timers and outputs

Signed-off-by: Maximilian Schmeller <[email protected]>

* style(pre-commit): autofix

* Reduce amount of helper functions and data structures

Remove functions for finding field offsets of Pointcloud2 points,
replace them with the previous approach of assuming their position.
This is okay since the ring_outlier_filter specification clearly states
the expected pointcloud format.

Remove accessor lambda functions for ring/azimuth/etc. because they are
onlu used once anyways.

Remove the walks vector and save the first/current walk structs directly
in the RingWalkInfo struct. This makes the code more readable and
maintainable.

Signed-off-by: Maximilian Schmeller <[email protected]>

* style(pre-commit): autofix

* Check for invalid pointcloud format

In the previous revision (and before this PR), pointcloud field indices
in ring_outlier_filter were hardcoded and fields were assumed to exist.

VRichardJP solved this by implementing individual accessor functions,
which ended up to be a bit verbose for this small piece of
functionality.

Now, there is one generic accessor function and the faster_filter
function exits if any of the fields expected does not exist.
The expected field data type is also checked.

Signed-off-by: Maximilian Schmeller <[email protected]>

* style(pre-commit): autofix

---------

Signed-off-by: Vincent Richard <[email protected]>
Signed-off-by: Maximilian Schmeller <[email protected]>
Co-authored-by: Vincent Richard <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Schmeller <[email protected]>
tzhong518 added a commit to tzhong518/autoware.universe that referenced this pull request Jan 23, 2024
tzhong518 added a commit to tzhong518/autoware.universe that referenced this pull request Jan 23, 2024
tzhong518 added a commit to tzhong518/autoware.universe that referenced this pull request Jan 23, 2024
@tzhong518 tzhong518 mentioned this pull request Jan 23, 2024
7 tasks
tzhong518 added a commit to tzhong518/autoware.universe that referenced this pull request Jan 23, 2024
tzhong518 added a commit to tzhong518/autoware.universe that referenced this pull request Jan 23, 2024
miursh pushed a commit that referenced this pull request Jan 24, 2024
…gation (#6148)

Revert "perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) (#4185)"

This reverts commit e45c1ce.
@xmfcx
Copy link
Contributor

xmfcx commented Jan 24, 2024

This was reverted, to reproduce the problem, @miursh -san shared the data:

shmpwk pushed a commit to tier4/autoware.universe that referenced this pull request Jan 25, 2024
…gation (autowarefoundation#6148)

Revert "perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) (autowarefoundation#4185)"

This reverts commit e45c1ce.
karishma1911 pushed a commit to Interplai/autoware.universe that referenced this pull request Jun 3, 2024
…gation (autowarefoundation#6148)

Revert "perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) (autowarefoundation#4185)"

This reverts commit e45c1ce.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) tag:require-cuda-build-and-test type:documentation Creating or refining documentation. (auto-assigned)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants