-
Notifications
You must be signed in to change notification settings - Fork 672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) #4185
perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) #4185
Conversation
3568529
to
8d0d080
Compare
It took some time to set up valid performance benchmarks on my new laptop but here they are. All benchmarks are done on real-world VLS128 data (ca. 60s long), on an isolated core of my machine, with turbo boost disabled and all cores pinned to the highest non-boost frequency (2.3 GHz). My machine: I have benchmarked the following implementations:
This shows that the trigonometry done in isCluster's law of cosines computation is well worth the extra speedup achieved by the emptier cache. Richard's implementation becomes faster where the original becomes slower; to investigate this, I plotted the number of input/remaining output points over time. The original implementation has significant overhead for copying points into the output because it reorders the points in ring-major (instead of the incoming azimuth-major) order. As shown in Richard's pull request, the number of cache misses significantly decreases (even with a big cache), in my case by around 35 %. I will test this tomorrow on VLP16 data to confirm there is no (significant) performance regression, then I'm good to mark this as ready to merge. |
@mojomex Thank you so much for taking over! If I understand your second graph, the clouds are more dense and contain fewer isolated points from around 20s in the rosbag. The original implementation performance gets worse because cache miss is the bottleneck, and more points means increased cache miss count. However I don't really understand why my implementation would be much faster. Is it because with this implementation the bottleneck is not the number of points, but rather the number/length of walks? For example, if we randomly move X% of the input points so they become noisy points to be removed, is there a threshold from which the original implementation will perform faster? I don't know how noisy lidar data may become under heavy condition (e.g. heavy rain, fog, etc), maybe like 7~8%? |
Mystery solved, your intuition that the number of walks is the cause of this timing behavior was correct. The upper plot shows the timing of each of the 5 phases of the function (stacked, in total they equal the whole processing time). It can be seen that in the "many walks" section, the build walks and check clusters phase are significantly slower than in the last section. This makes sense when looking at the code: |
point_walk_idx = ring.walk_idx; | ||
} else { | ||
// previous walk is finished, start a new one | ||
walks.push_back(WalkInfo{1, curr_distance, curr_azimuth, curr_distance, curr_azimuth}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We reserve max_rings_num_
walks in the walks
vector, so it needs to be resized multiple times when appending on the order of 16000 walks, as is the case in the "many walks" section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I only allocated the strict minimum (1 walk per ring), but a bigger value might be wiser.
std::vector<bool> walk_is_cluster; | ||
walk_is_cluster.reserve(walks.size()); | ||
for (const auto & walk_info : walks) { | ||
walk_is_cluster.push_back(isCluster(walk_info)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section's runtime of course scales linearly with the number of walks.
I have made the walks array fixed-size now, storing only the first and previous walk of any given ring. The points are then checked against the cluster status of their assigned walk ID, without needing the actual walk struct to exist. I have tried reserving a large number of elements for the This is admittedly a small improvement but imho it did not make the code less readable and it also improves memory usage. As can be seen, it's not entirely possible to get rid of the higher computation times in the first third. |
8d0d080
to
74a48dc
Compare
@mojomex Thank you for this PR 👍 |
sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp
Show resolved
Hide resolved
4544f70
to
ba07e47
Compare
@yukkysaito I have simplified the code (and by effect also further improved performance). I have removed the functions for finding PointCloud2 field offsets and for accessing those fields, that Richard added. I also removed the We are now at ~33% faster performance compared to the original implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1529526
to
574b17a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4185 +/- ##
==========================================
- Coverage 14.91% 14.90% -0.01%
==========================================
Files 1520 1520
Lines 104738 104785 +47
Branches 31884 31884
==========================================
Hits 15619 15619
- Misses 72067 72114 +47
Partials 17052 17052
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Sentry. |
I would advice against blindly assuming the data layout. This has caused this kind of issues in the past: #2618 I had opened #3318 to deal with that, but at the time ring_outlier_filter was still under rework. For example, something like this would be nice: {
bool missing_field = false;
missing_field |= !autoware_point_types::hasField<autoware_point_types::PointIndex::Ring>(input);
missing_field |= !autoware_point_types::hasField<autoware_point_types::PointIndex::Azimuth>(input);
//...
if(missing_field) {
// Print err, etc
}
// or as I suggested in the issue:
if(!utils::is_data_layout_compatible_with_PointXYZI(*input)) {
// Print err, etc
} Or anything that could make sure that the data we read is actually what we want |
18fda2a
to
8bb958c
Compare
Thanks for the feedback! I would have liked to reuse |
1c1be9a
to
98e1950
Compare
@yukkysaito Seems like it has no valid API key:
|
WalkthroughThe changes include modifications to the Changes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Files selected (4)
- sensing/pointcloud_preprocessor/docs/ring-outlier-filter.md (1)
- sensing/pointcloud_preprocessor/include/pointcloud_preprocessor/outlier_filter/ring_outlier_filter_nodelet.hpp (1)
- sensing/pointcloud_preprocessor/package.xml (1)
- sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp (4)
Files not summarized due to errors (1)
- sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp (diff tokens exceeds limit)
Files skipped from review due to trivial changes (2)
- sensing/pointcloud_preprocessor/docs/ring-outlier-filter.md
- sensing/pointcloud_preprocessor/package.xml
Review comments generated (3)
- Review: 3
- LGTM: 0
Tips
Chat with CodeRabbit Bot (@coderabbitai
)
- Reply on review comments left by this bot to ask follow-up questions. A review comment is a comment on a diff or a file.
- Invite the bot into a review comment chain by tagging
@coderabbitai
in a reply.
Code suggestions
- The bot may make code suggestions, but please review them carefully before committing since the line number ranges may be misaligned.
- You can edit the comment made by the bot and manually tweak the suggestion if it is slightly off.
Pausing incremental reviews
- Add
@coderabbitai: ignore
anywhere in the PR description to pause further reviews from the bot.
..._preprocessor/include/pointcloud_preprocessor/outlier_filter/ring_outlier_filter_nodelet.hpp
Show resolved
Hide resolved
sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp
Show resolved
Hide resolved
sensing/pointcloud_preprocessor/src/outlier_filter/ring_outlier_filter_nodelet.cpp
Show resolved
Hide resolved
The CI tasks for this PR seem not to be picked up by the self-hosted runners, they have been queuing for almost 3 weeks. |
@mojomex You can merge this PR after the CI (build-and-test-differential) passes. https://github.com/autowarefoundation/autoware.universe/actions/runs/5945676273/job/16125118660?pr=4185 |
Thank you for re-adding the label! I don't have write access to this repo, would you mind merging it instead, @miursh? |
…chardJP's work) (autowarefoundation#4185) * perf(ring_outlier_filter): a cache friendly impl Signed-off-by: Vincent Richard <[email protected]> style(pre-commit): autofix fix(ring_outlier_filter): cleanup code with ranges Signed-off-by: Vincent Richard <[email protected]> [breaking] fix autoware point type padding for faster memory access can memcpy between input data buffer and PointXYZI* make assumption on memory layout for faster data fetch misc optimization (reordering, constexpr, etc) Signed-off-by: Vincent Richard <[email protected]> style(pre-commit): autofix comment limitations Signed-off-by: Vincent Richard <[email protected]> feat(ring_outlier_filter): add accurate isCluster impl Signed-off-by: Vincent Richard <[email protected]> style(pre-commit): autofix fix autowarefoundation#3218 Signed-off-by: Vincent Richard <[email protected]> cleaning Signed-off-by: Vincent Richard <[email protected]> style(pre-commit): autofix * style(pre-commit): autofix * resize vector to data size Signed-off-by: Vincent Richard <[email protected]> * cleaning Signed-off-by: Vincent Richard <[email protected]> * cleaner utilities impl Signed-off-by: Vincent Richard <[email protected]> * Correct ring_outlier_filter algorithm The implementation by VRichardJP was erroneously using indices into the input->fields array as byte offsets for copying values from the input. Added the missing step to access the offset of the field pointed at by the respective index. When combining the first and last walk, the num_points fields of the two walks need to be summed up as well. isCluster checks for num_points as well as distance. Signed-off-by: Maximilian Schmeller <[email protected]> * Optimize memory usage of walks array The walks array in the previous implementation could in the worst case grow to the number of input points (if every point is an outlier). This increased computation time for gradually growing the walks vector. The current implementation only saves the first and current walk for every ring. The isCluster check is performed right when a walk is completed, and a bool vector akin to the previous walk_is_cluster vector is maintained. Signed-off-by: Maximilian Schmeller <[email protected]> * Remove debug timers and outputs Signed-off-by: Maximilian Schmeller <[email protected]> * style(pre-commit): autofix * Reduce amount of helper functions and data structures Remove functions for finding field offsets of Pointcloud2 points, replace them with the previous approach of assuming their position. This is okay since the ring_outlier_filter specification clearly states the expected pointcloud format. Remove accessor lambda functions for ring/azimuth/etc. because they are onlu used once anyways. Remove the walks vector and save the first/current walk structs directly in the RingWalkInfo struct. This makes the code more readable and maintainable. Signed-off-by: Maximilian Schmeller <[email protected]> * style(pre-commit): autofix * Check for invalid pointcloud format In the previous revision (and before this PR), pointcloud field indices in ring_outlier_filter were hardcoded and fields were assumed to exist. VRichardJP solved this by implementing individual accessor functions, which ended up to be a bit verbose for this small piece of functionality. Now, there is one generic accessor function and the faster_filter function exits if any of the fields expected does not exist. The expected field data type is also checked. Signed-off-by: Maximilian Schmeller <[email protected]> * style(pre-commit): autofix --------- Signed-off-by: Vincent Richard <[email protected]> Signed-off-by: Maximilian Schmeller <[email protected]> Co-authored-by: Vincent Richard <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Maximilian Schmeller <[email protected]>
…n of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
…n of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
…n of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
…n of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
…n of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
This was reverted, to reproduce the problem, @miursh -san shared the data: |
…gation (autowarefoundation#6148) Revert "perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
…gation (autowarefoundation#6148) Revert "perf(ring_outlier_filter): a cache friendly impl (continuation of VRichardJP's work) (autowarefoundation#4185)" This reverts commit e45c1ce.
Description
This is the completion of @VRichardJP's PR #3293 for a cache friendly implementation of ring_outlier_filter.
The implementation goes through the point cloud (which is in azimuth-major order) sequentially and keeps track of walks in the LiDAR rings concurrently. See #3269 for a description of the problem solved here.
Related links
ring_outlier_filter
blows up my cache (and yours too) #3269Tests performed
See the posts below.
Notes for reviewers
Pre-review checklist for the PR author
The PR author must check the checkboxes below when creating the PR.
In-review checklist for the PR reviewers
The PR reviewers must check the checkboxes below before approval.
Post-review checklist for the PR author
The PR author must check the checkboxes below before merging.
After all checkboxes are checked, anyone who has write access can merge the PR.
Summary by CodeRabbit
Release Notes:
ring_outlier_filter
implementation has undergone changes to its core parameters. Themax_points_num_per_ring
parameter has been removed, and the description formax_rings_num
is missing.isCluster
function has been removed from theRingOutlierFilterComponent
class, indicating a change or relocation of the logic for determining clusters.package.xml
) has been updated to include a new dependency onrange-v3
.