
Add support for limiting num_threads in tbb task #252

Merged
merged 50 commits · Mar 18, 2024

Conversation

@zzodo (Contributor) commented Nov 2, 2023

kiss-icp uses the tbb library to parallelize the loops in correspondence search and point registration, and this has helped achieve real-time capability at the application level.
Under some conditions, however, I have seen these tbb-parallelized sections consume all available computing resources.

So this feature, which configures the number of threads for tbb tasks, lets us leave some room for other processes running alongside kiss-icp.

Can you confirm this?


And I have some more suggestions:

  • Move the correspondence search method to Registration.cpp.
    • This can be done by adding an interface (e.g. voxel_hash_map.begin() and voxel_hash_map.end()) as a const_iterator.
    • In this way, registration takes full responsibility for searching correspondences between the point clouds.
    • Migrating this responsibility to the registration class may be useful for further customization, while removing hard-coded hyper-parameters such as the voxel search radius here.
  • Declare registration as a class.
    • This is also related to the customizability of the const-expressed variable in this line.
    • I personally believe that declaring registration as a class can help C++ library users.

@nachovizzo (Collaborator)

Hello @zzodo Thanks for the contribution.

Do you have any dataset showing the behavior where KISS-ICP consumes all the resources? One of the design principles of TBB is to avoid exposing the number of threads to the user, since if I'm running this on a cluster of thousands of CPUs I'd also like to use all of them.

It might be beneficial to expose the number of threads through the configuration file but this will add yet another parameter to tune.

I will reply to the other questions later :)

Thanks again for opening the PR

@zzodo (Contributor, author) commented Nov 3, 2023

The "some conditions" I mentioned in the earlier comment were not about the dataset, but about other parallelized pipelines.
This behavior would not happen if we used kiss-icp alone.

I can elaborate my point as below:

  • Let us assume we have kiss-icp as the LiDAR odometry pipeline and a relocalization pipeline from another open-source package, for an autonomous mobile robot application.
  • These two pipelines run asynchronously.
  • If we run this on a low-power computing unit, each pipeline spawns threads in a kind of race over the available cores.
  • If the thread priorities of the pipelines are set to the same level, the concurrency of kiss-icp can be affected by the relocalization module.
  • This can lead to frame drops in the kiss-icp pipeline.

I also agree with your concern that the number of threads has to be configured from outside.
Then how about keeping max_concurrency() (which lets tbb determine the maximum num_threads) unless the user configures it themselves?
For example:

const std::size_t num_threads = num_threads_ ? num_threads_ : max_concurrency();

@nachovizzo (Collaborator)

@zzodo sorry for the late replies and back and forth.

I'm a bit concerned about introducing this change (@tizianoGuadagnino, @benemer, please tell me if you think differently). Mainly because I will not have the time in the next weeks to fully test it and make sure nothing is slipping through our hands.

But I have a proposal for you to acknowledge all the great work you have done. Something we can do as a first step to address the problems you report is to completely disable multi-threading. In most cases, the pipeline will still run at the sensor frame rate or even faster, so it's still deployable.

For this, we would need to merge this branch that I always try to keep up to date: https://github.com/PRBonn/kiss-icp/tree/nacho/single_thread_implementation. I feel this task is a bit more challenging, but it might be more beneficial for the community. The task should involve:

  • Merging the single-thread and multi-thread implementations
  • Allowing the user to select between one or the other at build time (e.g., I don't need multithreading, so no need to pull in the entire TBB library, but let me decide if I want to)
  • Exposing this in the configuration of the pipeline
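As a sketch of how the last bullet might look, the pipeline configuration could grow a thread cap next to the other registration parameters. The key names and values below are assumptions for illustration, not the actual merged API:

```yaml
# Hypothetical excerpt of an advanced configuration file
registration:
  max_num_iterations: 500        # assumed existing ICP iteration cap
  convergence_criterion: 0.0001  # assumed value for illustration
  max_num_threads: 0             # 0 = let TBB pick; N > 0 caps the thread pool
```

With a default of 0, users who never touch the option keep today's behavior.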

What do you think? Would you be interested in doing this bit?

Thanks again!

@zzodo (Contributor, author) commented Nov 6, 2023

Well, I also think those tasks may require days to be applied to this pull request.

A few concerns immediately came to mind, mainly due to the philosophy I sense in this repository, such as the cmake packaging rules, coding style, and header/source file hierarchy.

In detail, I got the impression that you have tried to avoid macros, conditional compilation, or other (unnecessary) constructs that could compromise 'keep it simple'. Is that right?

Nevertheless, I think I can deliver a draft if the other contributors agree with my proposal. And it would be my pleasure to hear your preferences, or any ideas on how the single-threaded version can be merged into the master branch.

Best.

@zzodo (Contributor, author) commented Nov 6, 2023

Additional details to my previous comment:

  • The reason I am asking about your preference is that you mentioned disabling the tbb library.
  • There is another way to restrict the maximum concurrency: using tbb::global_control.

Do you want to remove the tbb dependency in the single-threaded case?

@nachovizzo (Collaborator)

@zzodo sorry for being SO late.

I think I'm a bit in favor of changing this default behavior of the pipeline. The problem I'm currently having is more about our API design.

We can indeed make the registration part a class, but we have always tried to keep stuff simple and small... I'm trying to figure out if there is a better way to do this, since it would propagate a ton of changes across the C++, Python, and ROS 1/2 APIs.

But in general I'm in favor of adding a parametrization for this bit.

BTW: Where are all the nice plots? They are not on GH, but I saw them in the email thread.

@tizianoGuadagnino , @benemer It would be nice if you guys could comment a bit on this one

@nachovizzo nachovizzo requested a review from benemer February 11, 2024 20:39
Review comments on cpp/kiss_icp/core/Registration.cpp and cpp/kiss_icp/core/VoxelHashMap.hpp (outdated, resolved)
@nachovizzo (Collaborator)

To test a single thread (or another number) with the Python API, I used the following env variables

export OPENBLAS_NUM_THREADS=N && export MKL_NUM_THREADS=N && export mapping='{"max_threads": N}'

and picked N.

I also kept an eye on htop to make sure the changes were having an effect.

Then, on a private dataset, testing with a 20-core Intel i7 CPU and processing 1000 frames, this was the output:

max_threads   Avg. Hz
1             26
4             61
8             79
16            96
20            103
-1            99


Please take these results with a grain of salt; more experiments would need to be conducted to draw solid conclusions.

And @zzodo, it would be more than nice if you could add your experiments here!!

@zzodo (Contributor, author) commented Feb 13, 2024

Hi @nachovizzo, long time no see!

It has been a while, and I am trying to recall the previous issues I raised, and the experiment results (plots) too.

I re-cloned kiss-icp, built both the cpp and python packages, and tested the odometry pipeline on the MulRan dataset. It still works perfectly.

But one question here: I cannot understand your environment settings:

export OPENBLAS_NUM_THREADS=N && export MKL_NUM_THREADS=N && export mapping='{"max_threads": N}'

On my machine, the environment variables OPENBLAS_NUM_THREADS and MKL_NUM_THREADS do not affect tbb's maximum concurrency nor the average frequency of the registration. And the variable mapping='{"max_threads": N}' seems to be one of your customized config parameters.

I will share my experiments within a couple of days.
Thanks!

BTW: I deleted some of my past comments, since I thought the results on my private datasets were not sufficient 😅

@nachovizzo (Collaborator)

@zzodo the environment thing is something new we automatically pulled in with the new pydantic release. I'd say pip install -U pydantic should do the trick.

@zzodo (Contributor, author) commented Feb 13, 2024

Understood, but does it work without any Python interface that sets the maximum concurrency of tbb?

I expected something like config.mapping.max_threads

@zzodo (Contributor, author) commented Feb 13, 2024

Oh, I found b327258, which has not been pushed to this merge request.
I'll try it later!

@zzodo (Contributor, author) commented Feb 14, 2024

Here are my experiment results.

The experiment was conducted on my desktop computer, which has a 12th Gen Intel i5-12600KF with 16 cores.
For the dataset, I used the KAIST MulRan DCC01 sequence, from the 1500th frame to the 1600th frame.
You can reproduce similar results using this:

example test script
import os
from multiprocessing import Process
from pathlib import Path

from kiss_icp.datasets import dataset_factory
from kiss_icp.pipeline import OdometryPipeline
from kiss_icp_eval import run_sequence

# Build the odometry pipeline for one MulRan sequence (consumed by run_sequence).
def run_mulran_sequence(sequence: str):
  sequence_dir = data_dir / sequence
  return OdometryPipeline(
      dataset=dataset_factory(
          dataloader="mulran",
          data_dir=sequence_dir,
      ),
      deskew=False,
      n_scans=100,
      jump=1500,
  )

if __name__ == '__main__':
  data_root = "/home/dohoon/datasets"
  data_dir = Path(os.path.join(data_root, "MulRan"))
  print(f"Reading datasets from : {data_root}")

  all_sequences = {
      "dcc": ["DCC01"], # or dcc/DCC01
  }
  
  num_processes = 4
  processes = []
  
  for sequence in all_sequences["dcc"]:
      for _ in range(num_processes):
          process = Process(target=run_sequence, args=(run_mulran_sequence, {}), kwargs={'sequence':sequence})
          processes.append(process)
          process.start()

  for process in processes:
      process.join()

[Figure_1: plot of the experiment results omitted]

As you can see in the figure, there is a kind of race over computing resources when we leave their management entirely to the scheduler.

@nachovizzo nachovizzo mentioned this pull request Feb 28, 2024
@benemer (Member) commented Mar 1, 2024

Hey,

Sorry for not getting back to you sooner, but I didn't find the time to dig through this PR yet.

In general, I like the idea of merging the single and multi-thread implementations! Setting a maximum number of threads seems to make sense according to your experiments, thanks a lot for this insight.

Also, having a separate Registration class makes sense in my view. I use the voxel hash map in a different project and do not always need to find correspondences, so this is more of a feature that the Registration requires. Of course, this would introduce a lot of changes in our Python and ROS wrappers...

How do we proceed?

config/advanced.yaml Outdated Show resolved Hide resolved
@nachovizzo (Collaborator)


I will split this into two PRs, I guess, so we can review the two changes separately.

@nachovizzo nachovizzo closed this Mar 5, 2024
@nachovizzo nachovizzo reopened this Mar 5, 2024
@nachovizzo nachovizzo changed the base branch from nacho/strip_nn_search_from_voxel_hash_map to main March 5, 2024 13:01
@zzodo (Contributor, author) commented Mar 5, 2024

@nachovizzo

I am following up on your discussions and feel comfortable with the changes.
Although I have different preferences for naming the methods, the overall changes to the Core library look nice.

Thanks for all your effort!
If there is anything I can help with, let me know 😄

@benemer (Member) previously approved these changes Mar 11, 2024 and left a comment:

Looks good from my side now

@nachovizzo (Collaborator)

Update: this PR has no effect in ROS unless we merge #294. It's just a warning; this does not affect the scope of this PR.

@nachovizzo (Collaborator) previously approved these changes Mar 11, 2024 and left a comment:

self-approve? lol

Assuming 10 logical cores: before, Registration::max_num_threads_ == 0; now, Registration::max_num_threads_ == 10.

This way we hide less of what's going on under the hood.
@nachovizzo nachovizzo dismissed stale reviews from benemer and themself via f550a1a March 11, 2024 18:15
@tizianoGuadagnino (Collaborator) left a comment:

Looks good to me now. I renamed the global control variable in the Registration class; although its meaning was perhaps clear, I would rather not have single-letter variable names. We can merge as far as I'm concerned.

@nachovizzo nachovizzo merged commit 994d232 into PRBonn:main Mar 18, 2024
17 checks passed
@nachovizzo (Collaborator)

btw, a small detail that I postponed is the fact that the deskew+preprocessing module also makes use of tbb. Luckily we've picked the right namespace for the max_threads params, but it is something to pay attention to in an upcoming ticket about this.

@zzodo zzodo deleted the feature/set-num-threads branch March 18, 2024 13:21
@zzodo (Contributor, author) commented Apr 4, 2024

FYI:
I just noticed that the tbb::global_control variable is declared as static:

static const auto tbb_control_settings = tbb::global_control(

This is not problematic at the moment, but it may affect other tbb-parallelized pipelines.
From the official document:

The current set of parameters that you can modify is defined by the global_control::parameter enumeration. The parameter and the value it should take are specified as arguments to the constructor of a control variable. The impact of the control variable ends when its lifetime is complete.

We should keep our eyes on this!

@nachovizzo (Collaborator)


Thanks for noticing it. Yes, I know it's a nasty hack to control this variable with static duration... but I didn't want to clutter the registration class with minor details. This is the way I found to "hide" this implementation detail. If you have a better idea, please propose!

@zzodo (Contributor, author) commented Apr 5, 2024

EDIT for possible confusion:
I think it is beneficial to declare Registration as a class, to hide implementation details and restrict access to those variables from outside the class.

I would say it is better to keep this variable as a private member.
We can replace max_num_threads_ with a tbb::global_control member.
e.g. Registration.hpp:

#include <vector>

#include <Eigen/Core>
#include <sophus/se3.hpp>
#include <tbb/global_control.h>

#include "VoxelHashMap.hpp"

class Registration {
  public:
    explicit Registration(int max_num_iteration, double convergence_criterion, int max_num_threads)
        : max_num_iterations_(max_num_iteration),
          convergence_criterion_(convergence_criterion),
          tbb_control_(tbb::global_control::max_allowed_parallelism, max_num_threads) {}

    Sophus::SE3d AlignPointsToMap(const std::vector<Eigen::Vector3d> &frame,
                                  const VoxelHashMap &voxel_map,
                                  const Sophus::SE3d &initial_guess,
                                  double max_correspondence_distance,
                                  double kernel);

  private:
    int max_num_iterations_;
    double convergence_criterion_;
    tbb::global_control tbb_control_;
};

The lifetime of this global setting for tbb pipelines ends when the dtor of the Registration instance is called.

May I ask why you are trying to avoid declaring the tbb::global_control variable as a private member of the Registration class?
