Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async controllers #3

Open
wants to merge 92 commits into
base: master
Choose a base branch
from
Open

Async controllers #3

wants to merge 92 commits into from

Conversation

saikishor
Copy link
Owner

No description provided.

@saikishor saikishor force-pushed the async_controllers branch 2 times, most recently from 4fede49 to a21ef6f Compare April 15, 2024 21:22
@saikishor saikishor marked this pull request as ready for review April 18, 2024 13:25
controller_manager/src/controller_manager.cpp Outdated Show resolved Hide resolved
{
if (is_running())
{
async_update_stop_ = true;
Copy link

@atzaros atzaros Apr 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deadlock due to lost wake up: stop_async_update() could hang in thread_join() if lines 177-178 are executed while thread_ is in line 209

Surround async_update_stop_ with a mutex in both ends.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already fixed by adding the mutex on both ends

void stop_async_update()
{
if (is_running())
{
{
std::unique_lock<std::mutex> lock(async_mtx_);
async_update_stop_ = true;
}
async_update_condition_.notify_one();
thread_.join();
}
}

Copy link

@atzaros atzaros Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see any mutex protecting this line

ah yes it works but I consider a better practice to protect async_update_stop_ with a mutex because it makes harder this error:

while ((get_state_function_().id() == lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE ||
                  get_state_function_().id() ==
                    lifecycle_msgs::msg::State::TRANSITION_STATE_ACTIVATING) &&
                 !async_update_stop_)
          {
          doing_something_while_async_update_stop_is_true(); // <== my future me committing a bug
            {
              std::unique_lock<std::mutex> lock(async_mtx_);
              async_update_condition_.wait(
                lock, [this] { return trigger_in_progress_ || async_update_stop_; });
                ...

My suggestion is:

          std::unique_lock<std::mutex> lock(async_mtx_);
          while ((get_state_function_().id() == lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE ||
                  get_state_function_().id() ==
                    lifecycle_msgs::msg::State::TRANSITION_STATE_ACTIVATING) &&
                 !async_update_stop_)
            {
              async_update_condition_.wait(
                lock, [this] { return trigger_in_progress_ || async_update_stop_; });
              if (!async_update_stop_)
              {
                async_update_return_ = async_function_(current_update_time_, current_update_period_);
              }
              trigger_in_progress_ = false;
              cycle_end_condition_.notify_all();
            }

// Async related variables
std::thread thread_;
std::atomic_bool async_update_stop_{false};
std::atomic_bool trigger_in_progress_{false};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no atomic required: if line 195 is changed to lock.owns_lock() && !trigger_in_progress_

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance reasons, now you can remove atomic from trigger_in_progress_

lock, [this] { return trigger_in_progress_ || async_update_stop_; });
if (async_update_stop_)
{
break;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we break here doing nothing makes threads hang if they're blocked in wait_for_trigger_cycle_to_finish(). We must notify these waiters that we finished the cycle.
The steps that lead to the situation are:

  1. th1: wait_for_trigger_cycle_to_finish()
  2. th2: stop_async_update()

th1 could hang or crash (if stop_async_update() has been called from ~AsyncFunctionHandler())

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

// Async related variables
std::thread thread_;
std::atomic_bool async_update_stop_{false};
std::atomic_bool trigger_in_progress_{false};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance reasons, now you can remove atomic from trigger_in_progress_

@@ -125,7 +127,7 @@ TEST_F(AsyncFunctionHandlerTest, check_triggering)
ASSERT_TRUE(async_class.get_handler().is_initialized());
ASSERT_TRUE(async_class.get_handler().is_running());
ASSERT_FALSE(async_class.get_handler().is_stopped());
async_class.get_handler().wait_for_trigger_cycle_to_finish();
std::this_thread::sleep_for(std::chrono::microseconds(1));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally concurrency tests with random delays are suspicious.

So... why did you change that? I guess you observed a deadlock here.

If this is the case, either wait_for_trigger_cycle_to_finish() needs some changes or we need to clarify its behavior (and change the test accordingly).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add this 1 microsecond delay in order to run the update cycle. If not, it immediately aborts the update cycle and fails in the assertion of the counter being 2

if (async_update_stop_)
{
trigger_in_progress_ = false;
cycle_end_condition_.notify_one();
break;
}

ASSERT_TRUE(async_class.get_handler().is_initialized());
ASSERT_TRUE(async_class.get_handler().is_running());
ASSERT_FALSE(async_class.get_handler().is_stopped());
async_class.get_handler().wait_for_trigger_cycle_to_finish();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential deadlock due to lost wake up, there is no warranty that this thread is awaiting before anyc_class.thread_ is signaling.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought so, but open reading around, the conditional variable even before waiting., it checks if the predicate is true or false, if it is true, it continues directly without any signal. If it is false, then it starts to wait for the wakeup signal

I've just testing adding 200 ms delay before the wait and the tests pass (no deadlock)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding-delays-until-it-works(TM) is a bad approach because it relays heavily on uncontrolled conditions as computer load, OS (dynamic) scheduler policies, etc.

Basically your tests suffer of UB

controller_interface/test/test_async_function_handler.cpp Outdated Show resolved Hide resolved
ASSERT_TRUE(async_class.get_handler().is_initialized());
ASSERT_TRUE(async_class.get_handler().is_running());
ASSERT_FALSE(async_class.get_handler().is_stopped());
async_class.get_handler().wait_for_trigger_cycle_to_finish();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same potential deadlock between trigger() and wait_for_trigger_cycle_to_finish()

Please add some delay in between to check my assumption

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atzaros I added like 50 ms delay right after the trigger, and the tests seem to pass

}
async_class.get_handler().stop_async_update();

// now the async update should be preempted
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean with preemption?

*/
void join_async_update_thread()
{
if (is_running())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling is_running() and join() from different threads could incur in a data race over thread_'s state, for instance calling wait_for_trigger_cycle_to_finish() and join_async_update_thread() concurrently.

I would document clearly that all class' functions using is_running() and join() cannot be used from different threads concurrently.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean to say when we have more then 2 threads involved?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Comment on lines 132 to 139
void wait_for_trigger_cycle_to_finish()
{
if (is_running())
{
std::unique_lock<std::mutex> lock(async_mtx_);
cycle_end_condition_.wait(lock, [this] { return !trigger_in_progress_; });
}
}
Copy link

@atzaros atzaros Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now reading at your tests I understand you want to use this function for detecting progress after a trigger has been issued.

I think what you need here is either a heartbeat [1] or just change your approach in tests using synchronization primitives instead of a plain counter

[1] it means some refactoring around trigger() and wait_for_trigger_cycle_to_finish(), the former returning the current heartbeat and the latter checking for any increase

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synchronization primitives sounds like a better idea right? What would you say?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synch primitives in the tests.
If you walk that way, I would remove wait_for_trigger_cycle_to_finish() from the API. It is harmful and, I think, it has been created just because you needed it for testing.

Having a measure of progress in the async's is useful for the controller manager to diagnose abnormal situations but it's a nice-to-have.

saikishor and others added 21 commits September 26, 2024 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants