
Resource: support partial cancel of resources external to broker ranks #1292

Merged: 5 commits merged into flux-framework:master on Nov 2, 2024

Conversation

@milroy (Member) commented Sep 5, 2024

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. This PR adds a goto statement to return 0 under this circumstance. The PR has since been significantly expanded in scope to reflect a more comprehensive understanding of the problem.
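As a rough, self-contained sketch of that skip (mod_data and job_modify_t are named in the description above; the function signature and types here are simplified stand-ins, not the actual Fluxion traverser code):

```cpp
#include <cstdint>
#include <map>

// Simplified stand-ins for the traverser types; the real definitions in the
// Fluxion traverser differ in detail.
enum class job_modify_t { CANCEL, PARTIAL_CANCEL };
struct mod_data_t { job_modify_t mod_type; };

// Sketch of the early exit: when a partial cancel encounters a vertex that
// holds an allocation for the jobid, skip the planner bookkeeping and return
// success; the allocation is cleaned up when the final .free RPC arrives.
static int mod_plan (const std::map<uint64_t, int64_t> &vertex_allocs,
                     uint64_t jobid,
                     const mod_data_t &mod_data)
{
    int rc = 0;
    if (vertex_allocs.count (jobid)
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL)
        goto done;  // allocation found: nothing to modify for this vertex

    // ... span removal and other plan updates would happen here ...

done:
    return rc;
}
```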

This line is causing some of the errors reported in the related issues:

if ((rc = planner_multi_rem_span (subtree_plan, span_it->second)) != 0) {
That error condition (rc != 0) occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially core, which is installed in all pruning filters by default), since those vertices have broker ranks. When the final .free RPC then fails to remove an ssd vertex allocation, the full cleanup cancel exits with an error when it reaches the vertices it has already cancelled.

This PR adds support for a cleanup cancel after a partial cancel: when removing planner_multi spans, the error criterion no longer treats a non-existent planner span as an error, since that check does not apply in this case.
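A minimal sketch of that relaxed error criterion, assuming a hypothetical span table in place of the real planner_multi state (the real check guards the planner_multi_rem_span call quoted above):

```cpp
#include <cstdint>
#include <map>

// Hypothetical span table keyed by vertex id; stands in for the
// planner_multi-tracked spans on a subtree plan.
using span_map_t = std::map<int64_t, int64_t>;

// Sketch: during the cleanup cancel that follows a partial cancel, a missing
// span simply means the partial cancel already removed it, so it is skipped
// rather than treated as an error.
static int remove_span (span_map_t &spans, int64_t vertex_id,
                        bool post_partial_cancel)
{
    auto span_it = spans.find (vertex_id);
    if (span_it == spans.end ())
        return post_partial_cancel ? 0 : -1;  // only an error for a plain cancel
    spans.erase (span_it);  // stands in for planner_multi_rem_span ()
    return 0;
}
```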

Two related problems needed to be solved: handling partial cancel for brokerless resources when only the default pruning filter is set (ALL:core), and when pruning filters are also set for the resources excluded from broker ranks (e.g., ALL:ssd). In preliminary testing, supporting both was challenging: with ALL:core configured, the final .free RPC frees all planner_multi-tracked resources, which prevents a cleanup (full) cancel, whereas tracking additional resources (e.g., ALL:ssd) removes the allocations on those vertices only with a cleanup cancel.

This PR adds support for rank-based partial cancel of resources that do not have a broker rank when using the rv1_nosched match format.

Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.

@jameshcorbett (Member) previously approved these changes Sep 5, 2024 and left a comment


Confirmed it makes the error messages I was seeing go away!

@milroy (Member, Author) commented Sep 5, 2024

@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?

It would be good to know if the partial cancel behaves as expected when encountering this issue.

@jameshcorbett (Member)

Yeah, it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch, so I don't think there's as much hurry to merge it. I'm also working on a test case in the flux-coral2 environment, but it's not working quite right yet and the tests are slow, so it's taking a while.

I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.

@milroy (Member, Author) commented Sep 5, 2024

After more thought, I think we need to add a testsuite check for this issue.

@jameshcorbett (Member)

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssd resources are allocated.

@milroy (Member, Author) commented Sep 5, 2024

> Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: flux-framework/flux-coral2#208, but it would be good to have a test in sched, too.

@milroy dismissed jameshcorbett’s stale review September 5, 2024 17:59

Dismissing the first review to make sure this doesn't get merged accidentally. I'll re-request once we understand the behavior better.

@jameshcorbett (Member)

> What are the scheduler and queue policies set to in the coral2 tests?

Whatever the defaults are, I think; there's no configuration done.

> Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

Thanks for the pointer, I should be able to look into this tomorrow!

@milroy changed the title from "Traverser: add valid return in mod_plan for partial cancel when allocation exists" to "Resource: support partial cancel of resources external to broker ranks" on Oct 17, 2024
@milroy (Member, Author) commented Oct 17, 2024

@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?

Alternatively I can cherry pick your commits and push them to my fork.

@jameshcorbett (Member)

> @jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
>
> Alternatively I can cherry pick your commits and push them to my fork.

Working on that now.

@milroy (Member, Author) commented Oct 17, 2024

Could you refactor the tests and put them under t/issues/ so the issue test driver script executes them?

@milroy force-pushed the issue-1284 branch 4 times, most recently from 2abea01 to e155693, on October 18, 2024 01:31
@jameshcorbett (Member) left a comment


Looks reasonable, a couple of questions and comments.

qmanager/policies/base/queue_policy_base.hpp (outdated review thread, resolved)
// but commonly indicates partial cancel didn't clean up resources external
// to a broker rank (e.g., ssds).
flux_log (flux_h,
          LOG_WARNING,
Member

Does this mean flux dmesg is going to fill up with these warnings on rabbit systems? (not sure what our log levels are commonly set to)

Member Author

Yes, unfortunately I think it does. Open issue #1309 discusses some pros and cons of warning vs. error messages. The potentially large number of messages in flux dmesg is a good argument for making this a DEBUG message.

Member

Agreed yeah I think clogging the dmesg logs would be pretty bad :/
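For illustration, the demotion being discussed would amount to switching the severity passed to flux_log from LOG_WARNING to LOG_DEBUG; the wrapper function and message text below are hypothetical, and only the flux_log call pattern comes from the excerpt above:

```cpp
#include <flux/core.h>
#include <syslog.h>

// Hypothetical helper: log the brokerless-cleanup note at DEBUG severity so
// routine partial-cancel cleanup does not clog flux dmesg. The message text
// is illustrative, not the exact string used in the PR.
static void log_cleanup_note (flux_t *flux_h, const char *jobid_str)
{
    flux_log (flux_h,
              LOG_DEBUG,
              "partial cancel did not clean up brokerless resources for job %s; "
              "falling back to full cleanup cancel",
              jobid_str);
}
```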

} else {
    // Add to brokerless resource set to execute in follow-up
    // vertex cancel
    if ((*m_graph)[u].type == ssd_rt)
Member

Does this force ssds to be a brokerless resource always? Like there could never be ssd resources that belong to brokers (as they do on Sierra I think)?

@milroy (Member, Author), Oct 18, 2024

If this condition is met that means the vertex has an allocation for the jobid in question but it is not one of the vertices canceled via the JGF reader (experimental) or in the broker ranks specified in the partial release .free RPC.

I do wonder if we want to drop the check for (*m_graph)[u].type == ssd_rt. Will there be other resource types not attached to a broker rank, and if so do we just want to cancel them no matter the type?

Member

I think there will be. Especially when we go to doing remote ephemeral Lustre that might be a separate resource that would potentially encompass many ranks. Assuming that tweak, I also approve.
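A sketch of the tweak being discussed: queue any allocated vertex outside the ranks named in the partial-release .free RPC for the follow-up cancel, regardless of its resource type. The vertex layout and the -1 "no rank" sentinel are assumptions for illustration, not the actual resource graph types:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Illustrative vertex record; the real Fluxion resource graph vertex carries
// far more state. Brokerless vertices are assumed here to carry a rank of -1.
struct vertex_t {
    int64_t rank;       // broker rank, or -1 if external to all broker ranks
    bool has_job_alloc; // allocation present for the jobid being cancelled
};

// Collect vertices for the follow-up cancel without special-casing ssds:
// anything allocated to the job but not covered by the ranks in the .free RPC
// gets queued, whatever its type (ssd, remote Lustre, etc.).
static void collect_external (const std::vector<vertex_t> &graph,
                              const std::set<int64_t> &ranks_to_free,
                              std::vector<std::size_t> &to_cancel)
{
    for (std::size_t u = 0; u < graph.size (); ++u)
        if (graph[u].has_job_alloc && !ranks_to_free.count (graph[u].rank))
            to_cancel.push_back (u);
}
```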

@jameshcorbett (Member) left a comment

LGTM I think, also confirmed it fixes the error in flux-coral2 that flux-framework/flux-coral2#208 adds a check for (which is pretty much the same as the test added in this PR, so nothing too new).


@milroy (Member, Author) commented Oct 24, 2024

The exception appears to be when processing more than 1 free RPC for a single jobid.

I expect this will be the most common case in production.

In terms of performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027. For cancellations with a single .free RPC (test 1027), the full cleanup cancel takes 102.3 microseconds on average (20 runs), compared with 121.8 microseconds for the implementation that propagates the free state (a 19% difference).

For test 1026, with a series of four .free RPCs, the average (across 20 runs) sum of the four cancels is 495.2 microseconds for the full cleanup cancel, compared with 424.3 microseconds for the implementation that propagates the free state (a 14% difference).

@trws (Member) commented Oct 24, 2024

It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

@milroy (Member, Author) commented Oct 24, 2024

> Would you agree @milroy?

Yes, that's my conclusion, too.

> I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

Agreed. There's extra complexity and brittleness related to dealing with the planners in the context of allocation and resource dynamism and it manifests in unexpected ways.
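For reference, a minimal sketch of the approach agreed on here, with made-up helpers (partial_cancel, full_cancel) standing in for the real qmanager and traverser interfaces: every .free RPC drives a rank-based partial cancel, and the final one additionally drives an unconditional cleanup cancel so brokerless resources (e.g., rack-local ssds) are released:

```cpp
#include <cstdint>
#include <cstdio>

// Made-up helpers for illustration only; in Fluxion the real work happens in
// the resource module's traverser, driven from qmanager.
static void partial_cancel (uint64_t jobid)
{
    std::printf ("partial cancel for job %ju\n", (uintmax_t)jobid);
}
static void full_cancel (uint64_t jobid)
{
    std::printf ("cleanup (full) cancel for job %ju\n", (uintmax_t)jobid);
}

// Each .free RPC releases the resources on its broker ranks; the RPC carrying
// the final flag also sweeps up anything external to broker ranks with one
// unconditional full cancel.
static void handle_free_rpc (uint64_t jobid, bool final_rpc)
{
    partial_cancel (jobid);
    if (final_rpc)
        full_cancel (jobid);
}

int main ()
{
    // e.g., a job freed with a series of four .free RPCs, as in testsuite
    // test 1026 mentioned above (jobid is arbitrary here).
    for (int i = 0; i < 4; ++i)
        handle_free_rpc (1234, i == 3);
    return 0;
}
```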

@milroy force-pushed the issue-1284 branch 2 times, most recently from 6ec98dc to 1a3844a, on October 24, 2024 23:32
@trws (Member) commented Oct 28, 2024

Looks like there's a failure on el8, specifically failing to find an error output that's there?

  [Oct24 23:36] sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 88030052352: Success
  test_must_fail: command succeeded: grep free RPC failed to remove all resources log.out
  not ok 8 - no fluxion errors logged

Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

@milroy (Member, Author) commented Oct 30, 2024

> Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

> Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.

@milroy (Member, Author) commented Oct 30, 2024

> Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

The final flag on the .free RPC isn't getting set for t1027 on el8 for some reason.

@milroy force-pushed the issue-1284 branch 3 times, most recently from f015e99 to 0b16501, on October 31, 2024 00:16
@milroy (Member, Author) commented Oct 31, 2024

@trws I thought of a much simpler and cleaner way to check whether a planner_multi sub span was removed by a prior partial cancel: a42de67

The PR is ready for a final review.

@milroy force-pushed the issue-1284 branch 4 times, most recently from d20775d to c229143, on October 31, 2024 07:20
@milroy (Member, Author) commented Oct 31, 2024

I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR.

milroy and others added 5 commits on October 31, 2024 18:21:

Problem: the partial cancel traversal generates an error when
encountering a vertex with an allocation for the jobid in the traversal
that doesn't correspond to a vertex in a broker rank being cancelled.

Skip the error because the allocations will get cleaned up upon receipt
of the final .free RPC.

Problem: a partial cancel successfully removes the allocations of the
other resource vertices (especially core, which is installed in all
pruning filters by default) because they have broker ranks. However,
when the final .free RPC fails to remove an ssd vertex allocation the
full cleanup cancel exits with an error when it hits the vertices it
has already cancelled.

Add support for only removing the planner span if the span ID value
indicates it is a valid existing span.

Problem: the final .free RPC does not free brokerless resources
(e.g., rack-local ssds) with the current implementation of partial
cancel.

Add a full, cleanup cancel upon receipt of the final .free RPC. While
exhibiting slightly lower performance for a sequence of `.free` RPCs
than supporting brokerless resource release in partial cancel, the full
cancel is not subject to errors under various pruning filter
configurations. Handling and preventing the edge-case errors will
introduce significant complexity into the traverser and planner, and
require updates to REAPI. We can revisit that implementation in the
future if required by performance needs.

Problem: there are no tests for issue flux-framework#1284.

Add two tests: one with a jobspec that uses a top-level 'slot'
and one that does not.

Problem: the current tests for
flux-framework#1284 do not check
to ensure partial cancel behaves as desired with ssd pruning filters.

Add the tests with the ssd pruning filters at all ancestor graph
vertices.
@trws (Member) commented Nov 1, 2024

Looks good to me @milroy, thanks for all the work on this!

@milroy (Member, Author) commented Nov 2, 2024

Thanks for the review! Setting MWP.

@milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Nov 2, 2024
The @mergify bot merged commit 5ac0773 into flux-framework:master on Nov 2, 2024
19 of 21 checks passed
codecov bot commented Nov 2, 2024

Codecov Report

Attention: Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (93589f5) to head (655dcc3).
Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| qmanager/policies/base/queue_policy_base.hpp | 45.4% | 6 Missing ⚠️ |
| resource/planner/c/planner_multi_c_interface.cpp | 66.6% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1292     +/-   ##
========================================
- Coverage    75.3%   75.2%   -0.2%     
========================================
  Files         111     111             
  Lines       15979   15983      +4     
========================================
- Hits        12048   12032     -16     
- Misses       3931    3951     +20     
| Files with missing lines | Coverage Δ |
| --- | --- |
| resource/traversers/dfu_impl_update.cpp | 75.2% <100.0%> (-1.2%) ⬇️ |
| resource/planner/c/planner_multi_c_interface.cpp | 62.6% <66.6%> (+0.1%) ⬆️ |
| qmanager/policies/base/queue_policy_base.hpp | 78.2% <45.4%> (-0.9%) ⬇️ |

... and 4 files with indirect coverage changes

Labels
merge-when-passing (mergify.io - merge PR automatically once CI passes)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partial cancel not releasing rabbit resources (?)
3 participants