
Resource: support partial cancel of resources external to broker ranks #1292

Merged: 5 commits merged into flux-framework:master on Nov 2, 2024

Conversation

@milroy (Member) commented Sep 5, 2024

Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the mod_plan function when an allocation is found and mod_data.mod_type == job_modify_t::PARTIAL_CANCEL. This PR adds a goto statement to return 0 under this circumstance. The PR has since been significantly expanded in scope to reflect a more comprehensive understanding of the problem.
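As a rough, self-contained sketch of that skip (mod_data and job_modify_t are named in the description above; the function signature and types here are simplified stand-ins, not the actual Fluxion traverser code):

```cpp
#include <cstdint>
#include <map>

// Simplified stand-ins for the traverser types; the real definitions in the
// Fluxion traverser differ in detail.
enum class job_modify_t { CANCEL, PARTIAL_CANCEL };
struct mod_data_t { job_modify_t mod_type; };

// Sketch of the early exit: when a partial cancel encounters a vertex that
// holds an allocation for the jobid, skip the planner bookkeeping and return
// success; the allocation is cleaned up when the final .free RPC arrives.
static int mod_plan (const std::map<uint64_t, int64_t> &vertex_allocs,
                     uint64_t jobid,
                     const mod_data_t &mod_data)
{
    int rc = 0;
    if (vertex_allocs.count (jobid)
        && mod_data.mod_type == job_modify_t::PARTIAL_CANCEL)
        goto done;  // allocation found: nothing to modify for this vertex

    // ... span removal and other plan updates would happen here ...

done:
    return rc;
}
```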

This line is causing some of the errors reported in the related issues:

if ((rc = planner_multi_rem_span (subtree_plan, span_it->second)) != 0) {
That error condition (rc != 0) occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially core, which is installed in all pruning filters by default), since those vertices have broker ranks. When the final .free RPC then fails to remove an ssd vertex allocation, the full cleanup cancel exits with an error when it reaches the vertices it has already cancelled.

This PR adds support for a cleanup cancel after a partial cancel: when removing planner_multi spans, the error criterion no longer treats a non-existent planner span as an error, since that check does not apply in this case.
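A minimal sketch of that relaxed error criterion, assuming a hypothetical span table in place of the real planner_multi state (the real check guards the planner_multi_rem_span call quoted above):

```cpp
#include <cstdint>
#include <map>

// Hypothetical span table keyed by vertex id; stands in for the
// planner_multi-tracked spans on a subtree plan.
using span_map_t = std::map<int64_t, int64_t>;

// Sketch: during the cleanup cancel that follows a partial cancel, a missing
// span simply means the partial cancel already removed it, so it is skipped
// rather than treated as an error.
static int remove_span (span_map_t &spans, int64_t vertex_id,
                        bool post_partial_cancel)
{
    auto span_it = spans.find (vertex_id);
    if (span_it == spans.end ())
        return post_partial_cancel ? 0 : -1;  // only an error for a plain cancel
    spans.erase (span_it);  // stands in for planner_multi_rem_span ()
    return 0;
}
```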

Two related problems needed to be solved: handling partial cancel for brokerless resources when only the default pruning filter is set (ALL:core), and when pruning filters are also set for the resources excluded from broker ranks (e.g., ALL:ssd). In preliminary testing, supporting both was challenging: with ALL:core configured, the final .free RPC frees all planner_multi-tracked resources, which prevents a cleanup (full) cancel, whereas tracking additional resources (e.g., ALL:ssd) removes the allocations on those vertices only with a cleanup cancel.

This PR adds support for rank-based partial cancel of resources that do not have a broker rank when using the rv1_nosched match format.

Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.

@jameshcorbett (Member) previously approved these changes Sep 5, 2024 and left a comment


Confirmed it makes the error messages I was seeing go away!

@milroy (Member, Author) commented Sep 5, 2024

@jameshcorbett can I set MWP or do we need a test case to check for this in Fluxion?

It would be good to know if the partial cancel behaves as expected when encountering this issue.

@jameshcorbett (Member)

Yeah, it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch, so I don't think there's as much hurry to merge it. I'm also working on a test case in the flux-coral2 environment, but it's not working quite right yet and the tests are slow, so it's taking a while.

I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.

@milroy (Member, Author) commented Sep 5, 2024

After more thought, I think we need to add a testsuite check for this issue.

@jameshcorbett (Member)

Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back and the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssd resources are allocated.

@milroy (Member, Author) commented Sep 5, 2024

> Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away.

Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: flux-framework/flux-coral2#208, but it would be good to have a test in sched, too.

@milroy dismissed jameshcorbett’s stale review September 5, 2024 17:59

Dismissing the first review to make sure this doesn't get merged accidentally. I'll re-request once we understand the behavior better.

@jameshcorbett (Member)

> What are the scheduler and queue policies set to in the coral2 tests?

Whatever the defaults are, I think; there's no configuration done.

> Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh

Thanks for the pointer, I should be able to look into this tomorrow!

@milroy changed the title from "Traverser: add valid return in mod_plan for partial cancel when allocation exists" to "Resource: support partial cancel of resources external to broker ranks" on Oct 17, 2024
@milroy (Member, Author) commented Oct 17, 2024

@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?

Alternatively I can cherry pick your commits and push them to my fork.

@jameshcorbett (Member)

> @jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)?
>
> Alternatively I can cherry pick your commits and push them to my fork.

Working on that now.

@milroy (Member, Author) commented Oct 17, 2024

Could you refactor the tests and put them under t/issues/ so the issue test driver script executes them?

@milroy force-pushed the issue-1284 branch 4 times, most recently from 2abea01 to e155693, on October 18, 2024 01:31
@jameshcorbett (Member) left a comment


Looks reasonable, a couple of questions and comments.

qmanager/policies/base/queue_policy_base.hpp (outdated review thread, resolved)
// but commonly indicates partial cancel didn't clean up resources external
// to a broker rank (e.g., ssds).
flux_log (flux_h,
          LOG_WARNING,
Member

Does this mean flux dmesg is going to fill up with these warnings on rabbit systems? (not sure what our log levels are commonly set to)

Member Author

Yes, unfortunately I think it does. Open issue #1309 discusses some pros and cons of warning vs. error messages. The potentially large number of messages in flux dmesg is a good argument for making this a DEBUG message.

Member

Agreed yeah I think clogging the dmesg logs would be pretty bad :/
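For illustration, the demotion being discussed would amount to switching the severity passed to flux_log from LOG_WARNING to LOG_DEBUG; the wrapper function and message text below are hypothetical, and only the flux_log call pattern comes from the excerpt above:

```cpp
#include <flux/core.h>
#include <syslog.h>

// Hypothetical helper: log the brokerless-cleanup note at DEBUG severity so
// routine partial-cancel cleanup does not clog flux dmesg. The message text
// is illustrative, not the exact string used in the PR.
static void log_cleanup_note (flux_t *flux_h, const char *jobid_str)
{
    flux_log (flux_h,
              LOG_DEBUG,
              "partial cancel did not clean up brokerless resources for job %s; "
              "falling back to full cleanup cancel",
              jobid_str);
}
```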

} else {
    // Add to brokerless resource set to execute in follow-up
    // vertex cancel
    if ((*m_graph)[u].type == ssd_rt)
Member

Does this force ssds to be a brokerless resource always? Like there could never be ssd resources that belong to brokers (as they do on Sierra I think)?

@milroy (Member, Author), Oct 18, 2024

If this condition is met that means the vertex has an allocation for the jobid in question but it is not one of the vertices canceled via the JGF reader (experimental) or in the broker ranks specified in the partial release .free RPC.

I do wonder if we want to drop the check for (*m_graph)[u].type == ssd_rt. Will there be other resource types not attached to a broker rank, and if so do we just want to cancel them no matter the type?

Member

I think there will be. Especially when we go to doing remote ephemeral Lustre that might be a separate resource that would potentially encompass many ranks. Assuming that tweak, I also approve.
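A sketch of the tweak being discussed: queue any allocated vertex outside the ranks named in the partial-release .free RPC for the follow-up cancel, regardless of its resource type. The vertex layout and the -1 "no rank" sentinel are assumptions for illustration, not the actual resource graph types:

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Illustrative vertex record; the real Fluxion resource graph vertex carries
// far more state. Brokerless vertices are assumed here to carry a rank of -1.
struct vertex_t {
    int64_t rank;       // broker rank, or -1 if external to all broker ranks
    bool has_job_alloc; // allocation present for the jobid being cancelled
};

// Collect vertices for the follow-up cancel without special-casing ssds:
// anything allocated to the job but not covered by the ranks in the .free RPC
// gets queued, whatever its type (ssd, remote Lustre, etc.).
static void collect_external (const std::vector<vertex_t> &graph,
                              const std::set<int64_t> &ranks_to_free,
                              std::vector<std::size_t> &to_cancel)
{
    for (std::size_t u = 0; u < graph.size (); ++u)
        if (graph[u].has_job_alloc && !ranks_to_free.count (graph[u].rank))
            to_cancel.push_back (u);
}
```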

@jameshcorbett (Member) left a comment

LGTM I think, also confirmed it fixes the error in flux-coral2 that flux-framework/flux-coral2#208 adds a check for (which is pretty much the same as the test added in this PR, so nothing too new).


@milroy (Member, Author) commented Oct 24, 2024

The exception appears to be when processing more than 1 free RPC for a single jobid.

I expect this will be the most common case in production.

In terms of performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027. For cancellations with a single .free RPC (test 1027), the full cleanup cancel takes 102.3 microseconds on average (20 runs), compared with 121.8 microseconds for the implementation that propagates the free state (a 19% difference).

For test 1026, with a series of four .free RPCs, the average (across 20 runs) sum of the four cancels is 495.2 microseconds for the full cleanup cancel, compared with 424.3 microseconds for the implementation that propagates the free state (a 14% difference).

@trws (Member) commented Oct 24, 2024

It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

@milroy (Member, Author) commented Oct 24, 2024

> Would you agree @milroy?

Yes, that's my conclusion, too.

> I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.

Agreed. There's extra complexity and brittleness related to dealing with the planners in the context of allocation and resource dynamism and it manifests in unexpected ways.
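For reference, a minimal sketch of the approach agreed on here, with made-up helpers (partial_cancel, full_cancel) standing in for the real qmanager and traverser interfaces: every .free RPC drives a rank-based partial cancel, and the final one additionally drives an unconditional cleanup cancel so brokerless resources (e.g., rack-local ssds) are released:

```cpp
#include <cstdint>
#include <cstdio>

// Made-up helpers for illustration only; in Fluxion the real work happens in
// the resource module's traverser, driven from qmanager.
static void partial_cancel (uint64_t jobid)
{
    std::printf ("partial cancel for job %ju\n", (uintmax_t)jobid);
}
static void full_cancel (uint64_t jobid)
{
    std::printf ("cleanup (full) cancel for job %ju\n", (uintmax_t)jobid);
}

// Each .free RPC releases the resources on its broker ranks; the RPC carrying
// the final flag also sweeps up anything external to broker ranks with one
// unconditional full cancel.
static void handle_free_rpc (uint64_t jobid, bool final_rpc)
{
    partial_cancel (jobid);
    if (final_rpc)
        full_cancel (jobid);
}

int main ()
{
    // e.g., a job freed with a series of four .free RPCs, as in testsuite
    // test 1026 mentioned above (jobid is arbitrary here).
    for (int i = 0; i < 4; ++i)
        handle_free_rpc (1234, i == 3);
    return 0;
}
```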

@milroy force-pushed the issue-1284 branch 2 times, most recently from 6ec98dc to 1a3844a, on October 24, 2024 23:32
@trws (Member) commented Oct 28, 2024

Looks like there's a failure on el8, specifically failing to find an error output that's there?

  [Oct24 23:36] sched-fluxion-qmanager[0]: remove: Final .free RPC failed to remove all resources for jobid 88030052352: Success
  test_must_fail: command succeeded: grep free RPC failed to remove all resources log.out
  not ok 8 - no fluxion errors logged

Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

@milroy (Member, Author) commented Oct 30, 2024

> Maybe it's going to stdout instead of stderr and getting missed? Not sure, seems odd that it's only there and not on the other ones though.

Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

> Is this otherwise ready to go? Trying to plan my reviews etc. for the day.

Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.

@milroy (Member, Author) commented Oct 30, 2024

> Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.

The final flag on the .free RPC isn't getting set for t1027 on el8 for some reason.

@milroy force-pushed the issue-1284 branch 3 times, most recently from f015e99 to 0b16501, on October 31, 2024 00:16
@milroy (Member, Author) commented Oct 31, 2024

@trws I thought of a much simpler and cleaner way to check whether a planner_multi sub span was removed by a prior partial cancel: a42de67

The PR is ready for a final review.

@milroy force-pushed the issue-1284 branch 4 times, most recently from d20775d to c229143, on October 31, 2024 07:20
@milroy (Member, Author) commented Oct 31, 2024

I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR.

milroy and others added 5 commits on October 31, 2024 18:21:

Problem: the partial cancel traversal generates an error when
encountering a vertex with an allocation for the jobid in the traversal
that doesn't correspond to a vertex in a broker rank being cancelled.

Skip the error because the allocations will get cleaned up upon receipt
of the final .free RPC.

Problem: a partial cancel successfully removes the allocations of the
other resource vertices (especially core, which is installed in all
pruning filters by default) because they have broker ranks. However,
when the final .free RPC fails to remove an ssd vertex allocation the
full cleanup cancel exits with an error when it hits the vertices it
has already cancelled.

Add support for only removing the planner span if the span ID value
indicates it is a valid existing span.

Problem: the final .free RPC does not free brokerless resources
(e.g., rack-local ssds) with the current implementation of partial
cancel.

Add a full, cleanup cancel upon receipt of the final .free RPC. While
exhibiting slightly lower performance for a sequence of `.free` RPCs
than supporting brokerless resource release in partial cancel, the full
cancel is not subject to errors under various pruning filter
configurations. Handling and preventing the edge-case errors will
introduce significant complexity into the traverser and planner, and
require updates to REAPI. We can revisit that implementation in the
future if required by performance needs.

Problem: there are no tests for issue flux-framework#1284.

Add two tests: one with a jobspec that uses a top-level 'slot'
and one that does not.

Problem: the current tests for
flux-framework#1284 do not check
to ensure partial cancel behaves as desired with ssd pruning filters.

Add the tests with the ssd pruning filters at all ancestor graph
vertices.
@trws (Member) commented Nov 1, 2024

Looks good to me @milroy, thanks for all the work on this!

@milroy (Member, Author) commented Nov 2, 2024

Thanks for the review! Setting MWP.

@milroy added the merge-when-passing label (mergify.io - merge PR automatically once CI passes) on Nov 2, 2024
The @mergify bot merged commit 5ac0773 into flux-framework:master on Nov 2, 2024
19 of 21 checks passed
codecov bot commented Nov 2, 2024

Codecov Report

Attention: Patch coverage is 53.33333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 75.2%. Comparing base (93589f5) to head (655dcc3).
Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| qmanager/policies/base/queue_policy_base.hpp | 45.4% | 6 Missing ⚠️ |
| resource/planner/c/planner_multi_c_interface.cpp | 66.6% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1292     +/-   ##
========================================
- Coverage    75.3%   75.2%   -0.2%     
========================================
  Files         111     111             
  Lines       15979   15983      +4     
========================================
- Hits        12048   12032     -16     
- Misses       3931    3951     +20     
| Files with missing lines | Coverage Δ |
| --- | --- |
| resource/traversers/dfu_impl_update.cpp | 75.2% <100.0%> (-1.2%) ⬇️ |
| resource/planner/c/planner_multi_c_interface.cpp | 62.6% <66.6%> (+0.1%) ⬆️ |
| qmanager/policies/base/queue_policy_base.hpp | 78.2% <45.4%> (-0.9%) ⬇️ |

... and 4 files with indirect coverage changes

Labels
merge-when-passing (mergify.io - merge PR automatically once CI passes)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Partial cancel not releasing rabbit resources (?)
3 participants