Resource: support partial cancel of resources external to broker ranks #1292
Conversation
Confirmed it makes the error messages I was seeing go away!
@jameshcorbett can I set MWP, or do we need a test case to check for this in Fluxion? It would be good to know if the partial cancel behaves as expected when encountering this issue.
Yeah, it would be good to have a test case somehow. I put this PR in flux-sched v0.38.0 via a patch, so I don't think there's as much hurry to merge it. I'm also working on a test case in the flux-coral2 environment, but it's not working quite right yet and the tests are slow, so it's taking a while. I can provide a graph and jobspec that will hit the issue, but I don't know about the pattern of partial-cancel RPCs.
After more thought, I think we need to add a testsuite check for this issue.
Some of my flux-coral2 tests are suggesting to me that the rabbit resources aren't freed, even though the error message goes away. I submitted a bunch of identical rabbit jobs back-to-back; the first couple go through and then one gets stuck in SCHED, which is what I expect to happen when fluxion thinks all of the ssds are still allocated.
Let's make a reproducer similar to this one: https://github.com/flux-framework/flux-sched/blob/master/t/issues/t1129-match-overflow.sh What are the scheduler and queue policies set to in the coral2 tests? Edit: I just noticed this PR: flux-framework/flux-coral2#208, but it would be good to have a test in sched, too.
Dismissing the first review to make sure this doesn't get merged accidentally. I'll re-request once we understand the behavior better.
Whatever the defaults are I think, there's no configuration done.
Thanks for the pointer, I should be able to look into this tomorrow!
@jameshcorbett this PR is almost ready for review. First, I'd like to integrate the tests you wrote that reproduced the behavior and helped me fix it. Can you make a PR to my fork on the issue-1284 branch (https://github.com/milroy/flux-sched/tree/issue-1284)? Alternatively, I can cherry-pick your commits and push them to my fork.
Working on that now. |
Could you refactor the tests and put them under …?
(force-pushed from 2abea01 to e155693)
Looks reasonable, a couple of questions and comments.
// but commonly indicates partial cancel didn't clean up resources external
// to a broker rank (e.g., ssds).
flux_log (flux_h,
          LOG_WARNING,
Does this mean `flux dmesg` is going to fill up with these warnings on rabbit systems? (Not sure what our log levels are commonly set to.)
Yes, unfortunately I think it does. Open issue #1309 has some pros and cons of warning vs. error messages. The potentially large number of messages in `flux dmesg` is a good argument for making this a `DEBUG` message.
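For reference, a minimal sketch of what the suggested downgrade could look like, assuming flux-core's `flux_log` and the standard syslog levels; the wrapper function and message text are illustrative, not code from this PR:

```cpp
#include <flux/core.h>   // flux_t, flux_log
#include <syslog.h>      // LOG_DEBUG

// Hypothetical wrapper: emit the partial-cancel notice at debug level so it
// does not flood `flux dmesg` on rabbit systems.
static void log_brokerless_skip (flux_t *flux_h)
{
    flux_log (flux_h,
              LOG_DEBUG,
              "partial cancel did not clean up resources external to a "
              "broker rank (e.g., ssds); deferring to the final .free RPC");
}
```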
Agreed, yeah, I think clogging the `dmesg` logs would be pretty bad :/
} else {
    // Add to brokerless resource set to execute in follow-up
    // vertex cancel
    if ((*m_graph)[u].type == ssd_rt)
Does this force ssds to always be a brokerless resource? That is, could there never be `ssd` resources that belong to brokers (as they do on Sierra, I think)?
If this condition is met, that means the vertex has an allocation for the jobid in question, but it is not one of the vertices canceled via the JGF reader (experimental) or in the broker ranks specified in the partial-release `.free` RPC. I do wonder if we want to drop the check for `(*m_graph)[u].type == ssd_rt`. Will there be other resource types not attached to a broker rank, and if so, do we just want to cancel them no matter the type?
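If the type check were dropped, one illustrative generalization would be to treat any allocated vertex that lacks a broker rank as brokerless. The stand-in struct and the `rank == -1` sentinel below are assumptions for the sketch, not Fluxion's actual data structures:

```cpp
#include <cstdint>
#include <string>

// Stand-in for the relevant fields of a resource vertex; Fluxion's real
// graph stores these in resource_pool_t behind Boost Graph descriptors.
struct vertex_data_t {
    std::string type;
    int64_t rank;   // assumed: -1 means the vertex is not owned by a broker rank
};

// Hypothetical generalized check: flag every allocated vertex that lacks a
// broker rank for the follow-up vertex cancel, regardless of resource type.
static bool is_brokerless (const vertex_data_t &v)
{
    return v.rank == -1;
}
```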
I think there will be, especially when we get to doing remote ephemeral Lustre; that might be a separate resource that would potentially encompass many ranks. Assuming that tweak, I also approve.
LGTM I think, also confirmed it fixes the error in flux-coral2 that flux-framework/flux-coral2#208 adds a check for (which is pretty much the same as the test added in this PR, so nothing too new).
I expect this will be the most common case in production. In terms of performance difference, I timed the cancellation in qmanager for testsuite tests 1026 and 1027, comparing cancellations with a single `.free` RPC against test 1026 with a series of four `.free` RPCs.
It sounds like the complexity favors doing the unconditional full cancel, and the performance doesn't really push away from that much either. Would you agree @milroy? If so, then let's do that. I have a feeling we should keep this in mind as we think through what we want to do when evolving the planner setup.
Yes, that's my conclusion, too.
Agreed. There's extra complexity and brittleness related to dealing with the planners in the context of allocation and resource dynamism, and it manifests in unexpected ways.
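As a control-flow sketch of the agreed-upon approach, every `.free` RPC drives a partial cancel for the broker ranks it names, and the final `.free` is followed by an unconditional full cancel so brokerless resources are released too. All function and parameter names below are stand-ins, not Fluxion's or flux-core's actual APIs:

```cpp
#include <cstdint>
#include <set>

// Stubs standing in for the scheduler's cancel entry points (not the real
// API); they simply succeed so the control flow below compiles.
static int partial_cancel (int64_t, const std::set<int64_t> &) { return 0; }
static int full_cancel (int64_t) { return 0; }

// Each .free RPC triggers a partial cancel for the broker ranks it names;
// the final .free is followed by a cleanup full cancel so brokerless
// resources (e.g., rack-local ssds) are released as well.
static int handle_free (int64_t jobid, const std::set<int64_t> &ranks, bool final_free)
{
    if (partial_cancel (jobid, ranks) < 0)
        return -1;
    if (final_free)
        return full_cancel (jobid);   // cleanup cancel for remaining vertices
    return 0;
}
```

The trade-off discussed above is that the extra full cancel is slightly slower for a sequence of `.free` RPCs but avoids the pruning-filter edge cases.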
(force-pushed from 6ec98dc to 1a3844a)
Looks like there's a failure on el8, specifically failing to find an error output that's there? Maybe it's going to stdout instead of stderr and getting missed? Not sure; it seems odd that it's only happening there and not on the other builders, though. Is this otherwise ready to go? Trying to plan my reviews etc. for the day.
Unfortunately this looks like an error in the job removal. I don't yet know why it's only happening on el8.
Beyond the CI error you identified, I also need to figure out how to test an error condition related to running the full cleanup cancel after a partial cancel. So no, unfortunately this isn't ready to go yet.
(force-pushed from f015e99 to 0b16501)
(force-pushed from d20775d to c229143)
I should add that I don't think the PR needs additional testsuite tests for whether a planner_multi sub span was removed by a prior partial cancel. The change is very small and should have been included in the first partial cancel PR. |
Problem: the partial cancel traversal generates an error when encountering a vertex with an allocation for the jobid in the traversal that doesn't correspond to a vertex in a broker rank being cancelled. Skip the error because the allocations will get cleaned up upon receipt of the final .free RPC.
Problem: a partial cancel successfully removes the allocations of the other resource vertices (especially core, which is installed in all pruning filters by default) because they have broker ranks. However, when the final .free RPC fails to remove an ssd vertex allocation the full cleanup cancel exits with an error when it hits the vertices it has already cancelled. Add support for only removing the planner span if the span ID value indicates it is a valid existing span.
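A rough illustration of the guard described in the commit message above, assuming flux-sched's planner C API `planner_rem_span (planner_t *, int64_t)`; the `-1` sentinel and the helper name are illustrative, not the exact code in this PR:

```cpp
#include <cstdint>
// Assumes flux-sched's planner C API header (planner.h), which declares
// planner_t and planner_rem_span (planner_t *, int64_t).
#include "planner.h"

// Only remove the planner span when the stored span ID refers to an existing
// span; a prior partial cancel may already have removed it.
static int remove_span_if_present (planner_t *planner, int64_t span_id)
{
    if (span_id == -1)
        return 0;   // span already removed by a prior partial cancel
    return planner_rem_span (planner, span_id);
}
```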
Problem: the final .free RPC does not free brokerless resources (e.g., rack-local ssds) with the current implementation of partial cancel. Add a full, cleanup cancel upon receipt of the final .free RPC. While exhibiting slightly lower performance for a sequence of `.free` RPCs than supporting brokerless resource release in partial cancel, the full cancel is not subject to errors under various pruning filter configurations. Handling and preventing the edge-case errors would introduce significant complexity into the traverser and planner, and require updates to REAPI. We can revisit that implementation in the future if required by performance needs.
Problem: there are no tests for issue flux-framework#1284. Add two tests: one with a jobspec that uses a top-level 'slot' and one that does not.
Problem: the current tests for flux-framework#1284 do not check whether partial cancel behaves as desired with ssd pruning filters. Add the tests with the ssd pruning filters at all ancestor graph vertices.
Looks good to me @milroy, thanks for all the work on this!
Thanks for the review! Setting MWP. |
Codecov Report — Attention: patch coverage is …

Additional details and impacted files:

@@            Coverage Diff            @@
##           master    #1292     +/-   ##
=========================================
- Coverage    75.3%    75.2%     -0.2%
=========================================
  Files         111      111
  Lines       15979    15983        +4
=========================================
- Hits        12048    12032       -16
- Misses       3931     3951       +20
Issue #1284 identified a problem where rabbits are not released due to a traverser error during partial cancellation. The traverser should skip the rest of the `mod_plan` function when an allocation is found and `mod_data.mod_type == job_modify_t::PARTIAL_CANCEL`. This PR originally added a `goto` statement to return 0 under that circumstance, but it has since been significantly updated in scope to reflect a more comprehensive understanding of the problem.

This line is causing some of the errors reported in the related issues:

flux-sched/resource/traversers/dfu_impl_update.cpp, line 436 in 996f999

The error (`rc != 0`) occurs because a partial cancel successfully removes the allocations of the other resource vertices (especially `core`, which is installed in all pruning filters by default) because they have broker ranks. However, when the final `.free` RPC fails to remove an `ssd` vertex allocation, the full cleanup cancel exits with an error when it hits the vertices it has already cancelled. This PR adds support for a cleanup `cancel` after partial cancel that skips the inapplicable error check for non-existent `planner` spans in the error criterion for removal of `planner_multi` spans.

Two related problems needed to be solved: handling partial cancel for brokerless resources when default pruning filters are set (`ALL:core`) and when pruning filters are set for the resources excluded from broker ranks (e.g., `ALL:ssd`). In preliminary testing, supporting both was challenging because with `ALL:core` configured, the final `.free` RPC frees all `planner_multi`-tracked resources, which prevents a cleanup, full `cancel`. However, tracking additional resources (e.g., `ALL:ssd`) successfully removes resource allocations on those vertices only with a cleanup cancel.

This PR adds support for rank-based partial cancel with resources that don't have a rank with the rv1_nosched match format.
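For context, a Fluxion configuration along the lines discussed above might look like the following; the key names are assumed to follow flux-config-sched-fluxion-resource(5), and the values are illustrative rather than a recommendation:

```toml
# Assumed Fluxion configuration: default core pruning filter plus an ssd
# filter for resources that live outside broker ranks, with rv1_nosched.
[sched-fluxion-resource]
match-format = "rv1_nosched"
prune-filters = "ALL:core,ALL:ssd"
```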
Updates: after further investigation, issue #1305 is related as well. This PR also aims to address issue #1309.