[WIP] Add co-scheduling functionality for GPUs using MPS #1317

Draft: wants to merge 4 commits into base: master

@kulnaman kulnaman commented Dec 5, 2024

This PR adds support for GPU co-scheduling directly within the scheduler framework. Previously, users had to supply their own resource (R) manually, which bypassed the scheduler's resource allocation logic. With this update:

  1. Users can specify that a job is eligible for co-scheduling.
  2. The scheduler will automatically handle resource allocation for such jobs.

Currently there is one simple co-scheduling policy (cosched), which:

  1. Prioritizes GPUs with no running jobs for resource allocation.
  2. Distributes MPS resources efficiently among jobs sharing a GPU.

The policy can be run with resource-query as follows:
resource-query -L mps_data/small.graphml -F pretty_simple --match-subsystems=CA --match-policy=cosched
resource-query> match allocate mps_data/job_1.yaml
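
The two rules of the simple policy can be sketched in isolation as follows. This is only an illustration, not the code in dfu_match_cosched_aware.cpp: `GpuState` and `pick_gpu` are hypothetical names, MPS capacity is modeled as plain integer units, and the tie-break among busy GPUs (most free units first) is an assumption that the description above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical view of one GPU; not a type from flux-sched.
struct GpuState {
    int running_jobs;    // jobs currently sharing this GPU
    int free_mps_units;  // MPS units (e.g. percent) still unallocated
};

// Pick a GPU for a co-schedulable job that needs `needed` MPS units.
// Rule 1: a GPU with no running jobs is preferred over a busy one.
// Rule 2 (assumption): among GPUs of equal idleness, the one with the
// most free MPS units is preferred.
// Returns the index of the chosen GPU, or std::nullopt if no GPU fits.
std::optional<std::size_t> pick_gpu (const std::vector<GpuState> &gpus, int needed)
{
    assert (needed > 0);
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < gpus.size (); ++i) {
        const GpuState &g = gpus[i];
        if (g.free_mps_units < needed)
            continue;  // this GPU cannot host the job at all
        if (!best) {
            best = i;  // first GPU that fits
            continue;
        }
        const GpuState &b = gpus[*best];
        const bool g_idle = (g.running_jobs == 0);
        const bool b_idle = (b.running_jobs == 0);
        if (g_idle != b_idle) {
            if (g_idle)
                best = i;  // idle beats busy
        } else if (g.free_mps_units > b.free_mps_units) {
            best = i;  // same idleness: more headroom wins
        }
    }
    return best;
}
```

With three GPUs where only the second is idle, `pick_gpu` returns index 1; when every GPU is busy it falls back to the one with the most free units, and it returns an empty optional when nothing fits.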

Pending Work:

  1. Assign an appropriate MPS percentage to each job.
  2. Communicate that percentage to flux-core.
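
One possible shape for the first pending item, as a sketch only: the PR does not say how the percentage will be chosen or passed to flux-core. The example assumes an equal split among the jobs sharing a GPU and assumes the value would eventually reach the job through NVIDIA's `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable; `mps_share_percent` and `mps_env_setting` are hypothetical helper names.

```cpp
#include <cassert>
#include <string>

// Equal-split assumption: each of `jobs_on_gpu` jobs gets the same
// share of the GPU, rounded down to a whole percent.
int mps_share_percent (int jobs_on_gpu)
{
    assert (jobs_on_gpu > 0);
    return 100 / jobs_on_gpu;
}

// Format the share as a KEY=VALUE string for the MPS environment
// variable that limits a client's active thread percentage.
std::string mps_env_setting (int percent)
{
    assert (percent > 0 && percent <= 100);
    return "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=" + std::to_string (percent);
}
```

Under this assumption, three jobs sharing a GPU would each be limited to 33 percent. A real policy might instead weight shares by what each job requests.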

kulnaman and others added 4 commits December 5, 2024 16:33
1. Adding cosched and restart flag in the jobspec system attribute.
2. Add MPS partition.
3. If no mps partition in jobspec, automatically add gpu_mps

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 39.23077% with 79 lines in your changes missing coverage. Please review.

Project coverage is 71.4%. Comparing base (6a7ecc0) to head (f90cf68).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| resource/policies/dfu_match_cosched_aware.cpp | 0.0% | 60 Missing ⚠️ |
| resource/libjobspec/jobspec.cpp | 59.3% | 13 Missing ⚠️ |
| resource/policies/dfu_match_policy_factory.cpp | 33.3% | 2 Missing ⚠️ |
| resource/traversers/dfu_impl_update.cpp | 33.3% | 2 Missing ⚠️ |
| resource/traversers/dfu.cpp | 85.7% | 1 Missing ⚠️ |
| resource/traversers/dfu_impl.cpp | 90.0% | 1 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #1317     +/-   ##
========================================
- Coverage    75.2%   71.4%   -3.8%     
========================================
  Files         111     112      +1     
  Lines       15986   16105    +119     
========================================
- Hits        12029   11510    -519     
- Misses       3957    4595    +638     

| Files with missing lines | Coverage Δ |
|---|---|
| resource/evaluators/scoring_api.cpp | 88.0% <100.0%> (+2.6%) ⬆️ |
| resource/evaluators/scoring_api.hpp | 81.2% <ø> (ø) |
| resource/libjobspec/jobspec.hpp | 100.0% <ø> (ø) |
| resource/modules/resource_match.cpp | 59.8% <ø> (-9.7%) ⬇️ |
| resource/policies/base/dfu_match_cb.cpp | 35.4% <100.0%> (+2.8%) ⬆️ |
| resource/policies/dfu_match_locality.cpp | 8.0% <ø> (ø) |
| resource/policies/dfu_match_multilevel_id.hpp | 0.0% <ø> (ø) |
| resource/policies/dfu_match_multilevel_id_impl.hpp | 81.9% <ø> (ø) |
| resource/policies/dfu_match_var_aware.cpp | 87.2% <ø> (ø) |
| resource/utilities/command.cpp | 74.4% <100.0%> (-3.2%) ⬇️ |

... and 7 more

... and 17 files with indirect coverage changes
