Flux Emulator Revival Questions #6466

Open
washwor1 opened this issue Dec 2, 2024 · 4 comments

Comments


washwor1 commented Dec 2, 2024

Hello, I am currently working on getting the Flux emulator (originally simulator) from @SteVwonder (#2561) up and running with the latest version of Flux core. I got the code working with the old Flux core version and am now working on merging the code into the newest version of core. While I am doing that, I figured I would ask a few questions. I discussed these issues with @grondo and @wihobbs at the Flux coffee hour on Nov 22, and they suggested opening an issue so @garlick and others could weigh in. Here are the questions:

  1. Looking at the code from [WIP] Flux simulator #2561, are there any sections that look like they will need to be completely rewritten because of changes in Flux since the original development? From what I can tell, most things are fairly decoupled and only minor modifications should be needed. However, I am fairly new to Flux, so maybe I am missing something.

  2. The original code is missing the ability to handle jobs that are unsatisfiable (Line 259 in flux_simulator.py hangs). What would be the recommended method/tool to implement this? At the coffee hour meeting, it was recommended that I either use a jobtap plugin or wait on cancel exceptions through RPC (or a combination of the two).

  3. Are there any other important features that would be useful to implement for users of the finished emulator? I currently want to expand the post-sim analysis a bit and make sure that job timeouts work properly.
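To make the "wait on exceptions" option in question 2 concrete, here is a minimal sketch of the decision logic only. It assumes the emulator has already obtained a job's eventlog entries as dictionaries with `name` and `context` keys, and that a denied allocation shows up as an `exception` event with severity 0 (fatal). The sample logs, the `alloc` exception type, the note text, and the function names are illustrative assumptions, not taken from the simulator branch. In a real setup the events would come from Flux itself rather than a hard-coded list.

```python
from typing import Iterable, Mapping, Optional


def fatal_exception(events: Iterable[Mapping]) -> Optional[Mapping]:
    """Return the context of the first fatal exception event, or None.

    Assumption: severity 0 marks a fatal exception.
    """
    for event in events:
        if event.get("name") != "exception":
            continue
        context = event.get("context", {})
        if context.get("severity") == 0:
            return context
    return None


def job_outcome(events: Iterable[Mapping]) -> str:
    """Classify a job so the emulator can stop waiting on it."""
    events = list(events)
    exc = fatal_exception(events)
    if exc is not None:
        # The job will never start; do not wait for a "start" event.
        return f"failed:{exc.get('type', 'unknown')}"
    if any(event.get("name") == "clean" for event in events):
        return "completed"
    return "pending"


# Hypothetical eventlogs, hand-written for illustration.
unsat_log = [
    {"timestamp": 0.0, "name": "submit", "context": {}},
    {"timestamp": 0.1, "name": "depend", "context": {}},
    {
        "timestamp": 0.2,
        "name": "exception",
        "context": {"type": "alloc", "severity": 0, "note": "unsatisfiable request"},
    },
]
ok_log = [
    {"timestamp": 0.0, "name": "submit", "context": {}},
    {"timestamp": 0.3, "name": "start", "context": {}},
    {"timestamp": 9.0, "name": "finish", "context": {"status": 0}},
    {"timestamp": 9.1, "name": "clean", "context": {}},
]

unsat_result = job_outcome(unsat_log)
ok_result = job_outcome(ok_log)
```

With a classification like this, the emulator loop could drop an unsatisfiable job from its set of outstanding jobs instead of blocking on it indefinitely.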

Thank you for the advice.

Member

garlick commented Dec 2, 2024

Great that you're trying to push this forward! The simulator has a lot of exciting possibilities for enabling scheduling research (the Slurm simulator seems to make regular appearances) and helping us understand and improve Flux's scheduler, write test cases, etc.

Have you gotten the old branch working, and are you able to run simulations? Doing this, and maybe taking a stab at a draft description of how it currently works, might be a helpful starting point for reviewing the approach in the context of today's Flux.

It will probably be a bit annoying to forward port after 5 years of flux development, but I'm not sure anything substantial has really changed in the interfaces between the job manager, the exec system, and the scheduler. There will be lots of little changes though.

Member

trws commented Dec 3, 2024

One thing I recall we discussed when this was getting started is that this code comes after the work to port the simulator to the new exec system, so it may not be too bad. I imagine a lot of it is going to be things like handling unsatisfiable jobs or other states that didn't exist yet but need to be factored in.

As for 3, I'd say probably yes but don't worry about that yet. We'll have to see how the whole thing ties together to get an idea of what "just works" because of how it's implemented and what we'll want to be able to tweak.

Member

trws commented Dec 3, 2024

Looking it over, if the calls from job-manager can become a jobtap plugin (not sure, but it looks like it at first glance), that would definitely help. The main thing that might need some thought is how to define "busy" and "quiescent" callbacks in Fluxion after everything went asynchronous. It's not quite as easy as it used to be (we used to just process everything in order, so when the callback ran, it was time), but now it will have to detect when the sched loop has no further work to do. That's actually possible, since it puts the event loop to sleep in that case until something comes in, but it's a bit more work. There are also some strange states we didn't really have before, like jobs that are satisfiable but not reservable.
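As a rough illustration of the "busy"/"quiescent" idea, the sketch below shows a toy discrete-event clock that refuses to jump to the next simulated event while any scheduler work is still in flight. It is not Fluxion's or the simulator's actual mechanism: the class, its method names, and the counter-based quiescence check are all hypothetical, and a real implementation would need the scheduler itself to report when its loop has gone idle.

```python
import heapq


class EmulatorClock:
    """Toy simulated clock that only advances when the scheduler is idle."""

    def __init__(self):
        self.now = 0.0
        self._events = []      # min-heap of (time, sequence, label)
        self._seq = 0          # tie-breaker so equal times stay ordered
        self.outstanding = 0   # scheduler work items currently in flight

    def schedule(self, time, label):
        """Queue a simulated event (e.g. a job submission or completion)."""
        heapq.heappush(self._events, (time, self._seq, label))
        self._seq += 1

    def work_started(self):
        """'Busy' callback: the scheduler picked up some work."""
        self.outstanding += 1

    def work_finished(self):
        """Counterpart to work_started; quiescent once the count hits zero."""
        if self.outstanding == 0:
            raise RuntimeError("work_finished() without matching work_started()")
        self.outstanding -= 1

    def is_quiescent(self):
        return self.outstanding == 0

    def advance(self):
        """Jump to the next event if quiescent; return its label, else None."""
        if not self.is_quiescent() or not self._events:
            return None
        time, _, label = heapq.heappop(self._events)
        self.now = time
        return label


clock = EmulatorClock()
clock.schedule(5.0, "job-1 completes")
clock.schedule(2.0, "job-2 submitted")

clock.work_started()           # scheduler is busy with an earlier request
blocked = clock.advance()      # clock must not move yet
clock.work_finished()          # scheduler has nothing left to do
first = clock.advance()        # now the earliest event can fire
```

The point of the design is that simulated time is driven by the scheduler going idle rather than by callback order, which is the property that became harder to get once processing was no longer strictly in order.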

Author

washwor1 commented Dec 5, 2024

Thank you for the responses.

@garlick I was able to get the old branch up and running with the original sharness tests as test cases:

  1. malformed input
  2. single node w/ 10 jobs
  3. 10 Nodes w/ 10 jobs

All of these seem to produce the intended behavior. And yes, that matches what I've observed while working through the code: there don't seem to be any major surprises in updating this code as-is 🎉. I'm also currently working on materials describing how the emulator functions, which, as you mentioned, will be a very helpful resource. When I'm done with that, I can send it to you for review if you would like.

@trws Taking that into consideration, my plan is to first update the emulator with its functionality as-is. Then, I will work on designing the changes required to support missing functionality such as unsatisfiable jobs and other states. At that point, I will also work on reproducing old work done with the previous emulator design (PRIONN/CanarIO) to see if there are any gaps in functionality. With that information, we could take an informed approach towards major changes (jobtap, etc.) and see how those work out. The emulator also does not currently work with Fluxion at all, so that will be something to get running. Currently it is using sched-simple.

As for my development timeline, it's a bit stalled right now because of SC, the Thanksgiving holiday, and class finals, but once my classes are finished, I'll be able to get some significant work done and report back on how everything is going.
