Flux Emulator Revival Questions #6466
Comments
Great that you're trying to push this forward! The simulator has a lot of exciting possibilities for enabling scheduling research (the slurm simulator seems to make regular appearances) and helping us understand and improve flux's scheduler, write test cases, etc. Have you gotten the old branch working, and are you able to run simulations? Doing this, and maybe taking a stab at a draft description of how it currently works, might be a helpful starting point for reviewing the approach in the context of today's Flux. It will probably be a bit annoying to forward port after 5 years of flux development, but I'm not sure anything substantial has really changed in the interfaces between the job manager, the exec system, and the scheduler. There will be lots of little changes though.
One thing I recall we discussed as this was getting started is that this came after the work to port the simulator to the new exec system, so it may not be too bad. I imagine a lot of it is going to be things like handling unsatisfiable jobs or other states that didn't exist yet but need to be factored in. As for 3, I'd say probably yes, but don't worry about that yet. We'll have to see how the whole thing ties together to get an idea of what "just works" because of how it's implemented and what we'll want to be able to tweak.
Looking it over, if the calls from job-manager can become a jobtap plugin (not sure, but it looks like it at first glance), that would definitely help. The main thing that might need some thought is how to define "busy" and "quiescent" callbacks in fluxion after everything went asynchronous. It's not quite as easy as it used to be (we used to just process everything in-order, so when the callback ran, it was time), but now it will have to detect when the sched loop has no further work to do. That's actually possible, since it puts the event loop to sleep in that case until something comes in, but it's a bit more work. There are also some strange states we didn't really have before, like there being jobs that are satisfiable but not reservable.
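To make the busy/quiescent idea above a bit more concrete, here is a minimal sketch in plain Python. It does not use any Fluxion or Flux API: the `QuiescenceTracker` class and its method names are hypothetical, and it simply assumes that "busy" means at least one outstanding work item and "quiescent" means none.

```python
from typing import Callable, List


class QuiescenceTracker:
    """Hypothetical helper: counts outstanding work and reports transitions."""

    def __init__(self, on_busy: Callable[[], None],
                 on_quiescent: Callable[[], None]) -> None:
        self._pending = 0
        self._on_busy = on_busy
        self._on_quiescent = on_quiescent

    def work_added(self) -> None:
        # Fire "busy" only on the idle -> busy transition.
        if self._pending == 0:
            self._on_busy()
        self._pending += 1

    def work_done(self) -> None:
        if self._pending == 0:
            raise RuntimeError("work_done() called with no pending work")
        self._pending -= 1
        # Fire "quiescent" only when the last outstanding item completes.
        if self._pending == 0:
            self._on_quiescent()


transitions: List[str] = []
tracker = QuiescenceTracker(
    on_busy=lambda: transitions.append("busy"),
    on_quiescent=lambda: transitions.append("quiescent"),
)

tracker.work_added()  # first request arrives: busy
tracker.work_added()  # second request: still busy, no new callback
tracker.work_done()
tracker.work_done()   # last request finished: quiescent
print(transitions)
```

In an asynchronous scheduler the hard part is deciding what counts as a pending item; the counter itself is the easy bit.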
Thank you for the responses. @garlick I was able to get the old branch up and running with the original sharness tests as test cases:
All of these seem to be producing the intended behavior. And yes, that matches what I've been able to observe while working through the code. It doesn't seem as if there are any major surprises in updating this code as-is 🎉. I'm also currently working on materials describing how the emulator functions which, as you mentioned, will be a very helpful resource. Once I'm done with that, I can send it to you for review if you would like.

@trws Taking that into consideration, my plan is to first update the emulator with its functionality as-is. Then I will work on designing the changes required to support missing functionality such as unsatisfiable jobs and other states. At that point I will also work on reproducing old work done with the previous emulator design (PRIONN/CanarIO) to see if there are any gaps in functionality. From that information, we could take an informed approach towards major changes (jobtap, etc.) and see how those work out. The emulator also does not currently work with Fluxion at all, so that will be something to get running. Currently it is using sched-simple.

As far as my development timeline goes, it's a bit stalled right now because of SC, the Thanksgiving holiday, and class finals, but once my classes are finished I'll be able to get some significant work done and report back on how everything is going.
Hello @garlick @trws. I hope you both (and anyone else reading) had a great holiday. I wanted to let you both know how things are going and ask a couple of questions about issues I am having getting the emulator to 1:1 functionality. To start, I've been able to get the emulator to run properly with a single-node configuration. Among many little changes to naming conventions and things of that nature, the major changes I made were:
As of now, there is another change that needs to be made during the post-sim stage regarding collecting job event logs. I have a couple of questions that I will put in another comment.
The first question/issue that I am a bit stuck on is the resource specification. I have rewritten it to use Rlist from the Python bindings to put R into the KVS, and then I reload the resource module and the scheduler so they pick up the fake resources. This works just fine for a single node, but it does not work when I try to add multiple nodes. I have tried several different things, but this is the code I am working with right now:
On a configuration with 30 nodes of 16 cores each, I get this output for R: As far as I can tell, this looks right, but the broker log says only 16 of 480 cores are available (node 0 only). Does anybody have a clue what the problem is here?
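For readers following along, a resource set for the 30-node, 16-core case can be sketched by hand in plain Python. This is not the code or output referred to above, and it does not use the Rlist bindings: the layout is an assumption based on the Rv1 format described in Flux RFC 20, and `make_fake_R`, `idset_count`, `total_cores`, and the `fake` hostname prefix are all hypothetical names. It is only a way to sanity-check that an R object describes 480 cores.

```python
import json


def _range(n: int) -> str:
    """Return "0" for n == 1, otherwise "0-(n-1)"."""
    return "0" if n == 1 else f"0-{n - 1}"


def make_fake_R(nnodes: int, cores_per_node: int, prefix: str = "fake") -> dict:
    """Build an Rv1-style dict by hand (layout assumed from Flux RFC 20)."""
    hosts = f"{prefix}0" if nnodes == 1 else f"{prefix}[{_range(nnodes)}]"
    return {
        "version": 1,
        "execution": {
            "R_lite": [
                {"rank": _range(nnodes),
                 "children": {"core": _range(cores_per_node)}}
            ],
            "starttime": 0,
            "expiration": 0,
            "nodelist": [hosts],
        },
    }


def idset_count(idset: str) -> int:
    """Count ids in a simple idset such as "0-29" or "0,2-3" (no strides)."""
    total = 0
    for part in idset.split(","):
        lo, _, hi = part.partition("-")
        total += int(hi or lo) - int(lo) + 1
    return total


def total_cores(R: dict) -> int:
    """Sum ranks x cores over every R_lite entry."""
    return sum(
        idset_count(entry["rank"]) * idset_count(entry["children"]["core"])
        for entry in R["execution"]["R_lite"]
    )


R = make_fake_R(nnodes=30, cores_per_node=16)
print(json.dumps(R, indent=2))
print(total_cores(R))
```

If a check like this reports the expected core count, the R object itself is probably fine and the mismatch is more likely in how the broker decides which ranks are usable.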
The second question is about the "dummy" event I mentioned. After changing the complete-job callback to use the job event journal, I ran into an issue where the emulator would exit prematurely whenever one job finished and the next job still needed to be started. The advance() function would be called before the next job was started, so that job would not yet be in the eventlist, and advance() would conclude that there was nothing left to do and exit. A solution that I found to work was to insert a "dummy" job into the eventlist that does nothing at time+1e09s. This forces the emulator to query the scheduler for quiescence again and wait until the job is running. My question is whether this is acceptable, as it seems like a sort of band-aid solution. I was thinking it would be alright for now, especially if we decide to rewrite the emulator to interact with flux as a jobtap plugin, since we would get better control over the job lifecycle at that point.

Regarding that, I had a conversation before the holiday with @cmoussa1 about jobtap, and I think it would be a great idea to make the emulator a jobtap plugin, as it seems like we would get much simpler control over the job lifecycle. This would make the design simpler and also likely decouple the emulator entirely from the rest of flux core (assuming that there isn't some aspect of functionality that I am forgetting).
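The premature-exit behavior and the "dummy" workaround described above can be reproduced in a toy model. The sketch below is plain Python with no Flux calls; `run`, `wait_for_quiescence`, and the single-node setup are hypothetical stand-ins, and the only assumption carried over from the comment is the placeholder offset of 1e9 seconds.

```python
import heapq
from typing import List, Tuple

DUMMY_DELAY = 1e9  # placeholder offset, as described in the comment above


def run(jobs: List[Tuple[str, float]], use_dummy: bool) -> List[str]:
    """Toy single-node emulator; returns the names of jobs that completed."""
    events: List[Tuple[float, str]] = []  # min-heap of (time, name)
    waiting = list(jobs)                  # (name, duration) not yet started
    completed: List[str] = []
    now = 0.0
    node_busy = False

    def wait_for_quiescence() -> None:
        # Stand-in for asking the scheduler to settle: if the node is free,
        # the next waiting job starts and its completion event is recorded.
        nonlocal node_busy
        if waiting and not node_busy:
            name, duration = waiting.pop(0)
            heapq.heappush(events, (now + duration, name))
            node_busy = True

    wait_for_quiescence()      # start the first job
    while events:              # advance() exits once the list is empty
        wait_for_quiescence()
        time, name = heapq.heappop(events)
        if name == "dummy":
            continue           # placeholder event: nothing to do
        now = time
        node_busy = False
        completed.append(name)
        if use_dummy:
            # Keep the list non-empty so the loop re-checks the scheduler.
            heapq.heappush(events, (now + DUMMY_DELAY, "dummy"))
    return completed


jobs = [("a", 5.0), ("b", 3.0)]
print(run(jobs, use_dummy=False))  # exits after the first job
print(run(jobs, use_dummy=True))   # both jobs complete
```

The placeholder works here because it only keeps the loop alive long enough for another quiescence check; a design that learns about job starts directly would not need it.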
One last thing. I am working out of this branch: https://github.com/TauferLab/flux-core-gclab/blob/emulator-jay/src/cmd/flux-emulator.py
When reloading the core
@grondo That did the trick. Thanks!
Not knowing the internals of the emulator at all, this solution sounds fine to me, assuming the
Hello, I am currently working on getting the Flux emulator (originally simulator) from @SteVwonder (#2561) up and running with the latest version of Flux core. I got the code working with the old Flux core version and am now working on merging the code into the newest version of core. While I am doing that, I figured I would ask a few questions. I discussed these issues with @grondo and @wihobbs at the Flux coffee hour on Nov 22, and they suggested opening an issue so @garlick and others could weigh in. Here are the questions:
1. Looking at the code from [WIP] Flux simulator #2561, are there any sections that look like they will need to be completely rewritten because of changes in Flux since the original development? From what I can tell, most things are fairly decoupled and only minor modifications should be needed. However, I am fairly new to Flux, so maybe I am missing something.
2. The original code is missing the ability to handle jobs that are unsatisfiable (Line 259 in flux_simulator.py hangs). What would be the recommended method or tool to implement this? At the coffee hour meeting, it was recommended that I use either a jobtap plugin or wait on cancel exceptions through RPC (or a combination).
3. Are there any other important features that would be useful to implement for users of the finished emulator? I currently want to expand on the post-sim analysis a bit and make sure that job timeouts work properly.
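On the question of unsatisfiable jobs, one piece of the "wait on exceptions" approach is recognizing a fatal exception event in a job's eventlog so the emulator can stop waiting on that job. The sketch below is plain Python and makes several assumptions: the entry layout (`timestamp`, `name`, `context`) follows the eventlog format in Flux RFC 18, severity 0 is treated as fatal, and the sample log, including the exception type and note, is made up for illustration. `find_fatal_exception` is a hypothetical helper, not a Flux API.

```python
import json
from typing import Optional

# Made-up eventlog for illustration only (one JSON object per line).
SAMPLE_EVENTLOG = "\n".join([
    json.dumps({"timestamp": 1.0, "name": "submit",
                "context": {"userid": 1000}}),
    json.dumps({"timestamp": 1.1, "name": "depend"}),
    json.dumps({"timestamp": 1.2, "name": "exception",
                "context": {"type": "alloc", "severity": 0,
                            "note": "unsatisfiable request"}}),
    json.dumps({"timestamp": 1.3, "name": "clean"}),
])


def find_fatal_exception(eventlog: str) -> Optional[dict]:
    """Return the context of the first severity-0 exception, or None."""
    for line in eventlog.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("name") != "exception":
            continue
        context = event.get("context", {})
        # Assumption: severity 0 marks a fatal exception.
        if context.get("severity") == 0:
            return context
    return None


fatal = find_fatal_exception(SAMPLE_EVENTLOG)
if fatal is not None:
    print(f"job failed: type={fatal['type']} note={fatal.get('note', '')}")
```

A simulator loop could use a check like this to drop the job from its pending set instead of blocking on a start event that never arrives.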
Thank you for the advice.