
[embassy-executor]: Upstream "Earliest Deadline First" Scheduler #4035


Open · wants to merge 13 commits into main from james/upstream-drs

Conversation

@jamesmunns (Contributor) commented Apr 1, 2025

This PR builds on top of #4033, and shouldn't be merged until that PR is.

This PR also requires hawkw/mycelium#520 to be merged and released before this one can land.

This PR adds the "Earliest Deadline First" scheduler (originally named the "Deadline Rank Scheduler"), a cooperative (non-preemptive) scheduling algorithm that takes the deadline of a task into account when deciding which of the tasks in the WAITING state to poll first.
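For illustration, the EDF selection rule itself is tiny. Here is a minimal, self-contained sketch with hypothetical types, not the actual embassy-executor API:

```rust
/// Hypothetical ready-task record, for illustration only.
struct ReadyTask {
    id: u32,
    /// Absolute deadline in timer ticks; smaller means more urgent.
    deadline: u64,
}

/// The EDF rule: among the ready tasks, poll the one whose deadline
/// comes earliest.
fn pick_next(ready: &[ReadyTask]) -> Option<&ReadyTask> {
    ready.iter().min_by_key(|t| t.deadline)
}

fn main() {
    let ready = [
        ReadyTask { id: 1, deadline: 500 },
        ReadyTask { id: 2, deadline: 120 },
        ReadyTask { id: 3, deadline: 300 },
    ];
    // Task 2 has the earliest deadline, so it is polled first.
    assert_eq!(pick_next(&ready).map(|t| t.id), Some(2));
}
```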

This PR also re-implements the run_queue_atomics implementation on top of cordyceps::TransferStack instead of the previous home-rolled atomic linked list/stack, in order to reduce duplicated code.

@jamesmunns requested review from lulf and Dirbaio · April 1, 2025 17:37
@jamesmunns (Contributor, Author):

CC @hawkw for thoughts

@jamesmunns (Contributor, Author):

https://crates.io/crates/cordyceps/0.3.3 is now published, I'll get this PR updated to use the published crate tomorrow.

@jamesmunns force-pushed the james/upstream-drs branch from b718466 to 46a975b · April 8, 2025 08:08
@jamesmunns (Contributor, Author):

Updated with the released cordyceps; this should now be reviewable and mergeable.

@lulf (Member) left a comment:

No objection from me, seems like a nice improvement with the cordyceps usage as well.

@Dirbaio (Member) commented Apr 10, 2025

Some concerns:

  • Deadline scheduling is not implemented in the critical-section run queue, only in the atomics one.
  • I don't think we should add cordyceps as a dep, I'd prefer to keep the data structures self-contained for something as foundational as the executor. Having it all in-tree means we can make changes as needed without having to coordinate across repos, and less abstraction means it's clearer what's going on in this case IMO. (Intrusive data structures aren't easy to abstract in Rust...)
  • dequeue_all is transferring all tasks to a sorted list, then walking that. Another way to do it would be to walk the runqueue itself to find the earliest deadline, pop it, poll it, and repeat. Both are O(n^2), but the latter is arguably simpler because it doesn't require SortedList, which might alleviate the need for cordyceps.
  • Where does the "Deadline Rank Sorted" terminology come from? The only hit on Google is literally this PR. The most common term seems to be "Earliest Deadline First" (EDF) scheduling; maybe we should call it that, or just "deadline scheduler" as Linux does.

@jamesmunns (Contributor, Author) commented Apr 10, 2025

Appreciate the reviews!

> Deadline scheduling is not implemented in the critical-section run queue, only in the atomics one.

Yep, this could be addressed in a future PR; right now we've only been using this on targets with atomics. How we do this might depend on what we decide to do with cordyceps.

> I don't think we should add cordyceps as a dep, I'd prefer to keep the data structures self-contained for something as foundational as the executor. Having it all in-tree means we can make changes as needed without having to coordinate across repos, and less abstraction means it's clearer what's going on in this case IMO.

I'd like to try to make the case for keeping the cordyceps dependency! In its favor:

  • Cordyceps is fairly exhaustively tested, with both miri and loom, which helps avoid regressions caused by future changes.
  • Cordyceps is also used by the maitake executor, which is certainly less active than embassy, but it does mean that any regressions caught or improvements made are shared. The data structures used, particularly TransferStack and Stack, are designed for exactly the use case embassy's atomic RunQueue had before, meaning this changeover was basically a drop-in change.
  • Cordyceps has correct, tested user APIs, for example for dequeuing or iterating over items, meaning there is less unsafe surface area for anyone making changes in embassy-executor to work on.

We certainly COULD "hard-fork" cordyceps, but by the time we've brought over the data structures and tests, we will have essentially vendored a significant part of the crate.

I could also narrowly "backport" the changes needed for DRS on top of the hand-rolled impl that embassy used, but that would be disappointing, IMO, for the testing and shared-usage reasons mentioned above.

> Intrusive data structures aren't easy to abstract in Rust...

I'd definitely point to the Linked trait used in cordyceps: IMO it is still a bit of an unsafe API, but it DOES do a good job of providing the basics you need to abstract over intrusive structures. This commonality is what allowed TaskHeaders to be moved between TransferStack, Stack, and SortedList, as they all share a common singly-linked-list abstraction.
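For readers unfamiliar with it, the trait's shape is roughly the following (paraphrased; see the cordyceps crate docs for the authoritative definition):

```rust
use core::ptr::NonNull;

/// Rough paraphrase of cordyceps' `Linked` trait: an implementor says
/// how to convert between an owning handle and a raw pointer, and
/// where the intrusive links live inside the node.
pub unsafe trait Linked<L> {
    /// The owning handle type, e.g. a `Box<Self>` or a raw task ref.
    type Handle;

    /// Give ownership of the node to the collection as a raw pointer.
    fn into_ptr(r: Self::Handle) -> NonNull<Self>;

    /// Take ownership back from a raw pointer.
    ///
    /// # Safety
    /// `ptr` must have come from `into_ptr` and must not be aliased.
    unsafe fn from_ptr(ptr: NonNull<Self>) -> Self::Handle;

    /// Return a pointer to the links embedded in the node.
    ///
    /// # Safety
    /// `ptr` must point to a live node.
    unsafe fn links(ptr: NonNull<Self>) -> NonNull<L>;
}
```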

> dequeue_all is transferring all tasks to a sorted list, then walking that. Another way to do it would be to walk the runqueue itself to find the earliest deadline, pop it, poll it, repeat. Both are O(n^2) but the latter is simpler maybe, because it doesn't require SortedList which might alleviate the need for cordyceps?

The reason that it is done that way is that the TransferStack is atomically shared (with interrupts!), while the local Stack is not. This allows us to do one atomic dequeue_all, then not require a mutex/critical section for either the sorting/insertion operation, or the subsequent pop operations.
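A sketch of that pattern, with a hypothetical Node type and a std Vec for brevity (the real implementation is intrusive and allocation-free):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical node; embassy's TaskHeader carries much more state.
struct Node {
    deadline: u64,
    next: *mut Node,
}

// Shared LIFO head, pushed to by wakers (potentially from interrupts).
static RUN_QUEUE: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

/// One atomic swap detaches the whole pending list. From that point on
/// the executor thread owns every detached node exclusively, so the
/// sort and the subsequent pops need no mutex or critical section.
fn dequeue_all_sorted() -> Vec<*mut Node> {
    let mut head = RUN_QUEUE.swap(ptr::null_mut(), Ordering::AcqRel);
    let mut batch = Vec::new();
    while !head.is_null() {
        batch.push(head);
        // SAFETY: nodes detached from the shared head are ours alone.
        head = unsafe { (*head).next };
    }
    batch.sort_by_key(|&n| unsafe { (*n).deadline });
    batch
}

fn main() {
    // Build a pending list by hand; a real waker push is a CAS loop.
    for d in [300u64, 100, 200] {
        let next = RUN_QUEUE.load(Ordering::Relaxed);
        let node = Box::into_raw(Box::new(Node { deadline: d, next }));
        RUN_QUEUE.store(node, Ordering::Relaxed);
    }
    let order: Vec<u64> = dequeue_all_sorted()
        .iter()
        .map(|&n| unsafe { (*n).deadline })
        .collect();
    assert_eq!(order, vec![100, 200, 300]); // earliest deadline first
}
```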

It may be possible to do "search and extract" without a mutex, but it would definitely be "spicier" to reason about, especially without miri and loom testing.

> Where does the "Deadline Rank Sorted" terminology come from? Only hit on Google is literally this PR. Seems the most common term is "Earliest Deadline First" (EDF) scheduling, maybe we should call it that. Or maybe just "deadline scheduler" as Linux does.

It comes from an internal proposal document, though I don't think there's any attachment to the name; we can rename it if you prefer. There's a chance of confusion if we reuse a well-known name for something that doesn't work exactly the same way, but I'm not too stressed about it. I'll ask internally whether there's prior art for the chosen name.

Either way, I'll resolve the merge conflict so it's clean!

* Start hacking in cordyceps

This adds a third kind of runqueue, for now it should work the same as the
current "atomics" runqueue, but uses a cordyceps TransferStack instead of
the existing home-rolled linked list.

* Clean up, use new cordyceps feature

* A bit more cleanup

* Update docs to be more clear

This implements a minimal version of Deadline Rank Scheduling, as well as ways to access and set Deadlines.

This still needs some UX improvements, but is likely enough for testing.
@jamesmunns changed the title from “[embassy-executor]: Upstream "Deadline Rank Sorted" Scheduler” to “[embassy-executor]: Upstream "Earliest Deadline First" Scheduler” · Apr 16, 2025
@jamesmunns (Contributor, Author):

Rebased the PR, and changed the name to "Earliest Deadline First" scheduler.

I also added an example using the scheduler and Deadline interface. I decided to add another example project, since I didn't want to affect any existing examples, and (I think) there isn't really a good way to have required features (for embassy-executor) on a per-bin basis.

Happy to update if there's a better way though!

@jamesmunns (Contributor, Author):

The current basic demo gives this as output, by the way:

Spinning up load tasks...
Starting UNPRIORITIZED worker trials

Starting unprioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 19094
Took 192.86868% of ideal time

Starting unprioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 19105
Took 192.9798% of ideal time

Starting unprioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 19105
Took 192.9798% of ideal time
Starting PRIORITIZED worker trials

Starting prioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 10063
Took 101.64646% of ideal time

Starting prioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 10064
Took 101.65657% of ideal time

Starting prioritized worker
Trial complete, theoretical ticks: 9900, actual ticks: 10064
Took 101.65657% of ideal time

Trials Complete.

@jamesmunns (Contributor, Author):

@lulf do you have any thoughts on how to turn the example I added in 6665b72 into a CI-able test?

I see there's some infra for enabling "features" for targets, but I'm not sure if I could make a test that enables a totally separate embassy-executor feature for testing?

@lulf (Member) commented Apr 17, 2025

> @lulf do you have any thoughts on how to turn the example I added in 6665b72 into a CI-able test?

I'm not sure I understand the end goal: is it to test that the example works, or that EDF works on nrf? For the latter, I think adding a test in tests/nrf/ with the appropriate // required-features header should be fine. OTOH, is there a need to make a hardware-specific test of it?

> I see there's some infra for enabling "features" for targets, but I'm not sure if I could make a test that enables a totally separate embassy-executor feature for testing?

This should work I think: you add a feature in tests/nrf/Cargo.toml named 'edf' which enables it on the embassy-executor dep, and add the // required-features: edf header in the test.

As for the other comments on the PR, I'm fine with the cordyceps dep, and actually think it's good not to roll our own. (To me, it seems even better to reuse something external for the additional 'hardening' you get; you can reduce risk by pinning the version.)

@jamesmunns (Contributor, Author):

> is there a need to make a hardware-specific test of it?

Hmm, probably not; mostly just ensuring there is some basic regression testing for the feature, in case something happens in future refactoring. I'll think about it a bit and do a follow-up PR if it makes sense.

> As for the other comments on the PR, I'm fine with the cordyceps dep, and actually think it's good not to roll our own. (To me, it seems even better to reuse something external for the additional 'hardening' you get; you can reduce risk by pinning the version.)

Agreed :) I'd like to get an OK (or at least a "meh, whatever") from @Dirbaio before I actually hit merge, though, since he raised objections.

@Dirbaio (Member) commented Apr 17, 2025

> Yep, this could be addressed in a future PR; right now we've only been using this on targets with atomics

I would prefer everything to land together. Reasons:

  • To ensure it doesn't stay unimplemented forever.
  • It'll force us to find a design that doesn't duplicate the scheduler code. As I mentioned on Matrix a few weeks ago, you're strongly coupling the scheduler with the queue. This means if we have N schedulers and M queues, we'll end up with O(N*M) code, while ideally it should be O(N+M).

re cordyceps: it's not about its quality (it's great), it's about the complexity and amount of code embassy-executor contributors and maintainers have to deal with. The status quo is a straightforward implementation of a transfer stack with two atomic ops.
Cordyceps brings in the Linked trait, which is somewhat nontrivial, plus a whole set of containers with their own API surface and docs that you have to get familiar with, and a much larger implementation underneath because it supports more ops than the executor needs.
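For reference, the status quo being described is roughly this shape (a hypothetical sketch, not the actual embassy source): push is one CAS loop, take-all is one swap.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Minimal intrusive node for illustration.
struct Node {
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

/// Push: link the new node in front of the current head with a CAS.
fn push(node: *mut Node) {
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        // The node isn't visible to anyone else yet, so this write is safe.
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return,
            Err(actual) => head = actual,
        }
    }
}

/// Take-all: one swap hands the entire list to the consumer.
fn take_all() -> *mut Node {
    HEAD.swap(ptr::null_mut(), Ordering::AcqRel)
}

fn main() {
    push(Box::into_raw(Box::new(Node { next: ptr::null_mut() })));
    assert!(!take_all().is_null());
}
```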

Also, using cordyceps won't magically make the executor correct. For example, one key requirement is that you mustn't enqueue tasks that are already enqueued. This is ensured by State, with all its different task states and atomic orderings, which is somewhat nontrivial and has had bugs in the past. Cordyceps would prevent bugs in the queue itself, but we've never had bugs in the queue itself, because it's so simple.
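That enqueue-once invariant can be sketched like this (heavily simplified; the real State tracks more states and subtler orderings):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const RUN_QUEUED: u32 = 1 << 0; // hypothetical bit layout

struct State(AtomicU32);

impl State {
    /// Returns true only for the caller that flipped RUN_QUEUED from
    /// 0 to 1; every other concurrent waker sees the bit already set
    /// and must NOT push the task, preventing a double enqueue.
    fn try_enqueue(&self) -> bool {
        (self.0.fetch_or(RUN_QUEUED, Ordering::AcqRel) & RUN_QUEUED) == 0
    }

    /// Cleared by the executor once the task has been dequeued, making
    /// it eligible for wakeup again.
    fn clear_queued(&self) {
        self.0.fetch_and(!RUN_QUEUED, Ordering::Release);
    }
}

fn main() {
    let s = State(AtomicU32::new(0));
    assert!(s.try_enqueue());  // first wake enqueues
    assert!(!s.try_enqueue()); // second wake is a no-op
    s.clear_queued();
    assert!(s.try_enqueue()); // eligible again after dequeue
}
```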

If we want to catch these bugs we'd have to add Loom testing to the whole executor, and if we do that then the fact that Cordyceps has loom testing of the data structures in isolation doesn't help that much.

We already do have miri testing btw.

> The reason that it is done that way is that the TransferStack is atomically shared (with interrupts!), while the local Stack is not. This allows us to do one atomic dequeue_all, then not require a mutex/critical section for either the sorting/insertion operation, or the subsequent pop operations.
>
> It may be possible to do "search and extract" without a mutex, but it would definitely be "spicier" to reason about, especially without miri and loom testing.

This is a very good point. It should be doable though. When a task is enqueued it's "exclusively owned" by the executor thread until the RUN_QUEUED bit is cleared, so it should be fine to iterate and pop. You only have to worry about contention in the queue head I think.

The reason I think "iterate queue to choose which task to poll" is better is that it makes it easier to decouple the scheduler from the runqueue. A "scheduler" could be a thing that gets to iterate the runqueue and emits a verdict on which task to poll next. The default scheduler would decide to poll the first task it sees; the deadline scheduler would iterate to find the earliest deadline and decide to poll that.
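One possible shape for that decoupling, sketched over a plain slice for brevity (the real thing would iterate the intrusive runqueue; all names here are hypothetical): N scheduler policies and M queues then compose as O(N+M) code rather than O(N*M).

```rust
/// Hypothetical ready-task handle.
struct TaskRef {
    deadline: Option<u64>,
}

/// A scheduler is just a policy: look at the ready tasks, return the
/// index of the one to poll next.
trait Scheduler {
    fn choose(&self, ready: &[TaskRef]) -> Option<usize>;
}

/// Default policy: poll the first task seen.
struct Fifo;
impl Scheduler for Fifo {
    fn choose(&self, ready: &[TaskRef]) -> Option<usize> {
        if ready.is_empty() { None } else { Some(0) }
    }
}

/// Deadline policy: poll the task with the earliest deadline;
/// tasks without a deadline sort last.
struct Edf;
impl Scheduler for Edf {
    fn choose(&self, ready: &[TaskRef]) -> Option<usize> {
        ready
            .iter()
            .enumerate()
            .min_by_key(|(_, t)| t.deadline.unwrap_or(u64::MAX))
            .map(|(i, _)| i)
    }
}

fn main() {
    let ready = [TaskRef { deadline: None }, TaskRef { deadline: Some(100) }];
    assert_eq!(Fifo.choose(&ready), Some(0));
    assert_eq!(Edf.choose(&ready), Some(1));
}
```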

@jamesmunns (Contributor, Author):

> you're strongly coupling the scheduler with the queue

This is why I re-implemented the existing atomic runqueue on top of the TransferStack, which is how I originally interpreted your request to de-duplicate. If this PR is merged, we would be in the same state as we are today: there are two runqueues, one atomic and one non-atomic.

The current code is (I believe) a proof of concept that we are already at the point where you can implement different schedulers with the same runqueue: both the "regular" and "edf" atomic variants share the same data structure, constructor, and enqueue method; only the dequeue method differs.

> This is a very good point. It should be doable though. When a task is enqueued it's "exclusively owned" by the executor thread until the RUN_QUEUED bit is cleared, so it should be fine to iterate and pop. You only have to worry about contention in the queue head I think.
>
> The reason I think "iterate queue to choose which task to poll" is better is because it makes it easier to decouple the scheduler from the runqueue. A "scheduler" could be a thing that gets to iterate the runqueue and emits a verdict of which task to poll next. The default scheduler would decide to poll the first task it sees, the deadline scheduler would iterate to find the earliest deadline and decide to poll that.

Performance-wise, I think they will be equivalent: pop-sort-pop (cordyceps) is basically O(1) + O(n*m) + O(1), while search-pop (your proposal) is O(n*m) + O(1). That said, it might be possible to mitigate the O(n*m) cost using SortedList, as we might be able to "zipper merge" the two lists, only paying the more expensive cost on the "new" items.
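To show the cost model of that zipper merge, here is an illustrative version over plain Vecs (the intrusive SortedList version would splice nodes instead of copying): sorting only the m new arrivals costs O(m log m), and the merge itself is a single O(n + m) pass.

```rust
fn zipper_merge(sorted_backlog: Vec<u64>, mut new_batch: Vec<u64>) -> Vec<u64> {
    // Only the new arrivals need sorting.
    new_batch.sort_unstable();
    let mut out = Vec::with_capacity(sorted_backlog.len() + new_batch.len());
    let mut a = sorted_backlog.into_iter().peekable();
    let mut b = new_batch.into_iter().peekable();
    // Single linear pass over both lists.
    while let (Some(&x), Some(&y)) = (a.peek(), b.peek()) {
        if x <= y {
            out.push(a.next().unwrap());
        } else {
            out.push(b.next().unwrap());
        }
    }
    // One side is exhausted; drain whatever remains of the other.
    out.extend(a);
    out.extend(b);
    out
}

fn main() {
    let merged = zipper_merge(vec![10, 30, 50], vec![40, 20]);
    assert_eq!(merged, vec![10, 20, 30, 40, 50]);
}
```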

I worry about edge cases and complexity when it comes to the search + pop. We'd need to:

  • Make an Option<NonNull<TaskHeader>> for the PARENT of the chosen task to pop
  • If head is none, return
  • If head->next is none, CAS the head we loaded against the head ptr to swap it to None, to see if something new has been inserted
    • if no: return head, since we popped it
    • if yes: we need to walk the list from the top to figure out how to remove what used to be head; but should we now start over, because there are new items to consider?
  • Start at the head and scan through the entire list, checking whether cursor.next is what we want; if so, store cursor as the desired parent
  • Once cursor.next is null (the scan is done), store off parent.next, set parent.next to parent.next.next, and return the stored-off node

Basically, we would need to reimplement this code:

https://github.com/hawkw/mycelium/blob/9ec5c3350a7f6102d627185977a5f225af1f8b4c/cordyceps/src/sorted_list.rs#L168-L283

With the added complexity that those two initial head special cases are ALSO now CAS loops instead of just regular pointer ops, and we might have to start over or do a more complicated removal if HEAD was moved at all while we do the first two steps.

In general: I think trying to search in-place is going to be MUCH more complex for little gain, performance or API clarity wise. I'd probably push back on this request for the EDF scheduler.
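For a sense of the pointer juggling involved: even a non-atomic sketch of "search and extract the earliest" needs careful parent-link tracking, and the atomic version would add CAS loops and restart logic on top (hypothetical Node type, no concurrency handled):

```rust
use core::ptr;

struct Node {
    deadline: u64,
    next: *mut Node,
}

/// Single-threaded sketch: unlink and return the node with the
/// earliest deadline, tracking the *link* that points at the current
/// best candidate so we can splice it out.
unsafe fn pop_earliest(head: &mut *mut Node) -> *mut Node {
    if head.is_null() {
        return ptr::null_mut();
    }
    // The link currently pointing at the best candidate (starts at head).
    let mut best_link: *mut *mut Node = head;
    let mut cursor = *head;
    while !(*cursor).next.is_null() {
        let next = (*cursor).next;
        if (*next).deadline < (**best_link).deadline {
            best_link = &mut (*cursor).next;
        }
        cursor = next;
    }
    // Splice the best node out of the list and return it.
    let best = *best_link;
    *best_link = (*best).next;
    best
}
```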

Regarding the "one true waitqueue", would you be interested in a CS-capable version of cordyceps::TransferStack? This would allow us to actually eliminate the separate runqueues that exist today (from the perspective of embassy-executor code), while keeping the same API for both, and would be minimally invasive to the current code structure.

If not, there are two alternatives I can think of:

The first would be to "inline" the functionality in embassy-executor, keeping two separate runqueues, and inventing a new API, essentially identical to cordyceps' TransferStack, to allow abstraction of the single runqueue for different scheduler implementations.

The second would be to rework the raw module somewhat significantly, primarily making dequeue_all return the detached runqueue parts, allowing the caller to choose the scheduling algorithm for polling the detached linked list. However, this would negatively impact the EDF impl, which deliberately doesn't process a whole batch of tasks before it goes to detach more: if a high-prio task shows up JUST after we detach a batch, we don't need to process the whole batch first, only the first task that was scheduled.

So, overall:

  • I don't think it would be a good idea to pursue "search then pop", for the complexity reasons mentioned above, which would also apply to ALL schedulers, not just the EDF scheduler.
  • Regarding Run Queues, I think there are two choices:
    1. Replace the critical-section runqueue with a cs-capable TransferStack, use that for all runqueues
    2. Reimplement SortedList functionality inside of embassy-executor, remove the cordyceps dependency

I'm more than happy to implement a CS-based version of cordyceps::TransferStack to allow for the simplification of the RunQueue interface, if having cordyceps isn't a hard no from you. Otherwise I'll pursue re-implementing SortedList manually to match the previous atomic runqueue.
