
One EventLoop per process #30

Closed · ysbaddaden opened this issue Jul 26, 2024 · 5 comments
Labels: enhancement (New feature or request)

ysbaddaden (Owner) commented Jul 26, 2024

Instead of one EL per thread (preview_mt) and instead of one EL per EC (#7), we could have one single EL for the whole process!

We currently follow the libevent logic (even in #29): we repeatedly add and remove an fd from the system epoll or kqueue instance. Whenever a fiber would block trying to read (or write), we add the fd to the polling instance, suspend, then on wakeup remove it, read until we would block again, and repeat (add -> wait -> del).

With a single event loop instance, we could add the fd when an IO::FileDescriptor or Socket is created (registering for both read/write) then remove it when we close it. That's it.
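
To make the difference concrete, here is a minimal sketch of that lifecycle (the `Poller` class and its `register`/`unregister` methods are made up for illustration, not the actual API): the fd enters the system poller exactly once when the IO object is created, with both read and write interest, and leaves it only at close.

```crystal
# Illustrative sketch only: `Poller` stands in for the single process-wide
# epoll/kqueue wrapper; its name and methods are made up for this example.
class Poller
  @@instance = new

  def self.instance : Poller
    @@instance
  end

  def initialize
    @ios = {} of Int32 => PolledIO
    @lock = Mutex.new
  end

  def register(fd : Int32, io : PolledIO) : Nil
    # real implementation: epoll_ctl(EPOLL_CTL_ADD) / kevent(EV_ADD) with both
    # read and write interest, edge triggered, exactly once per fd
    @lock.synchronize { @ios[fd] = io }
  end

  def unregister(fd : Int32) : Nil
    # real implementation: epoll_ctl(EPOLL_CTL_DEL) / kevent(EV_DELETE)
    @lock.synchronize { @ios.delete(fd) }
  end
end

class PolledIO
  def initialize(@fd : Int32)
    Poller.instance.register(@fd, self) # added on open...
  end

  def close : Nil
    Poller.instance.unregister(@fd)     # ...removed on close, nothing in between
  end
end
```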

Advantages:

  • take full advantage of epoll/kqueue (over poll): no need to repeatedly enqueue/dequeue the fd from epoll/kqueue, we only add/remove a reader/writer for the fd (can use fine grained lock / maybe lock-free list);
  • no need to deal with the fd potentially being in many event loops when closing;
  • we could put the EventQueue::Node right into IO::FileDescriptor & Socket;
  • in fact, no need for an EventQueue to map fd => Node: the Node can directly link to the IO object (see the sketch after this list);
  • "thundering herd" isn't an issue: just resume one waiting fiber for the process. Done.

Disadvantages:

  • race condition: we can be notified that a fd is ready for read/write before there is an enqueued reader/writer (might be solved with a flag/lock; see the sketch after this list) [1];
  • one thread will have to deal with everything, so a thread in an execution context will process events for the other contexts;
  • if the fd is in multiple contexts, we might resume a reader/writer in a busy context while another is starving (can be considered a bad design: avoid sharing a fd across contexts, prefer to move it);
  • we must split & distribute the ready fibers to each involved execution context (instead of a quicker local enqueue).

To limit the "one context processes events for another context" problem, there could be a dedicated evloop thread always waiting on the event loop. Ready fibers would then always be globally enqueued, which might defeat the stealing logic (we could grab more fibers out of the global queue at once instead of dividing the work).

ysbaddaden (Owner, Author) commented Jul 26, 2024

Let's note that with many event loops we could still do the same (enqueue the fd in all event loops), but then all event loops would be notified about the fd's readiness, whether they care about it or not, and not just once (thanks to edge triggering) but every time a fiber reaches EAGAIN, which would repeatedly interrupt the other event loops.

We'd also have to iterate all event loops on each fd/socket open/close, keep a map of fd => node for each event loop, and so on.

ysbaddaden added the enhancement (New feature or request) label Jul 26, 2024
ysbaddaden (Owner, Author) commented

We identified different issues with a single evloop per process:

  1. all the threads in a context can park themselves, meaning that another context becomes responsible for running the evloop, which may not run for a while (e.g. if that context is running CPU-bound code), introducing delays before we notice that work is available;
  2. enqueues from the evloop may be cross-context, hence go through global enqueues that require a lock (slow + contention), instead of local enqueues that are lock-free (faster + no contention).

We tried to have a dedicated thread running the evloop to circumvent issue 1, but then we fall into issue 2 where all enqueues are always cross-context, and performance degrades.

@straight-shoota proposed to still have multiple evloops, and to keep the fd only in the evloop it was initially added to. Enqueues should usually be local, though sometimes they might be to another context. Said differently, an evloop would take ownership of an fd.

A downside is that if a server runs in one context, accepting client connections, doing some user/vhost authentication and so on, then passes the client socket to another context to handle the actual communication, all enqueues from the evloop will be cross-context, which would hinder performance.

To allow this scenario, we proposed to transfer the ownership of an fd to another evloop when a fiber would block in another context: the fd shall be removed from the previous evloop and added to the new evloop.
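
A sketch of what that transfer might look like; `EventLoop`, its `add`/`remove` methods and `EventLoop.current` are invented stand-ins, not the real Crystal::EventLoop API:

```crystal
# Hypothetical sketch of evloop ownership transfer: when a fiber is about to
# block on an fd that another evloop currently owns, the fd is moved.
class EventLoop
  @@current = new

  def self.current : EventLoop
    # in reality: the evloop of the execution context the fiber is running in
    @@current
  end

  def add(io : PolledIO) : Nil
    # epoll_ctl(EPOLL_CTL_ADD) / kevent(EV_ADD) on this loop's poller
  end

  def remove(io : PolledIO) : Nil
    # epoll_ctl(EPOLL_CTL_DEL) / kevent(EV_DELETE) on this loop's poller
  end
end

class PolledIO
  @owner : EventLoop?

  # called right before a fiber blocks reading or writing on this fd
  private def ensure_owned_by_current_loop : Nil
    current = EventLoop.current
    return if @owner == current           # common case: the fd stays put
    @owner.try { |old| old.remove(self) } # leave the previous evloop...
    current.add(self)                     # ...and join the loop we block in
    @owner = current
  end
end
```

The common case (an fd used from a single context) is a cheap comparison; only a genuine cross-context move pays for the remove/add pair.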

An advantage is that the fd will follow the context it's being used in, so enqueues from the evloop should usually be local.

A drawback is that the fd may keep jumping from one evloop to another, which would bring us back to the add/remove scheme we currently have with libevent. We consider this scenario to be infrequent, though: usually a fd should be owned by a single fiber, and only sometimes transferred to another fiber in another context.

An example scenario that wouldn't fare well is printing messages to a socket from different contexts: the fd would keep jumping across contexts, and performance would suffer. But writing to an IO from different threads is unsafe anyway (even with sync you'll end up with intertwined messages), so you need something to protect writes (e.g. a mutex), in which case you should instead use a Channel and spawn a dedicated fiber to do the writes, which is exactly how Log works. Then the fd stops jumping from one evloop to another.
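
For reference, the Channel-plus-dedicated-fiber pattern looks roughly like this (a generic sketch, not Log's actual implementation):

```crystal
# All writes to a shared IO go through a Channel drained by one dedicated
# fiber, so only that fiber ever blocks on the fd and messages never interleave.
channel = Channel(String).new
done = Channel(Nil).new

# the dedicated writer fiber: the only fiber that touches the shared IO
spawn do
  while message = channel.receive?
    STDOUT.puts(message)
  end
  done.send(nil)
end

# any other fiber, in any context, sends to the channel instead of writing directly
10.times do |i|
  channel.send("message #{i}")
end

channel.close
done.receive
```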

ysbaddaden (Owner, Author) commented

I implemented this, codenamed the "lifetime" evloop, against the current Crystal (without execution contexts). The initial results were attractive, and the semi-final results are impressive.

With the basic HTTP::Server bench, tested with the wrk client:

preview_mt/libevent mt:2 => 131k req/s
preview_mt/epoll    mt:2 => 158k req/s (+20%)

preview_mt/libevent mt:4 => 177k req/s
preview_mt/epoll    mt:4 => 214k req/s (+20%)

NOTE: wrk starts N connections and keeps them open to run as many HTTP requests as possible, which is the ideal scenario.

ysbaddaden self-assigned this Sep 3, 2024
RX14 commented Sep 4, 2024

I presume there'll be some EC impls with different event loops kept around to ensure that the design doesn't preclude per-EC event loops or global event loops?

ysbaddaden (Owner, Author) commented

This is almost merged: crystal-lang/crystal#14996
