Performance Benchmark #4221

karpinskiJ · 2024-12-31T11:58:41Z

karpinskiJ
Dec 31, 2024

Hello!
I'm kind of rookie when it comes to cats effect.
I've watched a few conference talks and read many articles describing cons of using cats effect and its performance, but I'm struggling to find any benchmark showing its performance (in terms of processing time, cpu and memory usage) comparing to other solutions (as pure project Loom or Kotlin coroutines).

My question is does someone performed that kind of benchmark?
I would appreciate sharing results!

Answered by djspiewak

Dec 31, 2024

We've definitely done a ton of work on performance and have a vast array of measurements, but unfortunately it's a pretty broad topic. that also involves a lot of subtle tradeoffs. For example, Linux sacrifices throughput performance (p50) on process and thread scheduling under load in order to maintain responsiveness (p99). Does that means that Linux has worse performance than QNX? Well… yes, in a sense, but no one would recommend using QNX for a desktop (or even server) operating system! "Performance" is always relative to your target use-case and assumptions.

Assumptions

Taking a step back… Cats Effect on the JVM is optimized for long-lived backend network applications deployed in envi…

View full answer

djspiewak · 2024-12-31T17:26:14Z

djspiewak
Dec 31, 2024
Maintainer

We've definitely done a ton of work on performance and have a vast array of measurements, but unfortunately it's a pretty broad topic. that also involves a lot of subtle tradeoffs. For example, Linux sacrifices throughput performance (p50) on process and thread scheduling under load in order to maintain responsiveness (p99). Does that means that Linux has worse performance than QNX? Well… yes, in a sense, but no one would recommend using QNX for a desktop (or even server) operating system! "Performance" is always relative to your target use-case and assumptions.

Assumptions

Taking a step back… Cats Effect on the JVM is optimized for long-lived backend network applications deployed in environments with many physical threads. "Many" in this case generally should be taken to mean "8 or more", but really Cats Effect thrives the more you scale it vertically. "Long-lived" should be taken to mean "at least several hours" (i.e. generally not serverless). "Networked" in this case should be taken to mean that most application processing is oriented around data transformations originating from and routing to network socket I/O. (i.e. you're not just loading a dataset into memory off disk and doing a ton of math on it) If you're outside the bounds of this target use-case, you may see good results from Cats Effect, but you also may not.

To reify these assumptions a bit, consider the network socket I/O constraint. Network I/O is a fundamentally slow thing (relative to node-local hardware), where a single packet moving between nodes usually takes something on the order of milliseconds at least unless you're doing something specialized like infiniband (in which case… see my point about target use-cases again). If we're making the assumption that most processing begins and/or ends with network socket I/O (potentially involving even more of it along the way), then that scales our order of magnitude for what kind of performance is considered "fast" or "slow". Really, the socket is the speed of light for our application, and trying to go faster than it serves no purpose.

Now, there's some nuance to this, since backend network applications often manage many many sockets in parallel (this is actually the whole point of asynchronous I/O), and so when you break it down, the hardware compute time slices for each socket are much smaller than what is allocated to the network itself. This is an important concern! However, there are other hard limits which still put a bound on the granularity of timeslice: file handle limits. Each socket represents a pair of file handles, and these are a very limited resource on any system, but particularly those which are virtualized within container environments (e.g. k8s) since file handles are a shared pool across the entire host. This in turn means that the maximum number of concurrent sockets per process is actually relatively low (usually thousands or tens of thousands), which in turn compresses our speed of light from "order of magnitude milliseconds" to "order of magnitude microseconds". But a microsecond is still a long, long time.

On an older Intel I9 processor, Cats Effect's IO's flatMap function, including both allocations (the lambda and the IO instance), all interpretation overhead, all GC overhead, etc… runs in about 7 nanoseconds. On my liquid cooled AMD 5950X, it's about 3 nanoseconds. On more modern hardware (such as server-grade x86_64 or Apple Silicon), we're about an order of magnitude faster still. So that implies that, in a realistic application, you can squeeze order-of-magnitude tens of thousands of flatMaps into the handling logic for a single network socket while still remaining under the ceiling. Most connection processing logic, in practice, is a small fraction of that, even when written in a highly functional style with frameworks like Fs2 and Http4s.

All of this is to say that, in general, when it comes to straight line performance, the bottleneck won't be flatMap or any of the associated machinery. This is an important point, because flatMap (and the allocations) tends to get the majority of the attention when people talk about performance, mostly because it's very easy to measure.

Scheduling

This is really only one piece of the puzzle though. Another major piece (which turns out to be a lot more impactful in practice) is scheduling overhead. When you're handling many thousands of concurrent connections simultaneously on a small number of physical threads, you need a way of multiplexing and juggling this work such that fairness is respected and an absolute minimum of overhead is imposed. I already alluded to how Linux's kernelspace scheduler makes some specific assumptions in how it tries to solve this problem, and it turns out that those assumptions (timeslice fairness) are a really poor fit for modern networked backend applications.

The solution to this is to introduce the notion of a userspace scheduler which handles the multiplexing problem (also known as "m:n scheduling") in a way which biases the implementation more towards throughput and away from responsiveness. We can get away with this because the kernelspace scheduler will always step in if something needs to be truly preemptive, while allowing the userspace scheduler to carry the bulk of the work. In Cats Effect, as in ZIO, Tokio, GoLang, and others, this scheduling is implemented via a work-stealing algorithm, where the units of work are yield sliced coroutines.

Scheduling is a really complex topic and we could rabbit hole on it for a long time, but very briefly, the core problem is to make sure that all the requisite compute work is completed in the lowest possible effective time (i.e. time spent by that work actually occupying a physical thread in the underlying hardware), and also the lowest possible wall clock time (i.e. real world time from initial scheduling to completion), and with intermediate progress at a regular rate. These goals are all fundamentally in tension with each other, and so the problem of scheduling is one of innate tradeoffs and assumptions. We choose the set of tradeoffs which optimize for networked backend applications.

Additionally, the problem of scheduling is tangled with the problem of directly interacting with the OS kernel (syscall management) and interrupts (notably, timers, but also other forms of hardware bus such as PCIe). The kernel needs to do work on our behalf in order for any of our userspace compute to matter, and that work takes time and resources which themselves need to be scheduled. These problems are inseparable because the kernel brokers the hardware interrupts on our behalf, as well as the program counter register on the CPU itself. Meaning, when we aren't under load, we need to suspend our kernel level threads with a mechanism which is interrupt-aware, and when we are under load we need to make time to periodically cross into kernelspace and gain awareness of any outstanding events. The exact mechanics of this are highly OS and even architecture and version dependent, but they have a massive impact on application performance, particularly in the area of cache management (especially L1 and L2) and kernelspace timeslice granularity.

In practice, the difference between doing well at this problem and doing poorly at this problem is not measured in a few percentage points, but rather in multiple orders of magnitude. Put another way, if we optimize flatMap we can shave a few fractions of a nanosecond off its runtime, but if we optimize the scheduler we can shave hundreds of microseconds or even a few milliseconds off of the handling of each and every one of your connections. The README has a good (albeit highly simplistic) example of this with some numbers, and similarly so do the recent release notes for 3.6.0-RC1.

Runtime Layering

Between the layer that users interact with when they write their programs (i.e. the IO API) and the scheduler is a third layer: the coroutine interpreter. It is this layer which is responsible for taking user-defined programs (represented as instances of IO) and executing each step in sequence with an absolute minimum of resource overhead, ensuring constant stack utilization, safe resource management, asynchronous interruption, preemption (often called "auto-yielding"), and threadless suspension. This is a hard problem by itself, but also a relatively well-understood one.

This is the only layer which is addressed by systems like Kotlin's coroutines, or Loom's virtual threads. Focusing on Loom for a moment, a virtual thread basically causes the JVM to add a layer of lifting in between the machine execution and the userspace bytecode, with task boundaries at any blocking call (with a few exceptions, like synchronized). So in other words, this coroutine runtime moves into the JVM itself. In theory this should result in considerably higher performance from this layer of the stack, but in practice the JVM seems to be about an order of magnitude or more slower than the coroutine interpreter which is inside of Cats Effect (source: some Ox benchmarks published by SoftwareMill, and some comparative Kyo benchmarks).

It is very likely that this disadvantage in coroutine interpretation is temporary. The engineers who work on the JVM are incredibly smart folks, and they're very well aware of all the prior art in this space. I have no doubt that they will optimize this layer quite heavily in time. However, I doubt they will ever be able to exceed the performance of Cats Effect's coroutine layer, simply because what Cats Effect is doing is more or less already at the theoretical limit, not just for how you would encode this functionality on the JVM, but for how you would encode this functionality on x86_64 and ARM hardware. The JVM's access to raw OS primitives and unsafe memory management is very unlikely to yield any material benefits in the fundamental coroutine state machine, and thus my expectation is that, within a few JDK versions, virtual threads will be roughly-speaking as fast as Cats Effect's fibers, but not faster by any measurable amount.

Now, virtual threads (and Kotlin's coroutines) do have one significant straight-line performance advantage over Cats Effect's fibers in that their task yield point boundaries are considerably coarser than Cats Effect's. What I mean by this is in Cats Effect, it is idiomatic to write fairly fine-grained IOs, glueing them together with flatMap. In this encoding, flatMap effectively serves the same function as semicolon in Java: connecting two sequential statements into a program consisting of both in order. However, in Cats Effect, every flatMap is a safe point at which the coroutine interpreter is free to preempt the fiber and yield to other outstanding work; in Kotlin and Loom, safe points only happen when you reach an asynchronous suspension, which in the case of virtual threads is a thread-blocking operation (such as a socket read()). These are many orders of magnitude less frequent than semicolons, and so safe points are much rarer and there are fewer opportunities to preempt work.

This results in the classic "throughput vs fairness" tradeoff in the scheduler. With longer critical sections and larger tasks, each individual task is able to execute much faster on a stable working memory set, and there are fewer polymorphic and conditional jumps to confuse the JIT and CPU's speculative execution, allowing for higher efficiency overall. However, this also means that tasks effectively "hog" the underlying physical resources and it becomes really difficult for the scheduler to help the user. Whether this is good or bad for performance is highly dependent on the problem space, but empirically, we've generally found that tasks about three to four orders of magnitude larger than the typical IO and many many orders of magnitude smaller than the typical span between asynchronous suspensions produces the best results for most typical applications. (this based on about a decade of painstaking analysis and experimentation on large production applications across many companies)

There are some additional ancillary considerations here, such as the fact that virtual threads do not help with the stack safety problem, cannot implement safe resource handling, and also cannot implement reliable asynchronous interruption semantics (e.g. timeouts).

Summary

Hopefully you're getting the idea that this is a really complex and nuanced space! :)

At the end of the day, IMO, you're not asking the right question. It isn't meaningful to ask "what is the performance of Cats Effect?" any more than it is meaningful to ask "what is the performance of Scala?" Instead, you should ask "what is the performance of my application when built on top of Cats Effect?" But of course, that's not a question we can answer. You need to measure it for yourself. I realize that's a bit of a cop-out, but at the end of the day, that's really the only thing that matters: your application.

Anecdotally, I (and many other people) have been using Cats Effect and its predecessors in a lot of different production environments for well over a decade at this point. Major household name products have been and continue to be built on top of it, as well as innumerable applications that are far less well known. Collectively, we have a lot of experience with how the framework behaves at small and large scales and on a wide variety of hardware and usage patterns. I realize I am arguing from authority here, but the truth is that it is this experience which has led directly to the optimization decisions and tradeoffs accepted by Cats Effect and its surrounding ecosystem. It's hard to separate the state of any one specific aspect of the framework today from this vast historical context.

At any rate, if you want to dive deeper into any of this stuff, I would honestly recommend cracking open the source code! IOFiber is where the coroutine interpreter lives, and WorkStealingThreadPool (and WorkerThread) is the scheduler. There are TONS of comments on all three files and you'll find a lot of insight in there. You can also look at the benchmarks sub-project if you want to see some really artificial microbenchmarks covering a small set of cases. We also work closely with projects like Fs2 and Http4s, which have their own benchmarking methodologies, as well as community members who are able to share results from their own production applications. Much of this collaboration happens in Discord; you should feel free to ask more detailed questions there at any time!

7 replies

djspiewak Jan 2, 2025
Maintainer

The mention of Kyo here can give an impression that it's based on Loom, given that's the case for Ox, and that it's an order of magnitude or more slower than cats-effect's fibers. Could you clarify if that's what you meant?

Ah I see. I certainly didn't mean to imply that. I'm certainly aware that Kyo is not based on Loom.

The benchmarks I was loosely referencing were essentially the following. I recall you posting on Twitter some time ago a set of comparative benchmarks between Kyo at the time and Loom's virtual threads. There was a bunch of relevant handwave about the caveats of comparing apples and oranges, but the results showed that Kyo was significantly faster than virtual threads (iirc about 1-2 orders of magnitude), and at the time, Kyo and Cats Effect were roughly comparable in terms of straight-line performance on the types of measurements you did.

The Ox benchmarks were very different. There they compared STTP on Ox (so, virtual threads) vs STTP on Cats Effect and found a very similar order of magnitude disparity in performance as to what you found. So, extremely different methodology, but similar result. This to me suggests that, despite all the caveats of comparing these types of very different systems, at the very least the order of magnitude disparity is probably about right.

However, I doubt they will ever be able to exceed the performance of Cats Effect's coroutine layer, simply because what Cats Effect is doing is more or less already at the theoretical limit

What is the frame of reference here? Is the implication that no other effect system is able to reach or surpass cats-effect's coroutine/fiber performance?

We need to be a bit careful here, particularly since all of these systems are very very different in how they work, how users apply them, and what guarantees they do or don't offer. I generally try to split userspace runtimes into three layers:

Program Definition
Coroutine Interpreter
Scheduler

Not all systems have all three layers in clean delineation. Virtual threads punt the third layer to FJP, as did Cats Effect 1.x and 2.x (well, we used fixed thread executor, but same idea). Conversely, Scala's Future blurs together program definition and coroutine interpretation. Cats Effect and ZIO have very bright lines between these three layers, though that imposes some costs to go along with the benefits.

Anyway, to be more clear about definitions… Program definition is the API that users interact with. So for Cats Effect, that's IO and all its associated free monad ++ machinery. ZIO is almost identical in this respect. For Kyo, this would be the inline and opaque type DSL and its related harnesses. For virtual threads, this is bytecode itself. Note that flatMap makes reasoning about this stuff complicated, and so I generally think of programs as being defined up to a continuation boundary (i.e. the right side of the bind).

Coroutine interpretation is the actual execution of sequentialized instructions which have been preconstructed as program definitions. In Cats Effect, this is IOFiber and its associated machinery (e.g. ArrayStack). I'll talk more about this in a bit, but this layer tends to be highly undifferentiated because, really, there's kind of only one way to do it. IOFiber looks extremely similar to the Scheme and Standard ML interpreters from decades ago, as well as to the ZIO 1 fiber interpreter, etc etc. LLVM continuations also have a very similar internal mechanism.

Finally, the scheduler is where you take tasks and map them to compute resources, and in the case of Cats Effect (and Netty, and GoLang, etc), also where syscalls are brokered. At this layer, you no longer have coroutines, but tasks, which are generally timesliced sub-segments of coroutines (where "time" can be defined in many ways).

I like splitting up runtimes into these three semantic layers because it becomes a lot easier to talk about the pros and cons of different strategies in each one independently of the others. Kyo and Cats Effect choose a very different set of tradeoffs in their schedulers, for example, and we can discuss those tradeoffs entirely independently of the fact that we also chose different tradeoffs in the program definition layer (Kyo doesn't use a free monad suspension). Put another way, Kyo could just as easily have built a scheduler very similar to Cats Effect's (or even used Cats Effect's scheduler directly!) without making any changes in its program definition or coroutine interpreter. The inverse is likely also true.

Anyway, with that out of the way, comparing to virtual threads 1:1 is really hard because the only functionality it really provides is the coroutine interpreter, while also providing some first-class integration with the JVM such that program definition can be done via bytecode rather than a userspace data structure. (this is another good reason to keep these semantic layers distinct in such discussions) So in the bit that you quoted, I was referencing only the coroutine interpreter. For example, Kyo's program definition layer clearly imposes much less overhead that Cats Effect's free monad heap allocation stew, but that doesn't say anything about the coroutine interpretation.

On coroutine interpretation specifically, I'm very confident in saying that Cats Effect's implementation is basically optimal, and no one is really going to be able to do meaningfully better even with direct access to platform intrinsics. Put another way, the overhead that virtual threads currently impose vis a vis Cats Effect's coroutines will eventually go away, but it will never exceed (by a meaningful amount) Cats Effect's coroutine interpreter performance. Nor will Kyo or ZIO or LLVM or Rust… etc, because ultimately we're all bounded by the nature of the underlying hardware architecture.

This is also why I think the coroutine interpreter really isn't that interesting. :) Like, it's a fun bit of machinery and it's not super trivial to get it right, but it's also highly undifferentiated. We all do the same things here because we all do the optimal things. What's interesting is how we handle program definition and what tradeoffs we choose to make in the scheduling layer (which is to say, what sorts of applications and infrastructural environments we are choosing to optimize for).

fwbrasil Jan 2, 2025

There was a bunch of relevant handwave about the caveats of comparing apples and oranges, but the results showed that Kyo was significantly faster than virtual threads (iirc about 1-2 orders of magnitude), and at the time

Yes, Kyo's benchmark suite shows major performance issues in Loom with more heavyweight async execution. As you mentioned, I'd be careful to assume that'll be the case for long. The main bottleneck is currently the ForkJoinPool, which doesn't have good scalability.

Kyo and Cats Effect were roughly comparable in terms of straight-line performance on the types of measurements you did.

As far as I remember, Kyo's benchmark results consistently outperforms cats-effect since the very beginning and that continues to be the case. I won't be sharing the benchmark results given the past issues I had with the Typelevel community but I'd be more than happy to analyze benchmarks in case you have any evidence of the contrary. Link to Kyo's benchmarks: https://github.com/getkyo/kyo/tree/main/kyo-bench/src/main/scala/kyo/bench

Anyway, to be more clear about definitions… Program definition is the API that users interact with. So for Cats Effect, that's IO and all its associated free monad ++ machinery. ZIO is almost identical in this respect. For Kyo, this would be the inline and opaque type DSL and its related harnesses.

Can you clarify if you're attempting to place Kyo in a different category from cats-effect and ZIO? It's an effect system based on a monad and side effects are properly suspended like the other effect systems. I'd appreciate if you could share concrete examples in case you'd like to challenge that.

Kyo and Cats Effect choose a very different set of tradeoffs in their schedulers, for example, and we can discuss those tradeoffs entirely independently of the fact that we also chose different tradeoffs in the program definition layer (Kyo doesn't use a free monad suspension).

Again, I'd appreciate if you could avoid making assertions regarding Kyo. The interaction would be easier if you phrase these as questions. Kyo's monadic encoding is an innovation but has quite similar characteristics to free monads.

Put another way, Kyo could just as easily have built a scheduler very similar to Cats Effect's (or even used Cats Effect's scheduler directly!) without making any changes in its program definition or coroutine interpreter. The inverse is likely also true.

Yes, that's very true. We introduced an integration to enable using Kyo's scheduler in ZIO applications some time ago and have been observing positive results. I also recently worked on a similar integration for cats-effect. I'd encourage you giving it a try. The results might be surprising, especially in environments with the CPU quotas when the system is under pressure.

Nor will Kyo or ZIO or LLVM or Rust… etc, because ultimately we're all bounded by the nature of the underlying hardware architecture.

Having worked with performance engineering for a few years, I find this assertion odd. It seems you're claiming that IOFiber couldn't possibly be further optimized? Looking at the code, I see several optimization opportunities. I think your mental model might be skewed by isolated benchmarks. In those, it's quite easy for the JIT compiler to optimize the execution of IOFiber but, in more realistic workloads, the code becomes much more dynamic and hot call sites can't be optimized. I introduced getkyo/kyo#621 to populate the JIT profile in Kyo's benchmark suite some time ago. I'd encourage inspecting cats-effect's results with it on as there's a regression.

Kyo's equivalent of IOFiber would be IOTask. The implementation is better optimized with significantly smaller memory footprint. Another important differentiator is that, given the enhanced composability of algebraic effects, the fiber machinery has to handle only async boundaries. Other effects, including IO for side effects, are completely separate from the fiber machinery.

djspiewak Jan 2, 2025
Maintainer

You put a lot forward in your reply and I don't think it's that interesting to get drawn into a back and forth on comparative benchmarks, so just to briefly summarize a response, I have examined your benchmarks in the past and I dispute your conclusions. We have also done deeper analyses on more specific things that you've remarked upon as being flaws within the Cats Effect implementation (I vaguely remember a twitter thread from a year or two ago, but I could be misplacing the timeline) and found your conclusion was precisely the opposite of what the JIT actually does, which is in fact why we wrote the implementation the way we did.

Now, it's plausible that we are misinterpreting your results and/or your words, just as it's very plausible that you are misinterpreting our results and/or our words. Kyo and Cats Effect have very different design goals, which can lead to some confusion (such as when you noted the addition of syscalls in the fiber interpreter loop, which is a tradeoff we chose with intent and comes with other benefits you did not measure). I haven't looked at your scheduler layer, but based on how you described it above I suspect there, too, we simply have different design goals, resulting in different optima in the tradeoff space. For example, we're quite prescriptive about matching the number of physical threads, while it sounds like Kyo's scheduler is resilient to third-party thread contention at the cost of higher overhead. Another example would be that it sounds like we've gone a bit further in biasing towards throughput, while you're more aggressively biasing towards responsiveness (as you noted, similar to how the Linux scheduler is tuned). We made the intentional choice to not do this, and so we have respectively landed at different local optima.

We also come from different backgrounds and different sets of experiences. We've worked at different companies, on different physical infrastructure, and in the context of very different applications and usage patterns. This has (naturally) shaped our thinking in very different ways about how our environment is tuned and what assumptions are or are not valid. For example, we have already made some very specific tuning decisions around CPU shares, particularly in k8s and ECS environments, but we came to very different conclusions than you did. My assumption is that we've simply encountered different cluster topologies and diverse tuning within our physical infrastructure, which in turn leads us to different conclusions about how applications need to behave.

At any rate, to summarize, I'm very aware of your measurements and the conclusions you draw from them, but I draw different conclusions from your measurements just as you draw different conclusions (from our own) based on our measurements, and we both have strong reasoning underlying our positions.

Can you clarify if you're attempting to place Kyo in a different category from cats-effect and ZIO? It's an effect system based on a monad and side effects are properly suspended like the other effect systems. I'd appreciate if you could share concrete examples in case you'd like to challenge that.

ZIO and Cats Effect both use a free monad data structure for program definition (the hallmark here is the IO.FlatMap case class). This comes with a whole series of tradeoffs, with the most notable negative consequence being the performance consequences of the heap allocations. Kyo uses opaque types and inline functions to side-step this problem. It is, by definition, not a free monad, though it is clearly still a monad.

This is exactly my point: free monads are one choice for how one may implement the program definition layer of an async runtime. Netty's visitors are another choice. Kyo's tagless monad is a third choice. Etc etc.

So yes, Kyo is very definitely in a different category, and I believe you would agree this is by design.

Having worked with performance engineering for a few years, I find this assertion odd. It seems you're claiming that IOFiber couldn't possibly be further optimized?

I'm being a bit imprecise in my claim, since IOFiber contains quite a bit which is somewhat specific to IO's implementation. What I'm referring to is specifically the encoding of a tagged continuation stack on the heap using dynamic arrays. This encoding is precisely the same as what you'll find in LLVM if you look at its continuations (actually, their frame encoding is row-based rather than columnar, since they have access to true structs, but you can make some interesting arguments both ways about whether this is optimal). Rust's Future, Go's runtime, etc etc. Literally all of them use the same encoding, because you can't really do any better in a meaningful way. There are things you can shave off here and there, microoptimizations you can perform, but at the end of the day nothing in that layer is really going to matter once you get it to this point, and all of the well-optimized encodings are essentially identical. The real impactful bits are in the scheduler.

fwbrasil Jan 3, 2025

You put a lot forward in your reply and I don't think it's that interesting to get drawn into a back and forth on comparative benchmarks, so just to briefly summarize a response, I have examined your benchmarks in the past and I dispute your conclusions.

I understand you'd prefer to dismiss concrete evidence of cat-effect's shortcomings point blank without a more in-depth analysis. A proper dispute would require elaboration and counter proof, of which you provide none. That's especially odd as you try to allege that cat-effect has reached some imaginary "theoretical limit".

Typelevel at this point is well-known for its viceral reactions with wild allegations of doctored and misleading benchmarks. I'm not sharing the results because the last time I did so to correctly point a performance regression, I was ostracized with derogatory terms like 'sewage'. So yes, I'm also not very interested in interacting. I'd prefer you'd avoid mentioning Kyo or making assertions about it in case you're not willing to have a proper fact-based technical discussion.

found your conclusion was precisely the opposite of what the JIT actually does, which is in fact why we wrote the implementation the way we did.

That's interesting, I haven't seen the result of this analysis. I'd love to clarify if you're able to share it. In case you don't have experience with the internals of JIT compilation, I'd also be happy to show how to obtain JIT compilation graphs and reason about them.

Another example would be that it sounds like we've gone a bit further in biasing towards throughput, while you're more aggressively biasing towards responsiveness (as you noted, similar to how the Linux scheduler is tuned)

That doesn't seem a natural conclusion given the different characteristics of the async runtimes. Am I incorrect that cats-effect requires explicit cooperation between fibers to provide fairness? Do the queues have any prioritization mechanism, an essential aspect for fairness?

Kyo uses opaque types and inline functions to side-step this problem. It is, by definition, not a free monad, though it is clearly still a monad.

Yes, I agree Kyo isn't a regular encoding of free monads and that's by design. What I'd ask you is to avoid is trying to dismiss it by alleging that it's a solution in a different category from cats-effect. It's a common theme in my interactions with Typelevel. That's not the case, it's a functional effect system and a direct competitor.

At any rate, to summarize, I'm very aware of your measurements and the conclusions you draw from them, but I draw different conclusions from your measurements just as you draw different conclusions (from our own) based on our measurements, and we both have strong reasoning underlying our positions.

That's reasonable. I'd warn that any analysis you made in the past is likely not valid anymore. Kyo was essentially fully redesigned this past year.

djspiewak Jan 3, 2025
Maintainer

Am I incorrect that cats-effect requires explicit cooperation between fibers to provide fairness?

This is incorrect. You should look for references to auto-ceding, which is our term for fiber preemption.

Do the queues have any prioritization mechanism, an essential aspect for fairness?

There are several ways to define fairness, simply because there are multiple aspects to it (for example, simple FIFO ordering is, itself, one definition of fairness). Prioritization in the work queue is necessitated by one of them, but not all. For efficiency, we chose not to prioritize this form of fairness. Work-stealing as an algorithm already deprioritizes fairness in favor of rebalancing over workers. Empirically, this seems to be a reasonable choice in the tradeoff spectrum, and all other userspace runtimes which I have studied make the same choice.

The scheduling algorithm you describe for Kyo sounds unique, to my knowledge. I'm not aware of any other userspace runtime which does what you seem to be describing (though as you pointed out, it sounds a lot like Linux, which corresponds to the standard tradeoff optimization choices for non-realtime kernelspace schedulers). To be clear, that doesn't really make it wrong, it just means you're prioritizing different goals.

Yes, I agree Kyo isn't a regular encoding of free monads and that's by design.

Yes, we're saying the same thing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Benchmark #4221

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 7 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Performance Benchmark #4221

karpinskiJ Dec 31, 2024

Assumptions

Replies: 1 comment · 7 replies

djspiewak Dec 31, 2024 Maintainer

Assumptions

Scheduling

Runtime Layering

Summary

djspiewak Jan 2, 2025 Maintainer

fwbrasil Jan 2, 2025

djspiewak Jan 2, 2025 Maintainer

fwbrasil Jan 3, 2025

djspiewak Jan 3, 2025 Maintainer

karpinskiJ
Dec 31, 2024

Replies: 1 comment 7 replies

djspiewak
Dec 31, 2024
Maintainer

djspiewak Jan 2, 2025
Maintainer

djspiewak Jan 2, 2025
Maintainer

djspiewak Jan 3, 2025
Maintainer