Performance Benchmark #4221
-
Hello! My question is: has anyone performed that kind of benchmark?
Replies: 1 comment 7 replies
-
We've definitely done a ton of work on performance and have a vast array of measurements, but unfortunately it's a pretty broad topic that also involves a lot of subtle tradeoffs. For example, Linux sacrifices throughput performance (p50) on process and thread scheduling under load in order to maintain responsiveness (p99). Does that mean that Linux has worse performance than QNX? Well… yes, in a sense, but no one would recommend using QNX for a desktop (or even server) operating system! "Performance" is always relative to your target use-case and assumptions.

Assumptions

Taking a step back… Cats Effect on the JVM is optimized for long-lived backend network applications deployed in environments with many physical threads. "Many" in this case generally should be taken to mean "8 or more", but really Cats Effect thrives the more you scale it vertically. "Long-lived" should be taken to mean "at least several hours" (i.e. generally not serverless). "Networked" should be taken to mean that most application processing is oriented around data transformations originating from and routing to network socket I/O (i.e. you're not just loading a dataset into memory off disk and doing a ton of math on it). If you're outside the bounds of this target use-case, you may see good results from Cats Effect, but you also may not.

To reify these assumptions a bit, consider the network socket I/O constraint. Network I/O is a fundamentally slow thing (relative to node-local hardware), where a single packet moving between nodes usually takes something on the order of milliseconds at least, unless you're doing something specialized like InfiniBand (in which case… see my point about target use-cases again). If we're making the assumption that most processing begins and/or ends with network socket I/O (potentially involving even more of it along the way), then that sets the order of magnitude for what kind of performance is considered "fast" or "slow". Really, the socket is the speed of light for our application, and trying to go faster than it serves no purpose.

Now, there's some nuance to this, since backend network applications often manage many, many sockets in parallel (this is actually the whole point of asynchronous I/O), and so when you break it down, the hardware compute time slices for each socket are much smaller than what is allocated to the network itself. This is an important concern! However, there are other hard limits which still put a bound on the granularity of the timeslice: file handle limits. Each socket represents a pair of file handles, and these are a very limited resource on any system, but particularly those which are virtualized within container environments (e.g. k8s), since file handles are a shared pool across the entire host. This in turn means that the maximum number of concurrent sockets per process is actually relatively low (usually thousands or tens of thousands), which in turn compresses our speed of light from "order of magnitude milliseconds" to "order of magnitude microseconds".

But a microsecond is still a long, long time. On an older Intel i9 processor, Cats Effect's per-stage runtime overhead is measured in nanoseconds, a small fraction of that budget. All of this is to say that, in general, when it comes to straight-line performance, the bottleneck won't be Cats Effect itself.
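To make that order-of-magnitude compression concrete, here is a rough back-of-the-envelope sketch. Every number in it (round-trip latency, socket ceiling, thread count) is an illustrative assumption pulled from the reasoning above, not a measurement:

```scala
// Rough "speed of light" budget per socket, using purely illustrative numbers.
object SpeedOfLightBudget extends App {
  val roundTripMicros   = 1000.0  // assume ~1 ms of network latency per packet round trip
  val concurrentSockets = 10000.0 // assume a file-handle-bounded ceiling of ~10k sockets
  val physicalThreads   = 8.0     // the "8 or more" physical threads assumed above

  // If every socket stays busy and compute is spread evenly across the threads,
  // this is the approximate compute budget per socket per network round trip.
  val budgetMicros = roundTripMicros * physicalThreads / concurrentSockets

  println(f"~$budgetMicros%.1f microseconds of compute per socket per round trip")
}
```

Under those assumptions the per-socket compute budget lands at roughly a microsecond, which is exactly the compression described above.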
Scheduling

This is really only one piece of the puzzle though. Another major piece (which turns out to be a lot more impactful in practice) is scheduling overhead. When you're handling many thousands of concurrent connections simultaneously on a small number of physical threads, you need a way of multiplexing and juggling this work such that fairness is respected and an absolute minimum of overhead is imposed. I already alluded to how Linux's kernelspace scheduler makes some specific assumptions in how it tries to solve this problem, and it turns out that those assumptions (timeslice fairness) are a really poor fit for modern networked backend applications. The solution is to introduce a userspace scheduler which handles the multiplexing problem (also known as "m:n scheduling") in a way which biases the implementation more towards throughput and away from responsiveness. We can get away with this because the kernelspace scheduler will always step in if something needs to be truly preemptive, while allowing the userspace scheduler to carry the bulk of the work. In Cats Effect, as in ZIO, Tokio, GoLang, and others, this scheduling is implemented via a work-stealing algorithm, where the units of work are yield-sliced coroutines.

Scheduling is a really complex topic and we could rabbit hole on it for a long time, but very briefly, the core problem is to make sure that all the requisite compute work is completed in the lowest possible effective time (i.e. time spent by that work actually occupying a physical thread in the underlying hardware), the lowest possible wall-clock time (i.e. real-world time from initial scheduling to completion), and with intermediate progress at a regular rate. These goals are all fundamentally in tension with each other, and so the problem of scheduling is one of innate tradeoffs and assumptions. We choose the set of tradeoffs which optimize for networked backend applications.

Additionally, the problem of scheduling is tangled up with the problem of directly interacting with the OS kernel (syscall management) and interrupts (notably timers, but also other forms of hardware bus such as PCIe). The kernel needs to do work on our behalf in order for any of our userspace compute to matter, and that work takes time and resources which themselves need to be scheduled. These problems are inseparable because the kernel brokers the hardware interrupts on our behalf, as well as the program counter register on the CPU itself. Meaning, when we aren't under load, we need to suspend our kernel-level threads with a mechanism which is interrupt-aware, and when we are under load we need to make time to periodically cross into kernelspace and gain awareness of any outstanding events. The exact mechanics of this are highly OS, architecture, and even version dependent, but they have a massive impact on application performance, particularly in the area of cache management (especially L1 and L2) and kernelspace timeslice granularity. In practice, the difference between doing well at this problem and doing poorly at this problem is not measured in a few percentage points, but rather in multiple orders of magnitude. Put another way, optimizing the scheduling layer well buys far more than shaving a few nanoseconds off of straight-line compute ever could.
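As a minimal sketch of what "yield-sliced coroutines" look like from the user's side, consider the toy program below. The `crunch` function and the iteration counts are made up for illustration; the point is that every `flatMap` is a small task for the work-stealing scheduler, and `IO.cede` marks an explicit yield point (the runtime also inserts automatic yield points after a configurable number of stages, so the explicit one here is only to make the mechanism visible):

```scala
import cats.effect.{IO, IOApp}

object YieldSlicing extends IOApp.Simple {

  // Hypothetical CPU-bound step, standing in for real per-request work.
  def crunch(acc: Long): Long = acc * 31 + 7

  // Each flatMap stage is one small task. IO.cede suspends the fiber and
  // re-enqueues it, handing the physical thread back to the scheduler so
  // other fibers make regular progress even during CPU-heavy loops.
  def loop(i: Long, acc: Long): IO[Long] =
    if (i >= 1000000L) IO.pure(acc)
    else {
      val step = IO(crunch(acc)).flatMap(next => loop(i + 1, next))
      if (i % 10000L == 0L) IO.cede >> step else step
    }

  val run: IO[Unit] =
    loop(0L, 0L).flatMap(result => IO.println(s"done: $result"))
}
```

The granularity of these yield points is exactly the tradeoff that comes up again in the layering discussion below.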
Runtime Layering

Between the layer that users interact with when they write their programs (i.e. the IO type and its combinators) and the physical threads underneath, there are really two distinct runtime layers: the coroutine layer, which turns a sequential computation into a fiber that can be suspended and resumed, and the scheduler layer, which multiplexes those fibers onto physical threads and handles all of the kernel interaction described above. The coroutine layer is the only layer which is addressed by systems like Kotlin's coroutines, or Loom's virtual threads.

Focusing on Loom for a moment, a virtual thread basically causes the JVM to add a layer of lifting in between the machine execution and the userspace bytecode, with task boundaries at any blocking call (with a few exceptions, such as blocking inside a synchronized block or a native frame, where the carrier thread is pinned). At present, that lifting (copying stack frames to and from the heap whenever a virtual thread parks and resumes) carries noticeably more overhead than Cats Effect's flatMap-based coroutine encoding.

It is very likely that this disadvantage in coroutine interpretation is temporary. The engineers who work on the JVM are incredibly smart folks, and they're very well aware of all the prior art in this space. I have no doubt that they will optimize this layer quite heavily in time. However, I doubt they will ever be able to exceed the performance of Cats Effect's coroutine layer, simply because what Cats Effect is doing is more or less already at the theoretical limit, not just for how you would encode this functionality on the JVM, but for how you would encode this functionality on x86_64 and ARM hardware. The JVM's access to raw OS primitives and unsafe memory management is very unlikely to yield any material benefits in the fundamental coroutine state machine, and thus my expectation is that, within a few JDK versions, virtual threads will be roughly speaking as fast as Cats Effect's fibers, but not faster by any measurable amount.

Now, virtual threads (and Kotlin's coroutines) do have one significant straight-line performance advantage over Cats Effect's fibers, in that their task yield point boundaries are considerably coarser than Cats Effect's. What I mean by this is that in Cats Effect, it is idiomatic to write fairly fine-grained flatMap chains, so the tasks handed to the scheduler are small and yield points come frequently, whereas a virtual thread only reaches a task boundary when it hits a blocking call. This results in the classic "throughput vs fairness" tradeoff in the scheduler. With longer critical sections and larger tasks, each individual task is able to execute much faster on a stable working memory set, and there are fewer polymorphic and conditional jumps to confuse the JIT and the CPU's speculative execution, allowing for higher efficiency overall. However, this also means that tasks effectively "hog" the underlying physical resources and it becomes really difficult for the scheduler to help the user. Whether this is good or bad for performance is highly dependent on the problem space, but empirically, we've generally found that once tasks get about three to four orders of magnitude larger than the typical flatMap-delimited stage, the loss of fairness starts to outweigh the straight-line gains.

There are some additional ancillary considerations here, such as the fact that virtual threads do not help with the stack safety problem, cannot implement safe resource handling, and also cannot implement reliable asynchronous interruption semantics (e.g. timeouts).
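To illustrate those last two points, here is a minimal Cats Effect sketch of resource safety combined with asynchronous interruption. The `Conn` type and the durations are hypothetical, chosen purely for illustration; `Resource` and `timeout` are the relevant primitives:

```scala
import scala.concurrent.duration._
import cats.effect.{IO, IOApp, Resource}

object SafetySketch extends IOApp.Simple {

  // Hypothetical connection type and acquire/release actions, purely for illustration.
  final case class Conn(id: Int)

  val connection: Resource[IO, Conn] =
    Resource.make(IO.println("acquire") *> IO.pure(Conn(1)))(_ => IO.println("release"))

  val run: IO[Unit] =
    connection
      .use(conn => IO.println(s"using connection ${conn.id}") *> IO.sleep(10.seconds))
      .timeout(1.second)  // asynchronously cancels the sleeping fiber
      .handleErrorWith(e => IO.println(s"timed out: ${e.getClass.getSimpleName}"))
}
```

The timeout cancels the fiber while it is blocked in `IO.sleep`, and the `release` finalizer still runs, which is the combination of interruption and resource safety referenced above.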
Summary

Hopefully you're getting the idea that this is a really complex and nuanced space! :) At the end of the day, IMO, you're not asking the right question. It isn't meaningful to ask "what is the performance of Cats Effect?" any more than it is meaningful to ask "what is the performance of Scala?" Instead, you should ask "what is the performance of my application when built on top of Cats Effect?" But of course, that's not a question we can answer. You need to measure it for yourself. I realize that's a bit of a cop-out, but at the end of the day, that's really the only thing that matters: your application.

Anecdotally, I (and many other people) have been using Cats Effect and its predecessors in a lot of different production environments for well over a decade at this point. Major household-name products have been and continue to be built on top of it, as well as innumerable applications that are far less well known. Collectively, we have a lot of experience with how the framework behaves at small and large scales and on a wide variety of hardware and usage patterns.

I realize I am arguing from authority here, but the truth is that it is this experience which has led directly to the optimization decisions and tradeoffs accepted by Cats Effect and its surrounding ecosystem. It's hard to separate the state of any one specific aspect of the framework today from this vast historical context. At any rate, if you want to dive deeper into any of this stuff, I would honestly recommend cracking open the source code!