Performance Optimizations #467
Conversation
Review by @lefou @lolgab @plokhotnyuk if anyone wants to have a look through before I merge it. It's a bit of a messy PR, but a lot of it is pretty boilerplatey code, so hopefully not too dense. The test suite is green, the MiMa bin-compat checks pass, and the benchmarks seem to indicate good improvements.
A good improvement would be switching to JMH for benchmarking. It will prevent mistakes in measurements when following best practices from the JMH samples, and will give a good tool for measuring secondary metrics like other CPU events or JVM allocations (using built-in profilers).

@lihaoyi Would you consider an integration of the visitor API to work with jsoniter-scala-core? I'm ready to add any missing parsing or serialization methods to the jsoniter-scala-core API, as was done recently.
Yeah, using JMH would be nice. The current benchmark suite is pretty old. I haven't thought seriously about integrating with jsoniter. I've poked around (e.g. when looking for a double serializer to use), but it's a pretty intimidating codebase, so I didn't get as far as figuring out where the possible integration points are.
As a workaround, for the time being you can publish a snapshot version locally and reuse existing JMH benchmarks.

Here is an attempt to do that in weePickle's codebase. Probably with cooperation from both sides, we can achieve quite competitive solutions, like the jsoniter-scala-circe and play-json-jsoniter boosters.
@plokhotnyuk what's the story for JMH and Scala.js/Scala-Native? IIRC, one of the reasons I wrote my own benchmarking code in Scala is so I can easily re-use the benchmarks across platforms. Is JMH only for Scala-JVM, or is there some cross-platform support too?
For Scala Native, I have only rough measurements on validation of huge JSON files, using an application from the examples. AFAIK @armanbilge has started porting JMH to Scala for cross-platform support in his scala-jmh repo. I think the Scala community could sponsor his work on that.
// If we do not copy the data from `cs0` into our own local array before
// parsing it, we take a significant performance penalty in the
// `integers Read` benchmarks, but *only* when run together with the rest
// of the benchmarks! When `integers Read` is run alone, this does not
// happen, and it is presumably something to do with the JIT compiler.
//
// Since any real-world use case would exercise all sorts of code paths,
// it would more closely resemble the "all benchmarks together" case,
// and so we leave this copy in place to optimize performance
// for that scenario.
This is especially weird behavior that I don't understand. @plokhotnyuk, have you seen anything like this before in your own performance work? I'm aware of things like the JIT causing method calls to slow down when running a bunch of code paths causes call sites to become megamorphic, but I've never heard of JIT-related slowdowns when reading and writing entries directly into arrays like this. And yet, this behavior seems reproducible when I tried it on OpenJDK Corretto 11/17/19.
Both ways of testing are suitable for understanding how the parser works: the separated one shows how parsing of a given data type works in isolation, and the combined one shows more realistic scenarios, usually based on real-world samples.

From the start of development, jsoniter-scala has not had any predefined codecs, and it uses `private[this]` methods for generated codecs as much as possible, so that the type-class interface is generated only for the top-level class of the nested data structures.
@plokhotnyuk the unusual thing is that in both cases, the benchmark runs the parser in a tight single-threaded loop again and again in isolation (nothing else running in the JVM/OS).

The difference is that when the benchmark is run on a "clean" JVM right after startup, this code is fast. When it is run on a "dirty" JVM that has already run a bunch of other benchmarks on the same thread (but which have all since terminated), this code is slow. That's the part that's rather unusual.
This PR attempts to optimize the uPickle read/write logic while preserving binary compatibility.

Code Changes

- Generally avoid calling `BaseElemRenderer.getElemSafe` and `ElemBuilder.appendUnsafe` in hot paths. Instead, in both cases we try to call `requestUntilGetSafeIndex`/`ensureLength` and then read/write as many characters as we can directly on the byte/char array.
- Made `ElemBuilder.arr` and `ElemBuilder.length` public so that they can be used directly outside of `upickle.core`.
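The "ensure capacity once, then write directly" pattern described above can be sketched roughly as follows. `MiniBuilder` is a hypothetical stand-in for `upickle.core`'s `ElemBuilder`; the method names mirror the ones in the list but the implementation is illustrative only:

```scala
// Sketch of hoisting the capacity check out of the per-element hot path:
// one ensureLength call up front, then bounds-check-free appendUnsafe calls.
final class MiniBuilder(initialSize: Int = 16) {
  private var buf = new Array[Char](initialSize)
  private var len = 0

  def arr: Array[Char] = buf   // exposed directly, like ElemBuilder.arr
  def length: Int = len        // exposed directly, like ElemBuilder.length

  // Grow the buffer once so that `n` more chars are guaranteed to fit.
  def ensureLength(n: Int): Unit =
    if (len + n > buf.length)
      buf = java.util.Arrays.copyOf(buf, math.max(buf.length * 2, len + n))

  // Hot path: caller has already called ensureLength, so no bounds check.
  def appendUnsafe(c: Char): Unit = { buf(len) = c; len += 1 }

  def result(): String = new String(buf, 0, len)
}

def render(s: String, b: MiniBuilder): Unit = {
  b.ensureLength(s.length)     // one capacity check for the whole string
  var i = 0
  while (i < s.length) { b.appendUnsafe(s.charAt(i)); i += 1 }
}
```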
- Adjust `requestUntil0` to properly handle cases where `readDataIntoBuffer` returns `n = -1`.
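The `n = -1` case matters because `InputStream.read` returns `-1` at end of input, not `0`. A minimal sketch (not uPickle's actual `requestUntil0`) of a fill loop that must special-case it:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Illustrative sketch: fill `buf` with up to `need` bytes. A -1 return from
// read() means end-of-stream; naively doing `filled += n` with n = -1 would
// corrupt the offset, so it must be treated as a termination signal.
def readAtLeast(in: InputStream, buf: Array[Byte], need: Int): Int = {
  var filled = 0
  var eof = false
  while (filled < need && !eof) {
    val n = in.read(buf, filled, buf.length - filled)
    if (n == -1) eof = true   // end of input: stop, do NOT add n to filled
    else filled += n
  }
  filled                      // may be less than `need` if input ran out
}
```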
- Write integers directly to output without needing to `.toString` a temporary string.
- Read (most) integers directly from input without needing to `.toString` a temporary string.
- In `ArrayWriter` and `SeqLikeWriter`, lift `ctx.subVisitor` out of the loop since it never changes.
- Separated out the "fast-path" simple-ASCII-string loop from the "slow-path" escape/unicode loop during rendering and parsing.
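Writing integers without the `.toString` round trip can be sketched as below. This is a simplified illustration of the idea, not the PR's actual code: render the digits into a scratch area back-to-front, then copy them into the output array, so no intermediate `String` is allocated:

```scala
// Sketch: write an Int's decimal digits straight into a char array,
// returning the new offset. Avoids the temporary `.toString` String.
def writeInt(value: Int, out: Array[Char], offset: Int): Int = {
  var i = offset
  var v = value.toLong                  // widen so -Int.MinValue is safe
  if (v < 0) { out(i) = '-'; i += 1; v = -v }
  // Digits come out least-significant-first, so stage them in reverse.
  val tmp = new Array[Char](10)         // max 10 digits for a 32-bit Int
  var t = 0
  var more = true
  while (more) {
    tmp(t) = ('0' + (v % 10).toInt).toChar
    v /= 10
    t += 1
    more = v != 0
  }
  while (t > 0) { t -= 1; out(i) = tmp(t); i += 1 }
  i
}
```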
- Remove redundant `flushBuffer` and `flushElemBuilder` calls inside `visitFloat32`/`visitFloat64`.
- Fixed a bug in `WrapElemArrayCharSeq.charAt`, which was not correct when `Elem = Byte`.
- Added new `visitFloat64{Char,Byte}Parts` visitor methods, which are optimized versions of `visitFloat64StringParts`.
- Fix a bug in `StringWriter` where it was not accepting `visitFloat64StringParts` calls, even though it accepts all the other `visitFloat*` and `visitInt*` calls.
- Directly write unicode characters to UTF-8 bytes without going through an intermediate `makeString: String` and `.getBytes: Array[Byte]`.
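Encoding a character to UTF-8 bytes directly, without the `makeString`/`.getBytes` round trip, looks roughly like this. A simplified sketch of the standard UTF-8 byte layout, not uPickle's actual code; it handles 1- to 3-byte sequences only (surrogate pairs for code points above U+FFFF are omitted for brevity):

```scala
// Sketch: emit a BMP char's UTF-8 bytes straight into `out`, returning the
// new offset, so no intermediate String or Array[Byte] is allocated.
def writeUtf8(c: Char, out: Array[Byte], offset: Int): Int = {
  var i = offset
  if (c < 0x80) {                              // 1 byte: 0xxxxxxx
    out(i) = c.toByte; i += 1
  } else if (c < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
    out(i) = (0xC0 | (c >> 6)).toByte
    out(i + 1) = (0x80 | (c & 0x3F)).toByte
    i += 2
  } else {                                     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out(i) = (0xE0 | (c >> 12)).toByte
    out(i + 1) = (0x80 | ((c >> 6) & 0x3F)).toByte
    out(i + 2) = (0x80 | (c & 0x3F)).toByte
    i += 3
  }
  i
}
```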
- Vendored/patched the fast Schubfach Double/Float serialization code from jackson-core. This is similar to the version used in OpenJDK, but OpenJDK's is GPLed (rather than permissively licensed) and thus cannot be used. Furthermore, even if a future version of Java includes the fast double serializer, vendoring/patching it allows us to write the output directly to the output char/byte array, saving us an intermediate string and byte array. For now, this is just for Scala-JVM, to allow us to re-use the Java sources with minimal changes while still providing performance on the platform where it matters the most.
Testing Changes

- Added some micro-benchmarks to the `bench/` folder, to better tease out differences that only affect specific code paths.
- Moved slow unit tests into a dedicated `.testSlow` module, to get them out of the fast path.
- Added some fuzz tests reading/writing a variety of Strings/Numbers of different lengths, to try and catch any bugs that the trivial unit tests don't.
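The fuzz-testing approach mentioned above can be sketched as follows. This is an illustrative harness, not the PR's actual tests: generate many strings of varying length and content, and check a roundtrip property against a trusted oracle (here the JDK's UTF-8 codec; the real tests exercise uPickle's own readers/writers instead):

```scala
import scala.util.Random

// Sketch: seeded fuzz loop over strings of many lengths, mixing ASCII and
// non-ASCII BMP chars (avoiding unpaired surrogates), asserting that
// encode-then-decode roundtrips exactly.
def fuzzRoundtrip(iterations: Int, seed: Long): Unit = {
  val rng = new Random(seed)                 // seeded, so failures reproduce
  for (_ <- 0 until iterations) {
    val len = rng.nextInt(64)                // exercise short and long inputs
    val s = Array.fill(len) {
      if (rng.nextBoolean()) (rng.nextInt(0x7F - 0x20) + 0x20).toChar // ASCII
      else (rng.nextInt(0x2000) + 0xA0).toChar // non-ASCII, below surrogates
    }.mkString
    val bytes = s.getBytes("UTF-8")
    assert(new String(bytes, "UTF-8") == s, s"roundtrip failed for: $s")
  }
}
```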
Benchmarks

Run via `git checkout optimize && ./mill bench.jvm.run && git checkout main && git checkout optimize bench && ./mill bench.jvm.run` on my M1 Macbook Pro, Scala 2.13.10, Java 11.

- "macro" benchmarks (`upickleDefault Read/Write`) show 16-33% speedups when handling JSON.
- "micro" benchmarks show large speedups in `integers Write`, and in the string-handling operations in general.
- The micro-benchmarks show a minor across-the-board speedup, across all benchmarks, that seems evident despite the random variation.

Main -> PR (higher is better)

As of 66d9af8
Notes

- A lot of function signatures changed or were moved around. I left `@deprecated` forwarder methods where necessary to keep binary compatibility.
- For now, some possible performance gains are still left on the table: custom double parsing, vectorized parsing, etc.
- For now, most changes are pretty localized. There are probably some optimization possibilities that would involve larger changes to the core interfaces.
- The performance effects of changes can be pretty surprising. A number of things that I thought would speed things up actually slowed things down (e.g. making `visitFloat64StringParts`/`visitString` take `Array[Byte/Char]` rather than `CharSequence`, or making `escapeByte`'s String -> UTF-8 bytes conversion direct rather than through an intermediate `CharBuilder`).

Per-commit rough benchmarks
5-second-long rough benchmarks on each significant commit, to see the numbers changing over time.