Performance Optimizations #467
Conversation
Review by @lefou @lolgab @plokhotnyuk if anyone wants to have a look through before I merge it. It's a bit of a messy PR, but a lot of it is pretty boilerplatey code, so hopefully not too dense. The test suite is green, the MiMa bin-compat checks pass, and the benchmarks seem to indicate good improvements.
A good improvement would be switching to JMH for benchmarking. It will prevent mistakes in measurements when following best practices from the JMH samples, and will give a good tool for measuring secondary metrics like other CPU events or JVM allocations (using built-in profilers).

@lihaoyi Would you consider an integration of the visitor API to work with jsoniter-scala-core? I'm ready to add any missing parsing or serialization methods to the jsoniter-scala-core API, as was done recently.
Yeah, using JMH would be nice. The current benchmark suite is pretty old. I haven't thought seriously about integrating with jsoniter. I've poked around (e.g. when looking for a double serializer to use), but it's a pretty intimidating codebase, so I didn't get as far as figuring out where the possible integration points are.
As a workaround, for the time being you can publish a snapshot version locally and reuse existing JMH benchmarks.

Here is an attempt to do that in weePickle's codebase. Probably with cooperation from both sides, we can achieve quite competitive solutions, like the jsoniter-scala-circe and play-json-jsoniter boosters.
@plokhotnyuk what's the story for JMH and Scala.js/Scala-Native? IIRC, one of the reasons I wrote my own benchmarking code in Scala is so I can easily re-use the benchmarks across platforms. Is JMH only for Scala-JVM, or is there some cross-platform support too?
For Scala Native, I have only rough measurements on validation of huge JSON files, using an application from the examples. AFAIK @armanbilge has started porting JMH to Scala for cross-platform support in his scala-jmh repo. I think the Scala community could sponsor his work on that.
// If we do not copy the data from `cs0` into our own local array before
// parsing it, we take a significant performance penalty in the
// `integers Read` benchmarks, but *only* when run together with the rest
// of the benchmarks! When `integers Read` is run alone, this does not
// happen, and it is presumably something to do with the JIT compiler.
//
// Since any real-world use case would exercise all sorts of code paths,
// it would more closely resemble the "all benchmarks together" case,
// and so we leave this copy in place to optimize performance
// for that scenario.
This is especially weird behavior that I don't understand. @plokhotnyuk, have you seen anything like this before in your own performance work? I'm aware of things like the JIT causing method calls to slow down when running a bunch of code paths causes call sites to become megamorphic, but I've never heard of JIT-related slowdowns when reading and writing entries directly into arrays like this. And yet, this behavior seems reproducible when I tried it on OpenJDK Corretto 11/17/19.
Both ways of testing are suitable for understanding how the parser works: the separated one shows how parsing of a given data type works in isolation, and the combined one shows more realistic scenarios, usually based on real-world samples.

From the start of development, jsoniter-scala has not had any predefined codecs, and it uses `private[this]` methods for generated codecs as much as possible, so that the type-class interface is generated only for the top-level class of the nested data structures.
@plokhotnyuk the unusual thing is that in both cases, the benchmark runs the parser in a tight single-threaded loop again and again in isolation (nothing else running in the JVM/OS).

The difference is that when the benchmark is run on a "clean" JVM right after startup, this code is fast. When it is run on a "dirty" JVM that has already run a bunch of other benchmarks on the same thread (but which have all since terminated), this code is slow. That's the part that's rather unusual.
This PR attempts to optimize the uPickle read/write logic while preserving binary compatibility.

Code Changes

- Generally avoid calling `BaseElemRenderer.getElemSafe` and `ElemBuilder.appendUnsafe` in hot paths. Instead, in both cases we try to call `requestUntilGetSafeIndex`/`ensureLength` and then read/write as many characters as we can directly on the byte/char array.
- Made `ElemBuilder.arr` and `ElemBuilder.length` public so that they can be used directly outside of `upickle.core`.
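The "ensure capacity once, then write directly" pattern described above can be sketched roughly as follows. `MiniBuilder` is a hypothetical stand-in for `upickle.core`'s `ElemBuilder`; the method names mirror the ones in the list but the implementation is illustrative only:

```scala
// Sketch of hoisting the capacity check out of the per-element hot path:
// one ensureLength call up front, then bounds-check-free appendUnsafe calls.
final class MiniBuilder(initialSize: Int = 16) {
  private var buf = new Array[Char](initialSize)
  private var len = 0

  def arr: Array[Char] = buf   // exposed directly, like ElemBuilder.arr
  def length: Int = len        // exposed directly, like ElemBuilder.length

  // Grow the buffer once so that `n` more chars are guaranteed to fit.
  def ensureLength(n: Int): Unit =
    if (len + n > buf.length)
      buf = java.util.Arrays.copyOf(buf, math.max(buf.length * 2, len + n))

  // Hot path: caller has already called ensureLength, so no bounds check.
  def appendUnsafe(c: Char): Unit = { buf(len) = c; len += 1 }

  def result(): String = new String(buf, 0, len)
}

def render(s: String, b: MiniBuilder): Unit = {
  b.ensureLength(s.length)     // one capacity check for the whole string
  var i = 0
  while (i < s.length) { b.appendUnsafe(s.charAt(i)); i += 1 }
}
```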
- Adjust `requestUntil0` to properly handle cases where `readDataIntoBuffer` returns `n = -1`.
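The `n = -1` case matters because `InputStream.read` returns `-1` at end of input, not `0`. A minimal sketch (not uPickle's actual `requestUntil0`) of a fill loop that must special-case it:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Illustrative sketch: fill `buf` with up to `need` bytes. A -1 return from
// read() means end-of-stream; naively doing `filled += n` with n = -1 would
// corrupt the offset, so it must be treated as a termination signal.
def readAtLeast(in: InputStream, buf: Array[Byte], need: Int): Int = {
  var filled = 0
  var eof = false
  while (filled < need && !eof) {
    val n = in.read(buf, filled, buf.length - filled)
    if (n == -1) eof = true   // end of input: stop, do NOT add n to filled
    else filled += n
  }
  filled                      // may be less than `need` if input ran out
}
```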
- Write integers directly to output without needing to `.toString` a temporary string.
- Read (most) integers directly from input without needing to `.toString` a temporary string.
- In `ArrayWriter` and `SeqLikeWriter`, lift `ctx.subVisitor` out of the loop since it never changes.
- Separated out the "fast-path" simple-ASCII-string loop from the "slow-path" escape/unicode loop during rendering and parsing.
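Writing integers without the `.toString` round trip can be sketched as below. This is a simplified illustration of the idea, not the PR's actual code: render the digits into a scratch area back-to-front, then copy them into the output array, so no intermediate `String` is allocated:

```scala
// Sketch: write an Int's decimal digits straight into a char array,
// returning the new offset. Avoids the temporary `.toString` String.
def writeInt(value: Int, out: Array[Char], offset: Int): Int = {
  var i = offset
  var v = value.toLong                  // widen so -Int.MinValue is safe
  if (v < 0) { out(i) = '-'; i += 1; v = -v }
  // Digits come out least-significant-first, so stage them in reverse.
  val tmp = new Array[Char](10)         // max 10 digits for a 32-bit Int
  var t = 0
  var more = true
  while (more) {
    tmp(t) = ('0' + (v % 10).toInt).toChar
    v /= 10
    t += 1
    more = v != 0
  }
  while (t > 0) { t -= 1; out(i) = tmp(t); i += 1 }
  i
}
```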
- Remove redundant `flushBuffer` and `flushElemBuilder` calls inside `visitFloat32`/`visitFloat64`.
- Fixed a bug in `WrapElemArrayCharSeq.charAt`, which was not correct when `Elem = Byte`.
- Added new `visitFloat64{Char,Byte}Parts` visitor methods, which are optimized versions of `visitFloat64StringParts`.
- Fix a bug in `StringWriter` where it was not accepting `visitFloat64StringParts` calls, even though it accepts all the other `visitFloat*` and `visitInt*` calls.
- Directly write unicode characters to UTF-8 bytes without going through an intermediate `makeString: String` and `.getBytes: Array[Byte]`.
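Encoding a character to UTF-8 bytes directly, without the `makeString`/`.getBytes` round trip, looks roughly like this. A simplified sketch of the standard UTF-8 byte layout, not uPickle's actual code; it handles 1- to 3-byte sequences only (surrogate pairs for code points above U+FFFF are omitted for brevity):

```scala
// Sketch: emit a BMP char's UTF-8 bytes straight into `out`, returning the
// new offset, so no intermediate String or Array[Byte] is allocated.
def writeUtf8(c: Char, out: Array[Byte], offset: Int): Int = {
  var i = offset
  if (c < 0x80) {                              // 1 byte: 0xxxxxxx
    out(i) = c.toByte; i += 1
  } else if (c < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
    out(i) = (0xC0 | (c >> 6)).toByte
    out(i + 1) = (0x80 | (c & 0x3F)).toByte
    i += 2
  } else {                                     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out(i) = (0xE0 | (c >> 12)).toByte
    out(i + 1) = (0x80 | ((c >> 6) & 0x3F)).toByte
    out(i + 2) = (0x80 | (c & 0x3F)).toByte
    i += 3
  }
  i
}
```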
- Vendored/patched the fast Schubfach Double/Float serialization code from jackson-core. This is similar to the version used in OpenJDK, but OpenJDK's is GPLed (rather than permissively licensed) and thus cannot be used. Furthermore, even if a future version of Java includes the fast double serializer, vendoring/patching it allows us to write the output directly to the output char/byte array, saving us an intermediate string and byte array. For now, this is just for Scala-JVM, to allow us to re-use the Java sources with minimal changes while still providing performance on the platform where it matters the most.
Testing Changes

- Added some micro-benchmarks to the `bench/` folder, to better tease out differences that only affect specific code paths.
- Moved slow unit tests into a dedicated `.testSlow` module, to get them out of the fast path.
- Added some fuzz tests reading/writing a variety of Strings/Numbers of different lengths, to try and catch any bugs that the trivial unit tests don't.
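The fuzz-testing approach mentioned above can be sketched as follows. This is an illustrative harness, not the PR's actual tests: generate many strings of varying length and content, and check a roundtrip property against a trusted oracle (here the JDK's UTF-8 codec; the real tests exercise uPickle's own readers/writers instead):

```scala
import scala.util.Random

// Sketch: seeded fuzz loop over strings of many lengths, mixing ASCII and
// non-ASCII BMP chars (avoiding unpaired surrogates), asserting that
// encode-then-decode roundtrips exactly.
def fuzzRoundtrip(iterations: Int, seed: Long): Unit = {
  val rng = new Random(seed)                 // seeded, so failures reproduce
  for (_ <- 0 until iterations) {
    val len = rng.nextInt(64)                // exercise short and long inputs
    val s = Array.fill(len) {
      if (rng.nextBoolean()) (rng.nextInt(0x7F - 0x20) + 0x20).toChar // ASCII
      else (rng.nextInt(0x2000) + 0xA0).toChar // non-ASCII, below surrogates
    }.mkString
    val bytes = s.getBytes("UTF-8")
    assert(new String(bytes, "UTF-8") == s, s"roundtrip failed for: $s")
  }
}
```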
Benchmarks

Run via `git checkout optimize && ./mill bench.jvm.run && git checkout main && git checkout optimize bench && ./mill bench.jvm.run` on my M1 Macbook Pro, Scala 2.13.10, Java 11.

- "macro" benchmarks (`upickleDefault Read/Write`) show 16-33% speedups when handling JSON.
- "micro" benchmarks show large speedups in `integers Write`, and in the string-handling operations in general.
- The micro-benchmarks show a minor across-the-board speedup, across all benchmarks, that seems evident despite the random variation.

Main -> PR (higher is better)

As of 66d9af8
Notes

- A lot of function signatures changed or were moved around. I left `@deprecated` forwarder methods where necessary to keep binary compatibility.
- For now, some possible performance gains are still left on the table: custom double parsing, vectorized parsing, etc.
- For now, most changes are pretty localized. There are probably some optimization possibilities that would involve larger changes to the core interfaces.
- The performance effects of changes can be pretty surprising. A number of things that I thought would speed things up actually slowed things down (e.g. making `visitFloat64StringParts`/`visitString` take `Array[Byte/Char]` rather than `CharSequence`, or making `escapeByte`'s String -> UTF-8 bytes conversion direct rather than through an intermediate `CharBuilder`).

Per-commit rough benchmarks
5-second-long rough benchmarks on each significant commit, to see the numbers changing over time.