Add weejson-jsoniter-scala #105

htmldoug · 2021-12-19T07:32:37Z

Summary

Adds basic integration for the very fast jsoniter-scala parser.

Benchmarks

My benchmarks on doppelganger JSON files with the parsers integrated into weepickle (possibly suboptimally)
Official jsoniter-scala benchmarks using the JSON specialized macros instead of the weepickle macros

Optional Jar

Based on performance alone, I'd love to have this parser as weepickle's default UTF-8 implementation, but:

I'm not giving up jackson-core integration (particularly for dataformats-binary and dataformats-text).
Having two default parser dependencies is slightly "wat".
I'm not sure if jsoniter-scala v3 will use the same package names and cause us dependency hell. (The last major version release was in Nov 2019 and did not change package names.)

The README suggests as a feature:

Support of shading to another package for locking on a particular released version

We may want to go this route in the future, but for now, I'm adding the integration as an optional jar primarily for use by non-libraries at the absolute root of the dependency tree.

TODO

parse floats less horribly
update readme

htmldoug · 2022-03-09T05:48:48Z

.../src/main/scala/com/rallyhealth/weejson/v1/wee_jsoniter_scala/WeePickleJsonValueCodecs.scala

+        val cs = new String(in.readRawValAsBytes(), StandardCharsets.US_ASCII)
+
+        /**
+          * This regex performs rather badly, but gets the tests passing.
+          *
+          * We're looking for a value we can pass through the Visitor interface--
+          * either a primitive Double or a CharSequence representing a *valid*
+          * number conforming to https://datatracker.ietf.org/doc/html/rfc7159#page-6.
+          *
+          * `in.readRawValAsBytes()` does NOT do that validation. It will happily
+          * return a String of "------".
+          *
+          * `in.readBigDecimal(null).toString` is tempting, but will not provide the raw input.
+          * Instead, it transforms the input from "0.00000001" to "1.0E-8".
+          * This fails roundtrip tests.
+          *
+          * I tried combining the two approaches, `in.readBigDecimal(null)` for validation,
+          * then `in.rollbackToMark()` + `in.readRawValAsBytes()` to capture the raw input,
+          * but for a value like "1.0-----", `in.readBigDecimal(null)` will read "1.0",
+          * then `in.readRawValAsBytes()` will return the whole string, including the unwanted
+          * trailing hyphens.
+          *
+          */
+        require(ValidJsonNum.pattern.matcher(cs).matches(), "invalid number")
+        v.visitFloat64String(cs)
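
For readers following along, here is a minimal, standard-library-only sketch of the two ideas in the comment above: a regex for the RFC 7159 number grammar, and the reason a `BigDecimal` round trip loses the raw text. The definition of `ValidJsonNum` is not shown in this excerpt, so `ValidJsonNumSketch` and the helper names below are hypothetical stand-ins, not the PR's code.

```scala
import scala.util.matching.Regex

object NumberValidationSketch {
  // Hypothetical stand-in for the PR's ValidJsonNum (its real definition is not shown here).
  // RFC 7159 section 6 grammar: [ minus ] int [ frac ] [ exp ]
  val ValidJsonNumSketch: Regex =
    """-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?""".r

  // True only when the whole input is a single valid JSON number.
  def isValidJsonNumber(cs: CharSequence): Boolean =
    ValidJsonNumSketch.pattern.matcher(cs).matches()

  // Parsing validates the input, but toString renders a normalized form
  // rather than the characters that were originally read.
  def viaBigDecimal(raw: String): String = BigDecimal(raw).toString
}
```

With this sketch, `isValidJsonNumber("------")` and `isValidJsonNumber("1.0-----")` are both false, while `viaBigDecimal("0.00000001")` returns an exponent form that no longer equals the input, which is why a roundtrip test on the raw text would fail.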


@plokhotnyuk, I seem to be stuck here. Do you have any suggestions to achieve better performance than my hacky regex given the failed attempts in the comment above? Specifically, is there anything in JsonReader that can help me that I might have overlooked?

@htmldoug You can use some counters to count the number of valid chars, like in the following code snippet (BEWARE: it is not tested):

    private def parseNumber[J](
      in: JsonReader,
      v: Visitor[_, J]
    ): J = {
      in.setMark()
      var b = in.nextByte()
      var digits, index = 0
      var decIndex, expIndex = -1
      if (b == '-') {
        b = in.nextByte()
        index += 1
      }
      try {
        digits -= index
        while (b >= '0' && b <= '9') {
          b = in.nextByte()
          index += 1
        }
        digits += index
        if (b == '.') {
          decIndex = index
          b = in.nextByte()
          index += 1
        }
        digits -= index
        while (b >= '0' && b <= '9') {
          b = in.nextByte()
          index += 1
        }
        digits += index
        if ((b | 0x20) == 'e') {
          expIndex = index
          b = in.nextByte()
          index += 1
          if (b == '-' || b == '+') {
            b = in.nextByte()
            index += 1
          }
          while (b >= '0' && b <= '9') {
            b = in.nextByte()
            index += 1
          }
        }
      } catch {
        case _: JsonReaderException => // ignore the end of input error for now
      } finally in.rollbackToMark()
      if ((decIndex & expIndex) == -1) {
        if (digits < 19) v.visitInt64(in.readLong())
        else {
          val x = in.readBigInt(null)
          if (x.bitLength < 64) v.visitInt64(x.longValue)
          else v.visitFloat64StringParts(x.toString, -1, -1)
        }
      } else {
        val cs = new String(in.readRawValAsBytes(), StandardCharsets.US_ASCII)
        require(cs.length - 1 == index, "invalid number")
        v.visitFloat64String(cs)
      }
    }
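
The core idea of the snippet above (walk the number grammar, count the characters consumed, then compare that count with the length of the raw value) can be illustrated without a `JsonReader`. The following is a simplified sketch over a plain byte array, using only the standard library; `NumberWalkSketch`, `walkedLength`, and `isValidNumber` are hypothetical names. Unlike the snippet above, it uses a bounds check instead of catching an end-of-input exception, and it additionally rejects leading zeros and empty digit runs.

```scala
import java.nio.charset.StandardCharsets

object NumberWalkSketch {
  /** Length of the prefix of `bytes` that forms a valid JSON number, or 0 if there is none. */
  def walkedLength(bytes: Array[Byte]): Int = {
    var i = 0
    // Bounds-checked lookahead: -1 signals end of input, so no exception is needed.
    def peek: Int = if (i < bytes.length) bytes(i).toInt else -1
    // Advances past a run of ASCII digits and returns how many were consumed.
    def skipDigits(): Int = {
      val start = i
      while (peek >= '0' && peek <= '9') i += 1
      i - start
    }

    if (peek == '-') i += 1
    val intStart = i
    val intDigits = skipDigits()
    val leadingZero = intDigits > 1 && bytes(intStart) == '0'.toByte
    if (intDigits == 0 || leadingZero) 0
    else {
      if (peek == '.') {
        val mark = i
        i += 1
        if (skipDigits() == 0) i = mark // "1." is not a number: stop before the dot
      }
      if ((peek | 0x20) == 'e') {
        val mark = i
        i += 1
        if (peek == '-' || peek == '+') i += 1
        if (skipDigits() == 0) i = mark // "1e" is not a number: stop before the 'e'
      }
      i
    }
  }

  /** The raw value is valid only if the walk consumed every byte of it. */
  def isValidNumber(raw: String): Boolean = {
    val bytes = raw.getBytes(StandardCharsets.US_ASCII)
    bytes.nonEmpty && walkedLength(bytes) == bytes.length
  }
}
```

For an input such as "1.0-----" the walk stops after three characters while the raw value is eight characters long, so the mismatch flags the value as invalid without running a regex.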

Ooh, comparing the walked length vs the string.length is smart. I'll run with this implementation tomorrow. Thank you, @plokhotnyuk!

I tried something similar previously, but it wasn't giving me the throughput I expected. I just realized that most of the time is spent in endOfInputError > appendHexDump. "3.14 " parses 3x faster than "3.14"! I wonder if something similar could also be happening in BigDecimalReading, BigIntReading, etc.

@htmldoug You can parse an array of numbers to make your benchmark more realistic. I bet nobody uses JSON parsers to read a single number.

plokhotnyuk · 2023-03-07T10:48:22Z

@htmldoug I've added a missing method for checking the remaining bytes in the reader. You can now use it to parse numbers without needing to catch end-of-input errors, like here.

htmldoug force-pushed the v1-jsoniter-scala branch 3 times, most recently from 060184f to 4ff4710 on December 19, 2021 07:57

htmldoug force-pushed the v1-jsoniter-scala branch from 4ff4710 to 2e69f1b on December 28, 2021 01:42

htmldoug added 5 commits March 8, 2022 16:25

Add weejson-jsoniter-scala

52a40a2

Add passing parser tests.

a5b0d85

More tests, reject invalid numbers

fa87197

Simplify, but fails on invalid numbers

be6696d

Add regex to the parser to make the tests pass.

01f88f0

htmldoug force-pushed the v1-jsoniter-scala branch from 2e69f1b to 01f88f0 on March 9, 2022 05:14

htmldoug added 2 commits March 9, 2022 00:22

clean up formatting.

e99a975

scala 2.11 support

4d82ff5

htmldoug force-pushed the v1-jsoniter-scala branch from 515052e to 4d82ff5 on March 9, 2022 05:27

scala 2.13 support.

a12c795

htmldoug marked this pull request as ready for review March 9, 2022 05:41

Add parseNumberCounter. numbersoup test fails

a3d913d

htmldoug force-pushed the v1-jsoniter-scala branch from b649e46 to c66c734 on March 25, 2022 22:27

Remove regex. Tests pass

be795ee

htmldoug force-pushed the v1-jsoniter-scala branch from c66c734 to be795ee on March 25, 2022 22:32

htmldoug added 2 commits March 25, 2022 18:39

Disable hex dumps

434209e

Handle JsonWriterVisitor.visitBinary: offset, length

7ce2017

plokhotnyuk mentioned this pull request Mar 17, 2023

Performance Optimizations com-lihaoyi/upickle#467

Merged
