Reading larger JSON arrays #2677

Closed · philrz opened this issue May 6, 2021 · 1 comment · Fixed by #3123

philrz commented May 6, 2021

While drafting the article for brimdata/brimcap#72, I easily bumped into the limits of the JSON reader introduced in #2573.

Using the nfdump toolset, I created some NetFlow records based on my favorite 500 MB wrccdc pcap. nfdump's options include a -o json that outputs an array of JSON objects, but there's no NDJSON option. This ultimately led to the creation of the attached netflow.json.gz, which uncompresses to over 31 MB, enough to exceed the limits as of Zed commit dc82704b.

$ nfpcapd -r ~/wrccdc.pcap -l .

$ ls -l
total 90264
-rw-r--r--  1 phil  staff   5315472 May  5 18:04 nfcapd.201803231255
-rw-r--r--  1 phil  staff  40586076 May  5 18:04 nfcapd.201803231300

$ nfdump -r nfcapd.201803231255 -o json > netflow.json

$ zq -z -i json 'count()' netflow.json 
netflow.json: JSON input buffer size exceeded: 26214400

As mentioned above, the JSON is a single top-level array of objects, and at over 31 MB it exceeds the reader's 26214400-byte (25 MiB) input buffer:

[
{
        "type" : "FLOW",
        "sampled" : 0,
        "export_sysid" : 0,
        "t_first" : "2018-03-23T12:58:22.641",
        "t_last" : "2018-03-23T12:58:22.641",
        "proto" : 17,
        "src4_addr" : "10.0.0.100",
        "dst4_addr" : "10.47.2.154",
        "src_port" : 53,
        "dst_port" : 58331,
        "fwd_status" : 0,
        "tcp_flags" : "........",
        "src_tos" : 0,
        "in_packets" : 1,
        "in_bytes" : 313,
        "cli_latency" : 0.000000,
        "srv_latency" : 0.000000,
        "app_latency" : 0.000000,
        "label" : "<none>"
}
,
{
        "type" : "FLOW",
        "sampled" : 0,
        "export_sysid" : 0,
        "t_first" : "2018-03-23T12:58:22.641",
        "t_last" : "2018-03-23T12:58:22.641",
        "proto" : 17,
        "src4_addr" : "10.0.0.100",
        "dst4_addr" : "10.47.2.154",
        "src_port" : 53,
        "dst_port" : 58331,
        "fwd_status" : 0,
        "tcp_flags" : "........",
        "src_tos" : 0,
        "in_packets" : 1,
        "in_bytes" : 313,
        "cli_latency" : 0.000000,
        "srv_latency" : 0.000000,
        "app_latency" : 0.000000,
        "label" : "<none>"
}
,
...
]

@mccanne noted a couple of ways we could deal with this better:

  1. We could allow for a configurable limit that could be raised to accept bigger inputs.
  2. The reader could recognize when it's seeing an array of objects like this and read in a more stream-like fashion.

Given the care we've taken to minimize hard ceilings in the tools, I'd lean toward the second option.
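
For illustration only (this is not the actual zio/jsonio code, and the eventual fix may differ in detail), here's a minimal Go sketch of what the streaming approach in option 2 could look like, using the standard library's json.Decoder to consume the opening bracket as a token and then decode one object at a time. The hard-coded netflow.json filename is just the repro file from above:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("netflow.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)

	// Consume the opening '[' as a single token instead of decoding
	// the entire array into memory at once.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	// More reports whether another array element follows, so each
	// object is decoded independently of the total array size.
	var count int
	for dec.More() {
		var rec map[string]interface{}
		if err := dec.Decode(&rec); err != nil {
			log.Fatal(err)
		}
		count++
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("count:", count)
}

The design point is that the decoder holds only one array element at a time, so memory use tracks the largest element rather than the whole array, and no fixed input-buffer ceiling is needed.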

@philrz philrz added this to the Data MVP1 milestone May 12, 2021
@nwt nwt closed this as completed in #3123 Sep 29, 2021
@nwt nwt self-assigned this Sep 29, 2021
brim-bot pushed a commit to brimdata/brimcap that referenced this issue Sep 29, 2021
…o" by nwt

This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#3123, authored by @nwt,
has been merged.

decode top-level array incrementally in zio/jsonio

Reading a large, top-level JSON array with zio/jsonio.Reader is
impractical because it decodes the full input on the first call to its
Read method.  Decode top-level arrays incrementally instead.

Closes brimdata/super#2677.
brim-bot pushed a commit to brimdata/zui that referenced this issue Sep 29, 2021
…o" by nwt

This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#3123, authored by @nwt,
has been merged.

decode top-level array incrementally in zio/jsonio

Reading a large, top-level JSON array with zio/jsonio.Reader is
impractical because it decodes the full input on the first call to its
Read method.  Decode top-level arrays incrementally instead.

Closes brimdata/super#2677.

philrz commented Sep 29, 2021

Verified in Zed commit df490e3.

Repeating the original repro steps, the entire JSON array can now be read successfully, without any buffer-related errors.

$ zq -version
Version: v0.30.0-54-gdf490e38

$ zq -z -i json 'count()' netflow.json
{count:73821(uint64)}

Thanks @nwt!
