Reading larger JSON arrays #2677

Closed · philrz opened this issue May 6, 2021 · 1 comment · Fixed by #3123

philrz commented May 6, 2021

While drafting the article for brimdata/brimcap#72, I easily bumped into the limits of the JSON reader introduced in #2573.

Using the nfdump toolset, I created some NetFlow records based on my favorite 500 MB wrccdc pcap. nfdump's options include a -o json that outputs an array of JSON objects, but there's no NDJSON option. This ultimately led to the creation of the attached netflow.json.gz, which uncompresses to over 31 MB, enough to exceed the limits as of Zed commit dc82704b.

$ nfpcapd -r ~/wrccdc.pcap -l .

$ ls -l
total 90264
-rw-r--r--  1 phil  staff   5315472 May  5 18:04 nfcapd.201803231255
-rw-r--r--  1 phil  staff  40586076 May  5 18:04 nfcapd.201803231300

$ nfdump -r nfcapd.201803231255 -o json > netflow.json

$ zq -z -i json 'count()' netflow.json 
netflow.json: JSON input buffer size exceeded: 26214400

As mentioned above, the JSON is a single top-level array of objects, and at over 31 MB it exceeds the reader's 26214400-byte (25 MiB) input buffer:

[
{
        "type" : "FLOW",
        "sampled" : 0,
        "export_sysid" : 0,
        "t_first" : "2018-03-23T12:58:22.641",
        "t_last" : "2018-03-23T12:58:22.641",
        "proto" : 17,
        "src4_addr" : "10.0.0.100",
        "dst4_addr" : "10.47.2.154",
        "src_port" : 53,
        "dst_port" : 58331,
        "fwd_status" : 0,
        "tcp_flags" : "........",
        "src_tos" : 0,
        "in_packets" : 1,
        "in_bytes" : 313,
        "cli_latency" : 0.000000,
        "srv_latency" : 0.000000,
        "app_latency" : 0.000000,
        "label" : "<none>"
}
,
{
        "type" : "FLOW",
        "sampled" : 0,
        "export_sysid" : 0,
        "t_first" : "2018-03-23T12:58:22.641",
        "t_last" : "2018-03-23T12:58:22.641",
        "proto" : 17,
        "src4_addr" : "10.0.0.100",
        "dst4_addr" : "10.47.2.154",
        "src_port" : 53,
        "dst_port" : 58331,
        "fwd_status" : 0,
        "tcp_flags" : "........",
        "src_tos" : 0,
        "in_packets" : 1,
        "in_bytes" : 313,
        "cli_latency" : 0.000000,
        "srv_latency" : 0.000000,
        "app_latency" : 0.000000,
        "label" : "<none>"
}
,
...
]

@mccanne noted a couple of ways we could deal with this better:

  1. We could allow for a configurable limit that could be raised to accept bigger inputs.
  2. The reader could recognize when it's seeing an array of objects like this and read in a more stream-like fashion.

Given the care we've taken to minimize hard ceilings in the tools, I'd lean toward the second option.
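
For illustration only (this is not the actual zio/jsonio code, and the eventual fix may differ in detail), here's a minimal Go sketch of what the streaming approach in option 2 could look like, using the standard library's json.Decoder to consume the opening bracket as a token and then decode one object at a time. The hard-coded netflow.json filename is just the repro file from above:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("netflow.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := json.NewDecoder(f)

	// Consume the opening '[' as a single token instead of decoding
	// the entire array into memory at once.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}

	// More reports whether another array element follows, so each
	// object is decoded independently of the total array size.
	var count int
	for dec.More() {
		var rec map[string]interface{}
		if err := dec.Decode(&rec); err != nil {
			log.Fatal(err)
		}
		count++
	}

	// Consume the closing ']'.
	if _, err := dec.Token(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("count:", count)
}

The design point is that the decoder holds only one array element at a time, so memory use tracks the largest element rather than the whole array, and no fixed input-buffer ceiling is needed.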

@philrz philrz added this to the Data MVP1 milestone May 12, 2021
@nwt nwt closed this as completed in #3123 Sep 29, 2021
@nwt nwt self-assigned this Sep 29, 2021
brim-bot pushed a commit to brimdata/brimcap that referenced this issue Sep 29, 2021
…o" by nwt

This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#3123, authored by @nwt,
has been merged.

decode top-level array incrementally in zio/jsonio

Reading a large, top-level JSON array with zio/jsonio.Reader is
impractical because it decodes the full input on the first call to its
Read method.  Decode top-level arrays incrementally instead.

Closes brimdata/super#2677.
brim-bot pushed a commit to brimdata/zui that referenced this issue Sep 29, 2021
…o" by nwt

This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#3123, authored by @nwt,
has been merged.

decode top-level array incrementally in zio/jsonio

Reading a large, top-level JSON array with zio/jsonio.Reader is
impractical because it decodes the full input on the first call to its
Read method.  Decode top-level arrays incrementally instead.

Closes brimdata/super#2677.

philrz commented Sep 29, 2021

Verified in Zed commit df490e3.

Repeating the original repro steps, the entire JSON array can now be read successfully, without any buffer-related errors.

$ zq -version
Version: v0.30.0-54-gdf490e38

$ zq -z -i json 'count()' netflow.json
{count:73821(uint64)}

Thanks @nwt!
