Handling JSON-compatible pseudo-maps #4565

Open
philrz opened this issue Apr 30, 2023 · 1 comment · Fixed by #4589

philrz commented Apr 30, 2023

Repro is with Zed commit 4ffdf3e.

Consider again the jq example with test data scan.json.gz shown at the top of #4555 and the Zed query that tries to come close to it.

$ jq --version
jq-1.6

$ jq 'reduce .[] as $e ({}; . + { ($e.ip): (.[$e.ip] + $e.ports) })' scan.json
{
  "192.168.5.53": [
    {
      "port": 515,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    },
    {
      "port": 21,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    }
  ],
  "192.168.5.49": [
    {
      "port": 49152,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    },
    {
      "port": 3401,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    }
  ]
}

$ zq -version
Version: v1.7.0-55-g4ffdf3ef

$ zq -Z 'over this | ports:=collect(ports[0]) by ip | unflatten([{key:ip,value:ports}])' scan.json
{
    "192.168.5.53": [
        {
            port: 515,
            proto: "tcp",
            status: "open",
            reason: "syn-ack",
            ttl: 64
        },
        {
            port: 21,
            proto: "tcp",
            status: "open",
            reason: "syn-ack",
            ttl: 64
        }
    ]
}
{
    "192.168.5.49": [
        {
            port: 49152,
            proto: "tcp",
            status: "open",
            reason: "syn-ack",
            ttl: 64
        },
        {
            port: 3401,
            proto: "tcp",
            status: "open",
            reason: "syn-ack",
            ttl: 64
        }
    ]
}

A subtle difference is that the zq output is two separate records while the jq output is a single JSON object. There's nothing about the JSON object output by jq that says it must ultimately be used like a map, but docs like Go's json.Marshal() (and the equivalent in other languages) set expectations about bouncing between this JSON representation and a true map within a language.
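
To make the difference concrete, here is a minimal Python sketch of the reduction the jq program performs: every element is folded into one dictionary keyed by ip, so the result serializes as a single JSON object rather than a stream of records. The sample rows are made up and only mirror the shape implied by the queries above (an "ip" string plus a "ports" array), and group_ports is a hypothetical name.

```python
import json

# Made-up rows mirroring the shape implied by the jq and zq queries:
# each element has an "ip" string and a "ports" array.
scan = [
    {"ip": "192.168.5.53", "ports": [{"port": 515, "proto": "tcp"}]},
    {"ip": "192.168.5.49", "ports": [{"port": 49152, "proto": "tcp"}]},
    {"ip": "192.168.5.53", "ports": [{"port": 21, "proto": "tcp"}]},
    {"ip": "192.168.5.49", "ports": [{"port": 3401, "proto": "tcp"}]},
]


def group_ports(rows):
    """Fold rows into one dict: ip -> concatenated list of port records."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["ip"], []).extend(row["ports"])
    return grouped


result = group_ports(scan)
# A dict with string keys round-trips through JSON as a single object.
print(json.dumps(result, indent=2))
```

Because the dictionary has only string keys, it maps directly onto a JSON object, which is the expectation the json.Marshal() style of documentation sets.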

Imagining I was a user insistent on matching the jq output, I struggled to replicate it with zq. Zed's collect() and union() will consolidate values, but they produce an array or set, respectively. Zed's map type seems the likely match, but once it's assembled and output as JSON, it comes out with named key and value fields, which once again does not match what we saw with jq.

$ zq -Z 'over this | ports:=collect(ports[0]) by ip' scan.json | zq -j 'collect_map(|{ip:ports}|)' - | zq -Z -
[
    {
        key: "192.168.5.49",
        value: [
            {
                port: 49152,
                proto: "tcp",
                status: "open",
                reason: "syn-ack",
                ttl: 64
            },
            {
                port: 3401,
                proto: "tcp",
                status: "open",
                reason: "syn-ack",
                ttl: 64
            }
        ]
    },
    {
        key: "192.168.5.53",
        value: [
            {
                port: 515,
                proto: "tcp",
                status: "open",
                reason: "syn-ack",
                ttl: 64
            },
            {
                port: 21,
                proto: "tcp",
                status: "open",
                reason: "syn-ack",
                ttl: 64
            }
        ]
    }
]

The team did eventually come up with this approach that matches the jq output, but it's probably not something we can expect our current user base to come up with on their own. (Output has been pretty-printed for readability.)

$ zq -j 'over this | value:=collect(ports[0]) by key:=ip | collect(this) | unflatten(this)' scan.json
{
  "192.168.5.53": [
    {
      "port": 515,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    },
    {
      "port": 21,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    }
  ],
  "192.168.5.49": [
    {
      "port": 49152,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    },
    {
      "port": 3401,
      "proto": "tcp",
      "status": "open",
      "reason": "syn-ack",
      "ttl": 64
    }
  ]
}
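
As a rough illustration of what the final collect(this) | unflatten(this) steps accomplish, the Python sketch below turns a list of key/value records into a single object. It assumes plain string keys and ignores nested key paths; unflatten_entries is a hypothetical helper, not part of Zed or any library.

```python
def unflatten_entries(entries):
    """Build one dict from a list of {"key": ..., "value": ...} records.

    Assumes every key is a plain string; nested key paths are out of scope.
    """
    record = {}
    for entry in entries:
        record[entry["key"]] = entry["value"]
    return record


# Shape produced by grouping with "value:=... by key:=..." and collecting.
entries = [
    {"key": "192.168.5.53", "value": [{"port": 515}, {"port": 21}]},
    {"key": "192.168.5.49", "value": [{"port": 49152}, {"port": 3401}]},
]
record = unflatten_entries(entries)
print(record)
```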

Unless I'm missing something obvious, it does seem like we could stand to improve here. A few ideas:

  1. Instead of always using the key / value representation, perhaps the JSON output of a Zed map could instead be in this pseudo-map object form. If something about the data prevents it from being output that way (e.g., keys that aren't simple strings and can't easily be cast into them), that could surface an error so the user knows they'd need to transform the data. (Update: Since this issue was opened, "Improve JSON output for Zed maps" #4589 has covered this.)

  2. If we offer that and a user then relies on having access to the key / value representation to do their transforms, perhaps the flatten() function could be enhanced to accept map types as input, or a separate function could handle this flattening of maps.

  3. Once it becomes easier to write this kind of output as JSON, I imagine we'll want ways to easily read it back in as map types. At the moment, the way I can see to turn such a pseudo-map back into a Zed map is over this | collect_map(|{key[0]:value}|), but to make this easier perhaps the collect_map() aggregate function could be enhanced to directly accept a record as input, with handling similar to Go's json.Unmarshal().
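
Ideas 1 and 3 can be sketched in Python as a pair of functions: one that writes a map as a JSON object and raises an error when a key is not a string, and one that reads such an object back into a map. This is only an illustration of the proposed behavior, not how Zed implements it; both function names are hypothetical.

```python
import json


def map_to_json_object(mapping):
    """Serialize a mapping as a JSON object, refusing non-string keys.

    json.dumps would silently coerce keys such as 1 or True to strings,
    so the check is explicit to surface the problem to the caller.
    """
    bad = [k for k in mapping if not isinstance(k, str)]
    if bad:
        raise ValueError(f"keys must be strings to form a JSON object: {bad!r}")
    return json.dumps(mapping)


def json_object_to_map(text):
    """Read a JSON object back into a dict, rejecting other JSON values."""
    value = json.loads(text)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object at the top level")
    return value


encoded = map_to_json_object({"in": ["561", "306"], "out": ["90"]})
decoded = json_object_to_map(encoded)
print(encoded)
```

Raising on non-string keys, rather than coercing them, mirrors the suggestion that the user should be told when the data needs transforming first.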

philrz commented Dec 13, 2024

I recently bumped into another variation of this while helping a user in a community Slack thread. Their specific example:

optimization question - every step here makes sense to me as I teased it out, but it feels a bit too much - like maybe I could do this in 3 steps instead of 6?

❯ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' |
 zq 'over this
     | flatten(this)
     | yield this[0]
     | value:=collect(value) by key
     | collect(this)
     | unflatten(this)' -

{in:["561","306"],out:["90","874","26","1020"]}

it’s having to flatten it so I can do “collect(value) by key” that seems like a chore I think

This shows how they had to piece together the same kind of over + collect + unflatten combination as those shown earlier in this issue to get their final result: a single record with values collected under different keys, i.e., what I've been calling a "pseudo-map".

I did come up with this one alternative that takes fewer steps, though maybe not much less code.

$ zq -version
Version: v1.18.0

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq 'over this
    | flatten(this)
    | in:=collect(this[0].value) where this[0].key==["in"],
      out:=collect(this[0].value) where this[0].key==["out"]' -

{in:["561","306"],out:["90","874","26","1020"]}

We had some more chat about whether it would make sense to have functionality to achieve this more directly, e.g., as the user put it:

I guess I was hoping for a syntax where I could refer to “whatever-the-key-is”

In thinking it over, part of why this doesn't show up so directly even in other languages like Python is that data like this user's, a record with a single key/value pair, is basically a simplification of the general case. A record in Zed (or a dict in Python, an object in JavaScript, etc.) can potentially have many key/value pairs, which is why the built-in approaches often take the form of "give me all the keys, and I'll iterate through them to filter out the one I want". But it can still be true that a more direct way to handle the simple case would be handy, even if it were just something delivered via a module (#2599) when that becomes an option.
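
For comparison, here is roughly how the same grouping looks in Python, using the user's input verbatim. It is a sketch to illustrate the "iterate over all the keys" pattern described above, not a proposal for Zed syntax; collect_by_key is a hypothetical name.

```python
import json

rows = json.loads(
    '[{"out":"90"},{"in":"561"},{"in":"306"},'
    '{"out":"874"},{"out":"26"},{"out":"1020"}]'
)


def collect_by_key(records):
    """Group values under their keys, whatever those keys happen to be."""
    grouped = {}
    for record in records:
        # items() is the "give me all the keys" step; a single-pair
        # record simply yields one (key, value) pair here.
        for key, value in record.items():
            grouped.setdefault(key, []).append(value)
    return grouped


grouped = collect_by_key(rows)
print(grouped)
```

Even here there is no direct way to say "whatever-the-key-is"; the single-pair case falls out of iterating over all pairs.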

In thinking about user-defined functionality that could fit into a module, I proposed this alternative:

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq -z '
func key(r): (
  flatten(r)[0].key[0]
)
func val(r): (
  flatten(r)[0].value
)
over this
| values:=collect(val(this)) by k:=key(this)
| cut this[k]:=values' -

{out:["90","874","26","1020"]}
{in:["561","306"]}

And if it needs to be in a single value, add collect_map().

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq -z '
func key(r): (
  flatten(r)[0].key[0]
)
func val(r): (
  flatten(r)[0].value
)
over this
| values:=collect(val(this)) by k:=key(this)
| cut this[k]:=values
| collect_map(|{key(this):val(this)}|)' -

|{"in":["561","306"],"out":["90","874","26","1020"]}|

And if you truly wanted just a JSON-style record/object like we had before, you could use -j to output as JSON, which has the side effect of turning the map into an object, since JSON doesn't have the map concept.

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq -j '
func key(r): (
  flatten(r)[0].key[0]
)
func val(r): (
  flatten(r)[0].value
)
over this
| values:=collect(val(this)) by k:=key(this)
| cut this[k]:=values
| collect_map(|{key(this):val(this)}|)' -

{"in":["561","306"],"out":["90","874","26","1020"]}

Part of why I fiddled with a map is that a straight collect() gives us an array of records rather than a single record with two keys, each pointing at an array value. That might be fine if you're just looking at the output with your eyes, but for the sake of the exercise I'm assuming that getting to a specific output format is important (i.e., looking to see if the language can cover it).

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq -z '
func key(r): (
  flatten(r)[0].key[0]
)
func val(r): (
  flatten(r)[0].value
)
over this
| values:=collect(val(this)) by k:=key(this)
| cut this[k]:=values
| collect(this)' -

[{out:["90","874","26","1020"]},{in:["561","306"]}]

If you happen to know the number of by groupings you're expecting, I see we can get to the "single record" like this:

$ echo '[{"out":"90"},{"in":"561"},{"in":"306"},{"out":"874"},{"out":"26"},{"out":"1020"}]' | zq -z '
func key(r): (
  flatten(r)[0].key[0]
)
func val(r): (
  flatten(r)[0].value
)
over this
| values:=collect(val(this)) by k:=key(this)
| cut this[k]:=values
| collect(this)
| yield {...this[0], ...this[1]}' -

{out:["90","874","26","1020"],in:["561","306"]}

But I don't see a quick way for the general case, i.e., no way to use the ... spread operator on "every record in this array".
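
To show what that general case would need to do, here is a small Python sketch that merges any number of records from an array into one, the equivalent of spreading each element in turn. It is for comparison only and does not correspond to existing Zed functionality; merge_records is a hypothetical name.

```python
def merge_records(records):
    """Spread every record in the list into one, later keys winning."""
    merged = {}
    for record in records:
        merged.update(record)
    return merged


# Same shape as the collect(this) output above, for any number of groups.
collected = [{"out": ["90", "874", "26", "1020"]}, {"in": ["561", "306"]}]
merged = merge_records(collected)
print(merged)
```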
