Zed lakes should handle partition keys other than "ts" #2482

Closed
philrz opened this issue Apr 1, 2021 · 1 comment · Fixed by #2729 or #2752

philrz commented Apr 1, 2021

In the Zeek-centric era, Zed lakes were assumed to always be keyed by a field of type time called ts. For a future where all generic data is welcome, lakes need to handle keys with other field names and data types.

(@mccanne recently clarified for me that while the Zed commands currently give the impression this is already possible [e.g. zed lake create -p logs -k otherkey -order asc responds as if it succeeded], the lake itself is still wired to expect the key to be ts and to sort by it. So the alternate-keyed data in this example would be treated as if every record had ts=0 and is effectively unsorted when you try to query the lake for it. We definitely have work left to do to scale it, provide the same zoom-in/zoom-out experience we've traditionally had with time, etc.)

philrz added this to the Data MVP0 milestone Apr 1, 2021
brim-bot pushed a commit to brimdata/brimcap that referenced this issue May 19, 2021
This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#2729, authored by @mccanne,
has been merged.

add support for arbitrary pool keys

The backend previously presumed the pool key was of type time.
This commit generalizes the scanning logic to allow pool keys of any type.

The seek index has been simplified: it is no longer a "micro index"
but a flat, single-level list of keys that is consulted at open time
to get the scan range for a row object when the scan is smaller than
the whole object.  This design presumes the index will be cached
when running multiple queries over a sub-range within a row object.

While updating tests, we noticed the segment size and row_size
fields were virtually the same so we deleted the size field.

As part of this commit, the index package was updated to use
zson format for key parsing instead of the deprecated tzng format,
so we removed zio/tzngio/builder.go.

Closes brimdata/super#2482
philrz linked a pull request May 21, 2021 that will close this issue

philrz commented May 28, 2021

Verified using Brim commit 4b79356 which uses Zed commit ff3d2e2.

For the security-centric use cases for which Brim/Zed have been largely used to date, one of the big benefits of pools with non-timestamp keys is being able to "join" primary log data (such as Zeek/Suricata) with additional data sources (such as threat intel or domain info), where the latter is not time-based. A join in Zed requires that both sources be sorted by the join key, so storing these additional data sources pre-sorted by that key brings significant efficiency gains.

This verification example leverages JA3 data to glean information about encrypted SSL/TLS sessions. In addition to generating the JA3 hashes with Zeek (which the Zeek embedded in the Brim app does out-of-the-box), we also bring in the "List of all user-agents" data source that can be downloaded from https://ja3er.com/getAllUasJson, which maps each JA3 MD5 hash to the user agents seen with it. Since the Zeek ssl records only have the JA3 hash, the example will store this user-friendly user agent data in a separate pool, then join it with the ssl records to decorate them directly with the user agent info.
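
For context, each entry in that download is a JSON object pairing a JA3 MD5 hash with a raw user agent string, roughly like the made-up record below (values invented purely for illustration; the real entries may carry additional fields, but only md5 and User-Agent are used here):

{"md5":"00112233445566778899aabbccddeeff","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}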

Before focusing on the user agent data, I've launched a Dev run of Brim commit 4b79356 and imported my favorite 500 MB wrccdc pcap into it, which has the effect of creating a ts-sorted pool called wrccdc.pcap that includes the Zeek ssl records with the ja3 fields populated with hashes.

While Brim is already using the Zed Lake for storage, it's not yet natively+fully able to handle these pools with non-ts keys, so we'll do our joins using the Zed CLI tooling outside of the app and then push the decorated data back in. To prepare for this, at the shell I'll set things up to work from the running app's Zed Lake root. At this point we can see the Pool that was automatically created by Brim when I imported my pcap.

$ export ZED_LAKE_ROOT="$HOME/work/brim/run/data/lake"

$ zed lake ls
wrccdc.pcap 1tAhx0sIKoOHvIZfaYjm1rEHUJH key ts order desc

It turns out there's a little prep of the user agent data that has to happen. This is unrelated to the specific topic of non-timestamp partition keys, but is an opportunity to show off some other Zed features or remind us of some items on the Zed to-do list. Specifically:

  • The data source starts life as a giant JSON array, not NDJSON. Whereas Zed can read arbitrarily large amounts of NDJSON input, buffer limitations currently restrict "regular" JSON to smaller files (Reading larger JSON arrays #2677), so we'll preprocess it into NDJSON with jq to get around this.
  • There are multiple user agents associated with the same hash, so we'll use Zed's union() aggregate function to collect them into sets.
  • The combined length of many of the user agent strings and the number of user agents per hash make the union() outputs too large. Because sets this huge might not be very user-friendly anyway, we'll take the approach here of just isolating the first word of each user agent string. We've discussed the topic of whether we can somehow handle arbitrarily large rows, and per-row size limits on aggregators #1813 tracks some of the most recent thoughts.
  • The field called User-Agent in the original JSON would pose a problem if entered "bare" in our Zed, since it would be interpreted as an expression that attempts to subtract the value in a field called Agent from the value in a field called User. Therefore we use the .["<field-name>"] syntax that lets us reference field names that contain special characters.
  • Recent testing has revealed that a successful Zed join requires not only matching join key values but also matching types. Since the ja3 field in the Zeek ssl records happens to be of the bstring type, in our preprocessing we'll cast the matching md5 field from the user agent source to bstring as well, rather than leaving it as the string it would have otherwise defaulted to. See Equality comparison in join should work between comparable types #2779 for more on this topic. (A short example of both the bracket syntax and this cast appears just after this list.)
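
To make those last two bullets concrete, here's a minimal sketch run against a single made-up record (the field values are invented, the expressions mirror the ones in the real pipeline below, and the output is omitted):

$ echo '{"md5":"00112233445566778899aabbccddeeff","User-Agent":"curl/7.68.0 (x86_64-pc-linux-gnu)"}' | zq 'put useragent:=split(.["User-Agent"], " ")[0] | put md5:=bstring(md5)' -

The first put isolates just "curl/7.68.0" as the useragent field, and the second recasts md5 from string to bstring so it compares equal to the ja3 values in the Zeek records.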

Having said all that, here's the creation of the Pool followed by the preprocessing of the user agent data source and its zed lake load into that Pool:

$ zed lake create -p useragents -orderby md5:asc
pool created: useragents

$ jq -c '.[]' getAllUasJson | zq 'put useragent:=split(.["User-Agent"], " ")[0] | useragents:=union(useragent) by md5:=bstring(md5)' - | zed lake load -p useragents -
1tAi8KeHxaAl5JIM5brpJkEvgOd committed 1 segments

$ zed lake ls
wrccdc.pcap 1tAhx0sIKoOHvIZfaYjm1rEHUJH key ts order desc
useragents 1tAi604XbFnmCgXQBlxameYWxeL key md5 order asc

Note that we never did any explicit sort in our preprocessing with jq or zq: The Zed Lake handled this for us thanks to the -orderby md5:asc config.
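
As a sanity check, one could pull a few records back out and confirm they come back in md5 order. The invocation below is just a sketch (output omitted) and assumes zed lake query accepts an inline query in addition to the -I file form used later:

$ zed lake query 'from useragents | head 5'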

Now we'll create a new Pool to hold our decorated ssl records, join our two data sources, then push the decorated data back into that new Pool. Note that while we had to sort our Zeek data by ja3 (since its pool's default sort order is by the ts timestamp), we didn't need to do the same with our user agent data, since its Pool gives it back to us already in the correct order.

$ cat join-ua.zed 
from (
  'wrccdc.pcap' => _path=='ssl' | sort ja3;
  useragents
) | join on ja3=md5 useragents

$ zed lake create -p withagents -orderby ts:desc
pool created: withagents

$ zed lake query -I join-ua.zed | zed lake load -p withagents -
1tAiSyU0EoJbDCFobMbRLbJ5l0v committed 1 segments

(Note that #2765 tracks our intent to introduce a load operator directly within Zed, which would have allowed us to perform that last operation in one shot rather than a two-step pipeline.)

Because Brim still needs some additional enhancements before it can see Pools that have been loaded from outside the app, I had to select Window > Reset State to get it to show the new Pool I just created. But once it was visible, I could "scroll right" and see this user agent data at the tail end of each of my ssl records.

Thanks @mccanne!
