Zed lakes should handle partition keys other than "ts" #2482

Closed
philrz opened this issue Apr 1, 2021 · 1 comment · Fixed by #2729 or #2752

philrz commented Apr 1, 2021

In the Zeek-centric era, Zed lakes were assumed to always be keyed by a field of type time called ts. For a future where all generic data is welcome, lakes need to handle keys with other field names and data types.

(@mccanne recently clarified for me that while the Zed commands currently give the impression this is already possible [e.g. zed lake create -p logs -k otherkey -order asc responds as if it succeeded], the lake itself is still wired to expect the key to be ts and to sort by it. So the alternate-keyed data in this example would be treated as if every record had ts=0 and is effectively unsorted when you try to query the lake for it. We definitely have work left to do to scale it, provide the same zoom-in/zoom-out experience we've traditionally had with time, etc.)

philrz added this to the Data MVP0 milestone Apr 1, 2021
brim-bot pushed a commit to brimdata/brimcap that referenced this issue May 19, 2021
This is an auto-generated commit with a Zed dependency update. The Zed PR
brimdata/super#2729, authored by @mccanne,
has been merged.

add support for arbitrary pool keys

The backend previously presumed the pool key was of type time.
This commit generalizes the scanning logic to allow pool keys of any type.

The seek index has been simplified: it is no longer a "micro index"
but a flat, single-level list of keys that is consulted at open time
to get the scan range for a row object when the scan is smaller than
the whole object.  This design presumes the index will be cached
when running multiple queries over a sub-range within a row object.

While updating tests, we noticed the segment size and row_size
fields were virtually the same so we deleted the size field.

As part of this commit, the index package was updated to use
zson format for key parsing instead of the deprecated tzng format,
so we removed zio/tzngio/builder.go.

Closes brimdata/super#2482
philrz linked a pull request May 21, 2021 that will close this issue

philrz commented May 28, 2021

Verified using Brim commit 4b79356 which uses Zed commit ff3d2e2.

For the security-centric use cases for which Brim/Zed have been largely used to date, one of the big benefits of pools with non-timestamp keys is being able to "join" primary log data (such as Zeek/Suricata) with additional data sources (such as threat intel or domain info), where the latter is not time-based. A join in Zed requires that both sources be sorted by the join key, so storing these additional data sources pre-sorted by that key brings significant efficiency gains.

This verification example leverages JA3 data to glean information about encrypted SSL/TLS sessions. In addition to generating the JA3 hashes with Zeek (which the Zeek embedded in the Brim app does out-of-the-box), we also bring in the "List of all user-agents" data source that can be downloaded from https://ja3er.com/getAllUasJson, which maps each JA3 MD5 hash to the user agents seen with it. Since the Zeek ssl records only have the JA3 hash, the example will store this user-friendly user agent data in a separate pool, then join it with the ssl records to decorate them directly with the user agent info.
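
For context, each entry in that download is a JSON object pairing a JA3 MD5 hash with a raw user agent string, roughly like the made-up record below (values invented purely for illustration; the real entries may carry additional fields, but only md5 and User-Agent are used here):

{"md5":"00112233445566778899aabbccddeeff","User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}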

Before focusing on the user agent data, I've launched a Dev run of Brim commit 4b79356 and imported my favorite 500 MB wrccdc pcap into it, which has the effect of creating a ts-sorted pool called wrccdc.pcap that includes the Zeek ssl records with the ja3 fields populated with hashes.

While Brim is already using the Zed Lake for storage, it's not yet natively+fully able to handle these pools with non-ts keys, so we'll do our joins using the Zed CLI tooling outside of the app and then push the decorated data back in. To prepare for this, at the shell I'll set things up to work from the running app's Zed Lake root. At this point we can see the Pool that was automatically created by Brim when I imported my pcap.

$ export ZED_LAKE_ROOT="$HOME/work/brim/run/data/lake"

$ zed lake ls
wrccdc.pcap 1tAhx0sIKoOHvIZfaYjm1rEHUJH key ts order desc

It turns out there's a little prep of the user agent data that has to happen. This is unrelated to the specific topic of non-timestamp partition keys, but is an opportunity to show off some other Zed features or remind us of some items on the Zed to-do list. Specifically:

  • The data source starts life as a giant JSON array, not NDJSON. Whereas Zed can read arbitrarily large amounts of NDJSON input, buffer limitations currently restrict "regular" JSON to smaller files (Reading larger JSON arrays #2677), so we'll preprocess it into NDJSON with jq to get around this.
  • There are multiple user agents associated with the same hash, so we'll use Zed's union() aggregate function to collect them into sets.
  • The combined length of many of the user agent strings and the number of user agents per hash make the union() outputs too large. Because sets this huge might not be very user-friendly anyway, we'll take the approach here of just isolating the first word of each user agent string. We've discussed the topic of whether we can somehow handle arbitrarily large rows, and per-row size limits on aggregators #1813 tracks some of the most recent thoughts.
  • The field called User-Agent in the original JSON would pose a problem if entered "bare" in our Zed, since it would be interpreted as an expression that attempts to subtract the value in a field called Agent from the value in a field called User. Therefore we use the .["<field-name>"] syntax that lets us reference field names that contain special characters.
  • Recent testing has revealed that a successful Zed join requires not only matching join key values but also matching types. Since the ja3 field in the Zeek ssl records happens to be of the bstring type, in our preprocessing we'll cast the matching md5 field from the user agent source to bstring as well, rather than leaving it as the string it would have otherwise defaulted to. See Equality comparison in join should work between comparable types #2779 for more on this topic. (A short example of both the bracket syntax and this cast appears just after this list.)
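
To make those last two bullets concrete, here's a minimal sketch run against a single made-up record (the field values are invented, the expressions mirror the ones in the real pipeline below, and the output is omitted):

$ echo '{"md5":"00112233445566778899aabbccddeeff","User-Agent":"curl/7.68.0 (x86_64-pc-linux-gnu)"}' | zq 'put useragent:=split(.["User-Agent"], " ")[0] | put md5:=bstring(md5)' -

The first put isolates just "curl/7.68.0" as the useragent field, and the second recasts md5 from string to bstring so it compares equal to the ja3 values in the Zeek records.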

Having said all that, here's the creation of the Pool followed by the preprocessing of the user agent data source and its zed lake load into that Pool:

$ zed lake create -p useragents -orderby md5:asc
pool created: useragents

$ jq -c '.[]' getAllUasJson | zq 'put useragent:=split(.["User-Agent"], " ")[0] | useragents:=union(useragent) by md5:=bstring(md5)' - | zed lake load -p useragents -
1tAi8KeHxaAl5JIM5brpJkEvgOd committed 1 segments

$ zed lake ls
wrccdc.pcap 1tAhx0sIKoOHvIZfaYjm1rEHUJH key ts order desc
useragents 1tAi604XbFnmCgXQBlxameYWxeL key md5 order asc

Note that we never did any explicit sort in our preprocessing with jq or zq: The Zed Lake handled this for us thanks to the -orderby md5:asc config.
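
As a sanity check, one could pull a few records back out and confirm they come back in md5 order. The invocation below is just a sketch (output omitted) and assumes zed lake query accepts an inline query in addition to the -I file form used later:

$ zed lake query 'from useragents | head 5'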

Now we'll create a new Pool to hold our decorated ssl records, join our two data sources, then push the decorated data back into that new Pool. Note that while we had to sort our Zeek data by ja3 (since its pool's default sort order is by the ts timestamp), we didn't need to do the same with our user agent data, since its Pool gives it back to us already in the correct order.

$ cat join-ua.zed 
from (
  'wrccdc.pcap' => _path=='ssl' | sort ja3;
  useragents
) | join on ja3=md5 useragents

$ zed lake create -p withagents -orderby ts:desc
pool created: withagents

$ zed lake query -I join-ua.zed | zed lake load -p withagents -
1tAiSyU0EoJbDCFobMbRLbJ5l0v committed 1 segments

(Note that #2765 tracks our intent to introduce a load operator directly within Zed, which would have allowed us to perform that last operation in one shot rather than a two-step pipeline.)

Because Brim still needs some additional enhancements before it can see Pools that have been loaded from outside the app, I had to select Window > Reset State to get it to show the new Pool I just created. But once it was visible, I could "scroll right" and see this user agent data at the tail end of each of my ssl records.

Thanks @mccanne!
