"Live" data import with Logstash #3151
Comments
As mentioned elsewhere in this issue, I did a follow-on experiment where I played around with Logstash's Nagle-like configuration options to avoid the abusive one-commit-per-record behavior of the first test. In my first variation, I simply added a batching format setting to the `output` stanza, then tuned two batch-related settings in the pipeline configuration.

While I don't claim that these are ideal tuning configurations to use in production, running with these settings ultimately had something like the desired effect. I was able to leave it running for several minutes as I browsed web sites, and it imported 4000 Zeek events that were ultimately spread across only 32 commits. (*) Also, while the memory problems previously identified could surely still be aggravated if this config were left running indefinitely, the improvement was dramatic: whereas before I saw the Zed process memory usage skyrocket quickly to 6+ GB, in this case it peaked at a reasonable 114 MB.

(*) Based on the numbers, I don't think the config had quite the effect I was hoping for. It's too suspicious that 4000 / 32 = 125, and 125 is the default setting for `pipeline.batch.size`.
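For concreteness, the knobs I touched were along these lines (a sketch only: the endpoint URL is assumed, and the tuning values shown are illustrative rather than the exact ones I used):

```
# logstash.conf — ask the http output to batch events into a JSON array per POST
output {
  http {
    url => "http://localhost:9867/pool/<pool-id>/log"   # endpoint shape assumed
    http_method => "post"
    format => "json_batch"
    pool_max => 1
  }
}
```

```
# logstash.yml — Nagle-like pipeline batching
# (Logstash defaults: pipeline.batch.size 125, pipeline.batch.delay 50 ms)
pipeline.batch.size: 500
pipeline.batch.delay: 2000
```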
I recently ran an experiment to see what would happen if I used an external agent (specifically, Logstash) to push data to a Zed lake rather than using `zapi load`. I'll note up front that Logstash is surely not the current state-of-the-art among import agents, but it's one with which I'm already familiar, so I figured I'd start there in the interest of time.

Overall, the experiment showed that the necessary pieces are all in place, but there were some speedbumps along the way that point to issues that may be worth addressing.
The steps below were performed with Brim commit 5da5e7d (which uses Zed commit fe1cb75) and Logstash 7.15.0. I also happened to use Zeek 4.1.1 running on my local laptop as the "live" data source, just because it was familiar and close at hand. Other than pointing it at the wireless interface on my Mac laptop to sniff "live" traffic, the only change from Zeek defaults was the addition of the json-streaming-logs package. I also did not do any tuning of Logstash, using just this simple config file `logstash.conf`:
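(Reconstructed sketch; the log path and the Zed endpoint URL are placeholders rather than the exact values I used.)

```
input {
  # Follow the JSON logs written by Zeek's json-streaming-logs package
  file {
    path => "/usr/local/var/spool/zeek/json_streaming_*.log"
  }
}

filter {
  # Parse each log line as JSON into key/value pairs
  json {
    source => "message"
  }
  # Drop Logstash's automatically-added metadata fields
  mutate {
    remove_field => ["message", "@version", "@timestamp", "host", "path"]
  }
}

output {
  # Debug copy of everything posted
  stdout {}
  # Post events to the Zed lake API endpoint that loads data into a pool
  http {
    url => "http://localhost:9867/pool/<pool-id>/log"   # endpoint shape assumed
    http_method => "post"
    format => "json"
    pool_max => 1
  }
}
```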
In brief, the `input` stanza watches the logs generated by the json-streaming-logs package in a `tail`-like manner. The `filter` stanza ensures that the lines in the log file are interpreted as JSON and turned into key/value pairs, then a bunch of Logstash's automatically-added metadata fields are dropped so they won't appear in the output. Finally, the `output` stanza sends a copy of all the posted data to the console for debug purposes, but most importantly posts the data to the Zed API endpoint that accepts data to be loaded into a pool. The `pool_max` setting also turns out to be important: by default Logstash attempts to open 50 connections in parallel to post data in the name of performance, but Zed doesn't care for this, so we've lowered it to just one connection (thanks to @nwt for isolating the importance of this setting!)

To put this to use, I first launched Brim, which starts up a Zed lake. Since GA Brim doesn't yet have a way to create empty pools, I used `zapi create` outside the app.
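That looked something like the following (a sketch: the exact flags and output format vary across zapi versions):

```
# Create an empty pool named "zeek"; zapi prints the new pool's ID
$ zapi create -p zeek
pool created: zeek <pool-id>
```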
This led to my first pause as a user. While cut & pasting the opaque pool ID was not an excessive burden, it would have been a handy convenience to have been able to instead reference the user-visible pool name (`zeek`) in the URL and have it resolved for me, similar to how `zapi load` accepts a user-visible pool name. #3132 has been opened to track this enhancement.

A subtlety of the Logstash config should also be highlighted up front, as this will become significant later. Logstash can post JSON via its HTTP output plugin in one of two ways: 1) a single event per post, or 2) an array of events. In other words, it does not have a way to post NDJSON. This was significant in my initial test because at that point Zed would not accept posts of JSON arrays via the API, though that has since changed (thanks to @nwt for #3124 and #3123). Because of this, I ended up using the event-per-post approach, which I knew would be abusive to Zed since it meant one commit per record in the pool, which has known scaling limitations that have not yet been addressed. Still, it seemed a worthwhile exercise to see what would happen.
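To make the two posting modes concrete, here's a sketch of what each looks like on the wire (the endpoint URL is assumed and the event bodies are made up):

```
# 1) One event per POST — what Logstash's http output does with format => "json"
curl -X POST -H 'Content-Type: application/json' \
  -d '{"_path":"conn","ts":"2021-10-06T17:00:00Z","uid":"Cabc123"}' \
  http://localhost:9867/pool/<pool-id>/log

# 2) An array of events per POST — what format => "json_batch" produces
curl -X POST -H 'Content-Type: application/json' \
  -d '[{"_path":"conn","ts":"..."},{"_path":"dns","ts":"..."}]' \
  http://localhost:9867/pool/<pool-id>/log
```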
At this point I started Logstash with this config so it would be ready to read the Zeek data.
Then I started Zeek.
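The startup commands were roughly as follows (the Zeek invocation is a guess at how the json-streaming-logs package gets loaded for live capture; paths vary by install):

```
# Start Logstash with the config shown above
$ bin/logstash -f logstash.conf

# Start Zeek sniffing the Mac's wireless interface, with JSON streaming logs
$ zeek -i en0 local json-streaming-logs
```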
I then started clicking through news articles in a browser to generate network traffic. As shown in the video below, at this point I could indeed see data appear in the app. A few observations:
- I felt it was pretty cool that I could actually see the pool size increment in real time. Its changes were subtle enough that it did not feel like a big distraction as I was looking at the Zeek data, but its frequent changes also helped reinforce that I was looking at a "live" pool. However, when I showed this to the Dev team, someone noted that the query to the Zed backend to populate this value is apparently pretty heavyweight, so given the one-commit-per-record granularity of this config, I guess I was being pretty abusive to get this coolness.
- Given the frequency of the updates, the omnipresent Refresh button at the bottom of the screen started to feel a little annoying. Once again, this config might not be representative of what users will ultimately do, but it might be worth UX consideration at some point.
- This probably goes without saying, but if "live" use cases like this become more common, dashboard-like features are likely to follow, e.g., being able to have charts/tiles of the data in this pool in a form that could be put onto a TV for passive viewing, with those charts/tiles updated continuously as the underlying data changes.
- It's not captured in the video, but I quickly observed that the memory usage of the Zed process skyrocketed the more I queried data in the app. @nwt helpfully looked at some profiling and confirmed that this was likely due to known issue #3002 (in-memory caching for commit objects and journal entries). I repro'ed that symptom in a more general way and captured it in detail in a comment in that issue (#3002 (comment)).
- The data being posted here is unshaped JSON, which is why the time picker in the video is not populated (the timestamps are being treated as strings, etc.). This brought to mind the desire to shape the data, which is especially relevant for Zeek since there's a reference shaper that's ready to go (see the sketch below). Thinking back to prior draft designs for a cloud service for Zed, at the time we anticipated the need for users to attach shapers to import endpoints so the incoming data could have rich data typing applied before being committed to the pool. In wondering about how this could best be achieved, no doubt it would be feasible on the `zed lake serve` side, but I also thought of the recent work on zinger, which already has the concept of applying shapers to incoming data, though at the moment only for data arriving off a Kafka topic. There was some internal discussion in the team about whether zinger might ultimately be a single-purpose tool that's Kafka-centric or more like a swiss-army-knife that could handle multiple inputs and outputs. Perhaps in this case it could have been the tool to handle the posts from Logstash, apply the shaper, and push data into the pool. I can't speak to whether that's the correct approach, but I'm putting it out there as something to consider.

As mentioned previously, one-commit-per-record is known to be inefficient, and now that Zed can auto-detect posts of JSON arrays, I'll perform a follow-on exercise to see how much better this performs if I leverage Logstash's apparent Nagle-like behaviors for batching events on the client side before posting them. That said, it might make sense for the Zed tooling to have some default/configurable Nagle-like behavior on the receive side, as this might protect users from unknowingly stumbling into a suboptimal config (such as the out-of-the-box Logstash config I used here), or to adapt to tools that simply can't do their own Nagle-like buffering on the client side. @mccanne noted that zinger already does Nagle-like receive buffering, though once again, only for Kafka at the moment. More food for thought.
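For a flavor of what shaping buys here, a rough sketch in the Zed language (this is not the actual reference shaper and the syntax is approximate; the point is that fields like `ts` get cast from JSON strings to real `time` values so things like the time picker work):

```
// Declare richer types for a Zeek conn record
type port = uint16
type conn = {
  _path: string,
  ts: time,
  uid: string,
  id: {orig_h: ip, orig_p: port, resp_h: ip, resp_p: port}
}

// Cast matching records to the declared shape
_path == "conn" | put this := shape(conn)
```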
Here's the video:
Brim.mp4