"Live" data import with Logstash #3151
Comments
As mentioned elsewhere in this issue, I did a follow-on experiment where I played around with Logstash's Nagle-like configuration options to avoid the abusive one-commit-per-record behavior of the first test. In my first variation, I simply added a batching format setting to the `output` stanza, then tuned two batch-related settings in the pipeline configuration.

While I don't claim that these are ideal tuning configurations to use in production, running with these settings ultimately had something like the desired effect. I was able to leave it running for several minutes as I browsed web sites, and it imported 4000 Zeek events that were ultimately spread across only 32 commits. (*) Also, while the memory problems previously identified could surely still be aggravated if this config were left running indefinitely, the improvement was dramatic: whereas before I saw the Zed process memory usage skyrocket quickly to 6+ GB, in this case it peaked at a reasonable 114 MB.

(*) Based on the numbers, I don't think the config had quite the effect I was hoping for. It's too suspicious that 4000 / 32 = 125, and 125 is the default setting for `pipeline.batch.size`.
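For concreteness, the knobs I touched were along these lines (a sketch only: the endpoint URL is assumed, and the tuning values shown are illustrative rather than the exact ones I used):

```
# logstash.conf — ask the http output to batch events into a JSON array per POST
output {
  http {
    url => "http://localhost:9867/pool/<pool-id>/log"   # endpoint shape assumed
    http_method => "post"
    format => "json_batch"
    pool_max => 1
  }
}
```

```
# logstash.yml — Nagle-like pipeline batching
# (Logstash defaults: pipeline.batch.size 125, pipeline.batch.delay 50 ms)
pipeline.batch.size: 500
pipeline.batch.delay: 2000
```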
I recently ran an experiment to see what would happen if I used an external agent (specifically, Logstash) to push data to a Zed lake rather than using `zapi load`. I'll note up front that Logstash is surely not the current state-of-the-art among import agents, but it's one with which I'm already familiar, so I figured I'd start there in the interest of time.

Overall, the experiment showed that the necessary pieces are all in place, but there were some speedbumps along the way that point to issues that may be worth addressing.
The steps below were performed with Brim commit 5da5e7d (which uses Zed commit fe1cb75) and Logstash 7.15.0. I also happened to use Zeek 4.1.1 running on my local laptop as the "live" data source, just because it was familiar and close at hand. Other than pointing it at the wireless interface on my Mac laptop to sniff "live" traffic, the only change from Zeek defaults was the addition of the json-streaming-logs package. I also did not do any tuning of Logstash, using just this simple config file `logstash.conf`:
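(Reconstructed sketch; the log path and the Zed endpoint URL are placeholders rather than the exact values I used.)

```
input {
  # Follow the JSON logs written by Zeek's json-streaming-logs package
  file {
    path => "/usr/local/var/spool/zeek/json_streaming_*.log"
  }
}

filter {
  # Parse each log line as JSON into key/value pairs
  json {
    source => "message"
  }
  # Drop Logstash's automatically-added metadata fields
  mutate {
    remove_field => ["message", "@version", "@timestamp", "host", "path"]
  }
}

output {
  # Debug copy of everything posted
  stdout {}
  # Post events to the Zed lake API endpoint that loads data into a pool
  http {
    url => "http://localhost:9867/pool/<pool-id>/log"   # endpoint shape assumed
    http_method => "post"
    format => "json"
    pool_max => 1
  }
}
```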
In brief, the `input` stanza watches the logs generated by the json-streaming-logs package in a `tail`-like manner. The `filter` stanza ensures that the lines in the log file are interpreted as JSON and turned into key/value pairs, then a bunch of Logstash's automatically-added metadata fields are dropped so they won't appear in the output. Finally, the `output` stanza sends a copy of all the posted data to the console for debug purposes, but most importantly posts the data to the Zed API endpoint that accepts data to be loaded into a pool. The `pool_max` setting also turns out to be important: by default Logstash attempts to open 50 connections in parallel to post data in the name of performance, but Zed doesn't care for this, so we've lowered it to just one connection (thanks to @nwt for isolating the importance of this setting!)

To put this to use, I first launched Brim, which starts up a Zed lake. Since GA Brim doesn't yet have a way to create empty pools, I used `zapi create` outside the app.
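That looked something like the following (a sketch: the exact flags and output format vary across zapi versions):

```
# Create an empty pool named "zeek"; zapi prints the new pool's ID
$ zapi create -p zeek
pool created: zeek <pool-id>
```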
This led to my first pause as a user. While cut & pasting the opaque pool ID was not an excessive burden, it would have been a handy convenience to have been able to instead reference the user-visible pool name (`zeek`) in the URL and have it resolved for me, similar to how `zapi load` accepts a user-visible pool name. #3132 has been opened to track this enhancement.

A subtlety of the Logstash config should also be highlighted up front, as this will become significant later. Logstash can post JSON via its HTTP output plugin in one of two ways: 1) a single event per post, or 2) an array of events. In other words, it does not have a way to post NDJSON. This was significant in my initial test because at that point Zed would not accept posts of JSON arrays via the API, though that has since changed (thanks to @nwt for #3124 and #3123). Because of this, I ended up using the event-per-post approach, which I knew would be abusive to Zed since it meant one commit per record in the pool, which has known scaling limitations that have not yet been addressed. Still, it seemed a worthwhile exercise to see what would happen.
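To make the two posting modes concrete, here's a sketch of what each looks like on the wire (the endpoint URL is assumed and the event bodies are made up):

```
# 1) One event per POST — what Logstash's http output does with format => "json"
curl -X POST -H 'Content-Type: application/json' \
  -d '{"_path":"conn","ts":"2021-10-06T17:00:00Z","uid":"Cabc123"}' \
  http://localhost:9867/pool/<pool-id>/log

# 2) An array of events per POST — what format => "json_batch" produces
curl -X POST -H 'Content-Type: application/json' \
  -d '[{"_path":"conn","ts":"..."},{"_path":"dns","ts":"..."}]' \
  http://localhost:9867/pool/<pool-id>/log
```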
At this point I started Logstash with this config so it would be ready to read the Zeek data.
Then I started Zeek.
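The startup commands were roughly as follows (the Zeek invocation is a guess at how the json-streaming-logs package gets loaded for live capture; paths vary by install):

```
# Start Logstash with the config shown above
$ bin/logstash -f logstash.conf

# Start Zeek sniffing the Mac's wireless interface, with JSON streaming logs
$ zeek -i en0 local json-streaming-logs
```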
I then started clicking through news articles in a browser to generate network traffic. As shown in the video below, at this point I could indeed see data appear in the app. A few observations:
- I felt it was pretty cool that I could actually see the pool size increment in real time. Its changes were subtle enough that it did not feel like a big distraction as I was looking at the Zeek data, but its frequent changes also helped reinforce that I was looking at a "live" pool. However, when I showed this to the Dev team, someone noted that the query to the Zed backend to populate this value is apparently pretty heavyweight, so given the one-commit-per-record granularity of this config, I guess I was being pretty abusive to get this coolness.
- Given the frequency of the updates, the omnipresent Refresh button at the bottom of the screen started to feel a little annoying. Once again, this config might not be representative of what users will ultimately do, but it might be worth UX consideration at some point.
- This probably goes without saying, but if "live" use cases like this become more common, dashboard-like features are likely to follow, e.g., being able to have charts/tiles of the data in this pool in a form that could be put onto a TV for passive viewing, with those charts/tiles updated continuously as the underlying data changes.
- It's not captured in the video, but I quickly observed that the memory usage of the Zed process skyrocketed the more I queried data in the app. @nwt helpfully looked at some profiling and confirmed that this was likely due to known issue #3002 (in-memory caching for commit objects and journal entries). I repro'ed that symptom in a more general way and captured it in detail in a comment in that issue (#3002 (comment)).
- The data being posted here is unshaped JSON, which is why the time picker in the video is not populated (the timestamps are being treated as strings, etc.). This brought to mind the desire to shape the data, which is especially relevant for Zeek since there's a reference shaper that's ready to go (see the sketch below). Thinking back to prior draft designs for a cloud service for Zed, at the time we anticipated the need for users to attach shapers to import endpoints so the incoming data could have rich data typing applied before being committed to the pool. In wondering about how this could best be achieved, no doubt it would be feasible on the `zed lake serve` side, but I also thought of the recent work on zinger, which already has the concept of applying shapers to incoming data, though at the moment only for data arriving off a Kafka topic. There was some internal discussion in the team about whether zinger might ultimately be a single-purpose tool that's Kafka-centric or more like a swiss-army-knife that could handle multiple inputs and outputs. Perhaps in this case it could have been the tool to handle the posts from Logstash, apply the shaper, and push data into the pool. I can't speak to whether that's the correct approach, but I'm putting it out there as something to consider.

As mentioned previously, one-commit-per-record is known to be inefficient, and now that Zed can auto-detect posts of JSON arrays, I'll perform a follow-on exercise to see how much better this performs if I leverage Logstash's apparent Nagle-like behaviors for batching events on the client side before posting them. That said, it might make sense for the Zed tooling to have some default/configurable Nagle-like behavior on the receive side, as this might protect users from unknowingly stumbling into a suboptimal config (such as the out-of-the-box Logstash config I used here), or to adapt to tools that simply can't do their own Nagle-like buffering on the client side. @mccanne noted that zinger already does Nagle-like receive buffering, though once again, only for Kafka at the moment. More food for thought.
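For a flavor of what shaping buys here, a rough sketch in the Zed language (this is not the actual reference shaper and the syntax is approximate; the point is that fields like `ts` get cast from JSON strings to real `time` values so things like the time picker work):

```
// Declare richer types for a Zeek conn record
type port = uint16
type conn = {
  _path: string,
  ts: time,
  uid: string,
  id: {orig_h: ip, orig_p: port, resp_h: ip, resp_p: port}
}

// Cast matching records to the declared shape
_path == "conn" | put this := shape(conn)
```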
Here's the video:
Brim.mp4