"Live" data import with Fluentd #4271
I revived this config while testing the compaction currently performed by the lake manager (#3923). This made me wish the JSON Zeek data could actually be properly shaped so I'd have real

Things to note:
This looks awesome! +1 for ZNG support <3 https://zed.brimdata.io/docs/commands/zq#performance-comparisons
I wonder what would happen if you replaced https://github.com/corelight/json-streaming-logs/blob/master/scripts/main.zeek#L127 with https://github.com/zeek/zeek/blob/master/scripts/base/frameworks/logging/writers/ascii.zeek#L12
@mrbluecoat: Glad to hear it sounds encouraging! This material actually hasn't been updated in a while, and the Zed lake has evolved a bit in that time (and I imagine Fluentd might have changed a bit too). I've been starting to more formally document these kinds of topics in the "Integrations" area of the Zed docs and would be keen to do that with this Fluentd stuff if it's something you could see putting to use. However, as I'm doing that, I'd like to make sure that whatever it covers is relevant to your goals. I imagine GitHub Issue comments might not be the best forum for laying out those details, so could you ping me through other channels so we could discuss? If you like Slack, you could send yourself an invite to our community Slack workspace and message me.
Will do, thanks!
As anticipated in my previous comment here, I did end up writing an article expanding and formalizing the config originally sketched out in this issue. The article can be found at https://zed.brimdata.io/docs/next/integrations/fluentd and will be tagged with the docs version for the next GA Zed release, which is expected soon. The user who chimed in here in previous comments did follow up on Slack and reported having followed the article and successfully observing live data import. Therefore I'm going to go ahead and close this issue, with the expectation that the article may continue to be improved over time as users find and follow it and share their experiences.
In an exercise similar to #3151 with Logstash, I've run a test with the external agent Fluentd pushing "live" data continuously to a Zed lake.
tl;dr - As Fluentd is a more modern tool than Logstash, in the end it did indeed seem to play more nicely with Zed in a minimal out-of-the-box config.
The test was performed with a Zed lake running behind Zui insiders 0.30.1-142 (which is Zed commit 101d358), Fluentd 1.15.3, and Zeek 5.1.1 (with the json-streaming-logs package) sniffing the coincidental live local traffic on my wireless interface during a typical work-from-home day.

The `fluentd.conf` I ultimately settled on:
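A minimal sketch of such a config, assuming Zeek's json-streaming-logs output lands under /opt/zeek/logs, the lake service is listening on its default port 9867, and the target pool is named zeek (paths and pool name are placeholders):

```
# Input: tail the Zeek JSON streaming logs (path and pos_file are
# placeholder assumptions; point them at your actual log directory).
<source>
  @type tail
  path /opt/zeek/logs/json_streaming_*.log
  pos_file /opt/zeek/logs/fluentd.pos
  tag zeek
  <parse>
    @type json
  </parse>
</source>

# Output: POST events to the Zed lake's load endpoint. The explicit
# content_type matters (see below). With out_http's default
# json_array false, the payload is NDJSON, and the default 60s
# flush_interval batches many records into each commit.
<match zeek>
  @type http
  endpoint http://127.0.0.1:9867/pool/zeek/branch/main
  content_type application/json
</match>
```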
Then to start Fluentd:
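A sketch of the startup commands, assuming a gem-installed fluentd is on the PATH (the ulimit value is illustrative):

```
# Raise the open file descriptor limit first, per Fluentd's "Before
# Installation" docs, then launch with the config above.
ulimit -n 65536
fluentd -c fluentd.conf
```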
The `ulimit` guidance came from the "Before Installation" docs and did end up being significant, as the default setting on my MacBook was 256, and in an early run Fluentd did indeed complain of running out of file descriptors.

The only minor speedbump I hit along the way was needing to specify `content_type application/json`. At first, after reading the relevant docs, I thought the defaults would work, since the doc explains that `json_array` is `false` by default and hence it posts in NDJSON (which `zed serve` accepts) and sets the Content-Type to `application/x-ndjson`. However, this caused an HTTP 400 error: the Zed backend handles both regular JSON and NDJSON through the same reader, and this is wired up only through the Content-Type `application/json`, so I had to set that explicit `content_type`.
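For illustration, a sketch of the working request shape (the endpoint path follows the Zed lake API; the pool name and records here are made up):

```
# An NDJSON payload labeled application/json loads fine; the same
# payload sent with 'Content-Type: application/x-ndjson' drew the
# HTTP 400 described above.
curl -X POST 'http://127.0.0.1:9867/pool/zeek/branch/main' \
  -H 'Content-Type: application/json' \
  --data-binary $'{"_path":"conn","ts":"2023-01-05T17:51:10Z"}\n{"_path":"dns","ts":"2023-01-05T17:51:11Z"}'
```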
Other than that, the defaults are way more friendly than what we saw with Logstash. As #3151 gets into, Logstash's default behavior was to either send a single JSON object per post or be configured to send a JSON array of objects. The former is known to perform poorly (#4266), and with the latter the posts end up in the pool as whole arrays and hence currently need to be post-processed in some way if they're going to be treated as individual Zed records. By comparison, Fluentd's default behavior not only sends NDJSON but also does so at a `flush_interval` of `60s`, which means more records per commit and hence better performance. There are also a lot of tuning parameters on that page, indicating users have much more flexibility at their disposal if these defaults are undesirable.

Having let it run for a couple of hours, 7523 Zeek events have accumulated in my pool. Here's a look at the metadata that shows object size/count, plus some crude math for the average data object size/count:
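A sketch of the kind of meta-queries involved (pool name assumed, and the exact field names on the `:objects` meta-records are assumptions):

```
# Show the pool's data objects, including per-object record counts
# and sizes.
zed query -Z "from zeek:objects"

# Crude math: average record count and size per data object.
zed query -f table "from zeek:objects | avg_count:=avg(count), avg_size:=avg(size)"
```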
Obviously that size is still way below the default 500 MB object size and hence we'd still be waiting a long time to see benefit from compaction. However, compared to the single-object commits discussed in #3151 and #4266, the data as stored here is much more forgiving when accessed in queries, e.g., counting is still very quick:
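For example, something like (pool name again assumed):

```
# Fast because records were batched into a modest number of commits
# and objects, not one commit per record.
zed query -f text "from zeek | count()"
```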