
Reduce the amount the agent logs by default #4252

Open · strawgate opened this issue Feb 13, 2024 · 31 comments
Labels: Team:Elastic-Agent (Label for the Agent team), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@strawgate (Contributor) commented Feb 13, 2024:

Describe the enhancement:

We'd like an audit of the currently logged messages to make sure that the information at the default INFO log level is useful and valuable when shipped to Fleet.

Describe a specific use case for the enhancement or feature:

10-15% of the data produced by Agent is internal logs and metrics. We'd like to reduce the volume of Agent internal monitoring, as excessive production of this data consumes significant disk space within customer environments.

What is the definition of done?

We have measurably reduced the amount of logs the agent generates by default. Ideally we will have reduced agent monitoring log volume by 80% in normal usage and have run out of obvious ideas or superfluous messages to remove.

@elasticmachine (Contributor):

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pierrehilbert assigned AndersonQ and pchila and unassigned AndersonQ on Apr 8, 2024
@cmacknz (Member) commented Apr 8, 2024:

I worry about doing this until elastic/kibana#158861 is implemented, in case it turns out we forgot to convert some critical information to the warn level and it is needed for debugging. I don't want users with support cases having to manually revert to the info level on an agent-by-agent basis.

There are definitely a lot of low-value logs at the info level today, so what we could do is identify those and improve/remove/convert them to debug. I think the spirit of this issue is to reduce the cost of storing agent logs; we can likely make significant progress on that without having to actually change the default log level.

@pierrehilbert (Contributor):

Thanks for raising this @cmacknz.
I agree and have set this one as blocked for now.

@nimarezainia (Contributor):

@pierrehilbert We still have time to get all this in before FF. I think it's important for our users to have these changes. I will work with the Fleet team to make the UI changes that are required, but I suggest this work not be blocked and that we move forward. This is not a big change (code-wise, not impact-wise).

@cmacknz We have the policy override API available. Could we not use that in case there's an issue?

@pierrehilbert (Contributor):

This was not part of the current sprint and the FF is on Tuesday.
As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it as blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

@cmacknz (Member) commented Apr 9, 2024:

> @cmacknz We have the policy override API available. Could we not use that in case there's an issue?

No, I don't think the agent respects log level changes from the policy right now.

I also think this issue is prescribing a solution instead of stating the problem. I think we want to send fewer logs by default at all levels. Changing the default level to warn is just a quick way to reduce log volume, but it does so by removing critical information.

There is information at the info level that we constantly use for debugging and that makes no sense as a warning message. For example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put them at the warning level, but they aren't really warnings.

@amitkanfer (Contributor):

> There is information at the info level that we constantly use for debugging and that makes no sense as a warning message. For example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put them at the warning level, but they aren't really warnings.

A possible solution for this is to introduce an ALWAYS log level, something I did successfully in the past. This level would log all those messages we always want to log (similar to ERROR, I believe, in terms of severity), even though they are not errors nor warnings. We'll need to spend some time identifying such logs, of course, but this would unblock us from changing the default to WARN.

@nimarezainia (Contributor):

> This was not part of the current sprint and the FF is on Tuesday. As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it as blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

@pierrehilbert the Fleet issue is scheduled for sp27. It is dependent on #2851, so we have a circular situation. #2851 is indeed in sp27, so I'm asking for #2851 to be completed.

@cmacknz The Fleet issue and #2851 will set the default to Warn for every policy, which will address the concern. We can use this issue to work on readjusting which category various logs belong to, which is a longer process.

@cmacknz (Member) commented Apr 10, 2024:

> @cmacknz elastic/kibana#158861 and #2851 will set the default to Warn for every policy.

This issue has to be done first, before we can change the default level to warn. We can add the ability to change the log level on a per-policy basis, and customers that care can opt in to that, but I don't want to change the default until we are sure the default has the information we want.

If we change the default log level to warn without adjusting what is logged at the warn level, then every single support case will have the additional step "adjust the log level back to info for the affected agents".

This adds friction and will increase how long it takes to resolve support escalations. I don't think this is a good thing.

@nimarezainia (Contributor) commented Apr 11, 2024:

Thanks @cmacknz. Then I suggest we rearrange the deck chairs, since the other two issues are currently slated for completion in sp27. This issue shouldn't have been marked as blocked by elastic/kibana#158861 - I understand why it was done, as we need a way to quickly go back to the old default value. But either way, as you said, this issue has to be completed first.

cmacknz changed the title from "Change default log level from Info to Warn" to "Reduce the amount the agent logs by default" on Apr 11, 2024
@cmacknz (Member) commented Apr 11, 2024:

Yes, I retitled this to "reduce the amount the agent logs" since I don't think it matters what the log level actually is. The point is that we can log less than we currently do by default, and we need to investigate how to do that.

Whether we are at info or warn doesn't matter; if we can get info to remain useful while logging only as much as warn does today, we don't need to change the default log level at all.

@strawgate (Contributor, author):

> the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics

Could the 30s output statistics messages just be captured by the internal agent Metricbeat monitoring? Or why is it a log line?

@cmacknz (Member) commented Apr 15, 2024:

> Or why is it a log line?

So it can be in the diagnostics captured locally from the agent. Often that is all we get when debugging user issues, and we need the data these summaries include; specifically, we often need the history they provide over time.

There are other ways to solve the problem the 30s metric summaries in the logs are solving, but we can't just get rid of them without some replacement for the data they provide.

We could stop shipping those metrics summaries to Fleet, though, by adding a drop processor in the monitoring Filebeat.
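
A minimal sketch of what that drop processor could look like in the monitoring Filebeat's configuration, assuming the summaries keep their current "Non-zero metrics in the last ..." message prefix (the matcher text and placement are assumptions, not a final implementation):

```yaml
# Hypothetical snippet for the monitoring Filebeat that ships agent logs:
# drop the periodic metrics summaries before they are sent to the
# monitoring output, while leaving them in the local log files.
processors:
  - drop_event:
      when:
        contains:
          message: "Non-zero metrics in the last"
```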

@nimarezainia (Contributor) commented Apr 23, 2024:

As per discussions, moving this issue to sprint 28 so it can be done in the 8.14 time frame.

@nimarezainia (Contributor):

@pchila thanks for working on the PR. Would it be possible to get the final benchmark on this and the metricset changes, to see what the percentage reduction is altogether for the default use cases? Thank you.

@pierrehilbert (Contributor):

@nimarezainia, @pchila asked the ES team for some advice on doing this kind of benchmark. For now, only the metricset change has been merged, so you won't get more than that one.

@cmacknz (Member) commented Apr 30, 2024:

The data stream API can tell us the size the logs data streams currently take on disk. https://www.elastic.co/guide/en/elasticsearch/reference/current/data-stream-stats-api.html
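
For reference, a sketch of that request in Dev Tools (the data stream pattern is illustrative; the metrics data streams can be added to the list as needed):

```
GET /_data_stream/logs-elastic_agent-*/_stats?human=true
```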

The main problem will be in creating properly comparable data sets. We need to make sure we cover the same time range for the same operations to avoid doing size calculations where one of the datasets naturally contains more logs because it did more work, or covers a longer time period.

I think we almost want something like the Filebeat --once or an --until command line parameter for the agent, so that it runs for exactly some amount of time and then stops. That might get us close enough; there is always the chance there are a couple of extra log lines from one of the agents, but if we run the test for long enough that hopefully won't have a large impact.

We could take a similar approach and run two agents on two Ubuntu VMs started at approximately the same time, pointed at two different ES clusters, then compare the data stream sizes after a period of time.

A third way would be to have the agent log to a file for some well-defined period of time, then truncate so the final timestamps align, and ingest the results into ES separately and compare the sizes.

@pchila (Member) commented Apr 30, 2024:

@cmacknz I am actually working on a fourth way (I call this the "salami slice index" method):

  • Clone the index template logs-elastic_agent onto a new template, without using it for a data stream
  • Reindex a subset of the logs we have in the logs-elastic_agent-* data stream onto a new index matching the new index template (we can select by agent id, agent version and/or timeframe, etc.) as a baseline (first slice; see the reindex sketch below)
  • Reindex another subset of logs for the updated version, for the same timeframe, onto another index (second slice)

At this point we can compare the indices and it should be a more apples-to-apples comparison.

Edit: I forgot to mention that I am preparing a small script for setting up the templates, indices, etc. (if useful, I could commit it to the elastic-agent repo). While testing the method, I noticed that removing the periodic "Non-zero metrics..." message reduces the index size by ~9.2% on a short agent run (~1 hour).
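
For illustration, a sketch of what building one slice could look like, assuming the cloned template matches an index name such as agent-logs-slice-* (all index names, IDs, and timestamps below are placeholders):

```
POST _reindex
{
  "source": {
    "index": "logs-elastic_agent-default",
    "query": {
      "bool": {
        "filter": [
          { "term": { "agent.id": "BASELINE-AGENT-ID" } },
          { "range": { "@timestamp": { "gte": "2024-04-30T10:00:00Z", "lt": "2024-04-30T11:00:00Z" } } }
        ]
      }
    }
  },
  "dest": { "index": "agent-logs-slice-baseline" }
}

GET _cat/indices/agent-logs-slice-*?v&h=index,docs.count,store.size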

@strawgate (Contributor, author):

Can we test enabling synthetic source in the index_template? That may be a ~50% win.
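
If we do test it, a sketch of the kind of change this would mean on a cloned template, assuming an 8.x cluster where synthetic source is enabled through the _source mapping mode (template name and pattern are illustrative):

```
PUT _index_template/agent-logs-synthetic-test
{
  "index_patterns": ["agent-logs-synthetic-test-*"],
  "template": {
    "mappings": {
      "_source": { "mode": "synthetic" }
    }
  }
}
```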

@pchila (Member) commented May 3, 2024:

I redid my measurement of what we gain by removing the "Non-zero metrics..." log in the related PR description: #4633 (comment)

ycombinator added the Team:Elastic-Agent-Control-Plane label on May 4, 2024
@elasticmachine (Contributor):

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila (Member) commented May 9, 2024:

> Can we test enabling synthetic source in the index_template? That may be a ~50% win.

I did a quick check on synthetic source (according to the documentation it's GA only on time_series indices), and the elastic-agent metrics indices in 8.15.0 already have this setting:
[screenshots: index settings showing synthetic source already enabled on the elastic-agent metrics indices]

So not much to gain there, unfortunately.

@pchila (Member) commented May 9, 2024:

@strawgate #2851 has a PR open and it should be merged during the current sprint...

I was wondering if the definition of done for this issue is still the one quoted below:

> What is the definition of done?
>
> Agent logs at Warn level by default or switch to warn level shortly after Agent startup.
>
> Start agent, notice no more Info level messages and an overall reduction in log volume.

According to @nimarezainia's comment on the Kibana issue (elastic/kibana#180778 (comment)), I think we want the initial log level to be INFO? Or do we still want to have warning as the initial log level?

Furthermore, since the last 8.14.0 BC has already started, I am not sure we can cram more code modifications in there, so this will likely be implemented in a future release.

@strawgate (Contributor, author):

The definition of done is probably something like: reduce agent monitoring log volume by 80% in normal usage.

Initially, we thought this would be done by switching the default log level, as doing that is easily reversible by customers. Later the decision was made to just trim the logs present at the Info level.

@cmacknz (Member) commented May 9, 2024:

Yes, I just adjusted the description and definition of done.

We can still target changes to 8.14.1 by backporting them after the 8.14.0 release.

@ycombinator (Contributor) commented May 23, 2024:

Next step, as discussed in today's team meeting:

  • Set up Agent with a typical/common scenario: system + defend + osquery integrations, with Agent logs and metrics monitoring enabled.
  • Using this setup, run Agent for a reasonable period of time (5 minutes?) so that we collect enough log data to see statistically significant patterns (see next point).
  • Figure out the statistical distribution of the logs and use it to identify the most frequently emitted messages, allowing us to focus on those towards the goal of reducing the amount of agent logs by default (a query sketch follows below).
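
As a starting point, a sketch of how that distribution could be pulled from the monitoring data streams with terms aggregations (field names assume the usual ECS mapping of agent logs; the time range should match the test run):

```
GET logs-elastic_agent*/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "by_dataset": { "terms": { "field": "data_stream.dataset", "size": 10 } },
    "by_logger": { "terms": { "field": "log.logger", "size": 20 } }
  }
}
```

Grouping by data_stream.dataset shows which Beat dominates the volume, and log.logger approximates which messages are emitted most often without aggregating on the (non-aggregatable) message text.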

@ycombinator (Contributor) commented May 28, 2024:

> Set up Agent with a typical/common scenario: system + defend + osquery integrations, with Agent logs and metrics monitoring enabled.

@strawgate Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?

@strawgate (Contributor, author) commented May 28, 2024:

For logs-elastic_agent* we should target reduction, in this order:

  1. logs-elastic_agent.filebeat-*
  2. logs-elastic_agent.metricbeat-*
  3. logs-elastic_agent-*

We can assume all other logs are a rounding error.

For a sense of scale, we should assume:

  • a 30% reduction in elastic_agent.filebeat logs is worth a 100% reduction in elastic_agent.metricbeat logs
  • a 50% reduction in elastic_agent.metricbeat is worth a 100% reduction in elastic_agent logs.
  • Improvements to any other data streams are functionally incidental

We should pick the framework that can best allow us to measure total volume and then reductions in the above data streams.
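
Whichever framework we pick, the sizes of exactly these three data streams can be read off the data stream stats API; a sketch:

```
GET /_data_stream/logs-elastic_agent.filebeat-*,logs-elastic_agent.metricbeat-*,logs-elastic_agent-*/_stats?human=true
```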

@ycombinator (Contributor) commented May 28, 2024:

Followed up with Bill off-issue re: my question about coming up with a standard Agent profile we should use for benchmarking log reduction improvements. Bringing our conversation back into this issue:

> @ycombinator: Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?
> @strawgate: I don't think we need defend or osquery, and I think we should find whatever integrations leverage Filebeat and Metricbeat and enable as many as we can reasonably enable.

@cmacknz IIRC, you'd suggested in last week's team meeting that we enable defend and osquery. Could you and @strawgate work out whether we should or should not enable these as part of a standard profile for benchmarking Agent? Similarly, we need a list of specific integrations that leverage Filebeat and Metricbeat, because "as many as we can" could be pretty much every integration, so we need some specific narrowing criteria implied by "reasonably enable". We will resume work on this issue once we have this specific Agent profile defined.

@cmacknz (Member) commented May 28, 2024:

Installing the system integration is all you need to get a single instance of Filebeat and Metricbeat running. System is also great because the data source is the machine the agent is running on.

Running more integrations would be great but we need them to do real work to find out if they log too much, and that is much more complicated than system.

Let's just start with a basic baseline of Elastic Agent with the System integration installed to keep things as simple as possible. Most of the improvements we have brainstormed so far are testable with just that.

@strawgate (Contributor, author) commented May 28, 2024:

> reasonably enable

I meant without really doing anything :)

I'll reach out to see if we have any customers that might be willing to run log data analysis and provide us a result like:
[screenshot: example log data analysis showing per-message volumes, including an nginx "file has no content" error]

I'm also wondering if we might want to set up some integrations in blank states (nothing configured) as a one-off, to see if there are any producing large volumes of errors (like the nginx "file has no content" error in the screenshot).
