
Reduce the amount the agent logs by default #4252

Open · strawgate opened this issue Feb 13, 2024 · 31 comments
Labels: Team:Elastic-Agent (Label for the Agent team), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@strawgate (Contributor) commented Feb 13, 2024:

Describe the enhancement:

We'd like an audit of the currently logged messages to make sure that the information at the default INFO log level is useful and valuable when shipped to Fleet.

Describe a specific use case for the enhancement or feature:

10-15% of the data produced by Agent is internal logs and metrics. We'd like to reduce the volume of Agent internal monitoring, as excessive production of this data consumes significant disk space within customer environments.

What is the definition of done?

We have measurably reduced the amount of logs the agent generates by default. Ideally we will have reduced agent monitoring log volume by 80% in normal usage and have run out of obvious ideas or superfluous messages to remove.

@elasticmachine (Contributor):

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pierrehilbert assigned AndersonQ and pchila and unassigned AndersonQ on Apr 8, 2024
@cmacknz (Member) commented Apr 8, 2024:

I worry about doing this until elastic/kibana#158861 is implemented, in case it turns out we forgot to convert some critical information to the warn level and it is needed for debugging. I don't want users with support cases having to manually revert to the info level on an agent-by-agent basis.

There are definitely a lot of low-value logs at the info level today, so what we could do is identify those and improve/remove/convert them to debug. I think the spirit of this issue is to reduce the cost of storing agent logs; we can likely make significant progress on that without having to actually change the default log level.

@pierrehilbert (Contributor):

Thanks for raising this @cmacknz.
I agree and have set this one as blocked for now.

@nimarezainia (Contributor):

@pierrehilbert We still have time to get all this in before FF. I think it's important for our users to have these changes. I will work with the Fleet team to make the UI changes that are required, but I suggest this work not be blocked and that we move forward. This is not a big change (code-wise, not impact-wise).

@cmacknz We have the policy override API available. Could we not use that in case there's an issue?

@pierrehilbert (Contributor):

This was not part of the current sprint and the FF is on Tuesday.
As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it as blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

@cmacknz (Member) commented Apr 9, 2024:

> @cmacknz We have the policy override API available. Could we not use that in case there's an issue?

No, I don't think the agent respects log level changes from the policy right now.

I also think this issue is prescribing a solution instead of stating the problem. I think we want to send fewer logs by default at all levels. Changing the default level to warn is just a quick way to reduce log volume, but it does so by removing critical information.

There is information at the info level that we constantly use for debugging and that makes no sense as a warning message. For example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put them at the warning level, but they aren't really warnings.

@amitkanfer (Contributor):

> There is information at the info level that we constantly use for debugging and that makes no sense as a warning message. For example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put them at the warning level, but they aren't really warnings.

A possible solution for this is to introduce an ALWAYS log level, something I did successfully in the past. This level would log all those messages we always want to log (similar to ERROR, I believe, in terms of severity), even though they are not errors nor warnings. We'll need to spend some time identifying such logs, of course, but this would unblock us from changing the default to WARN.

@nimarezainia (Contributor):

> This was not part of the current sprint and the FF is on Tuesday. As you mentioned, this change is not a big one from a code perspective, so let's continue to consider it as blocked, and if Fleet is able to have everything done before FF, we will rediscuss this.

@pierrehilbert the Fleet issue is scheduled for sp27. It is dependent on #2851, so we have a circular situation. #2851 is indeed in sp27, so I'm asking for #2851 to be completed.

@cmacknz The Fleet issue and #2851 will set the default to Warn for every policy, which will address the concern. We can use this issue to work on readjusting which category various logs belong to, which is a longer process.

@cmacknz (Member) commented Apr 10, 2024:

> @cmacknz elastic/kibana#158861 and #2851 will set the default to Warn for every policy.

This issue has to be done first, before we can change the default level to warn. We can add the ability to change the log level on a per-policy basis, and customers that care can opt in to that, but I don't want to change the default until we are sure the default has the information we want.

If we change the default log level to warn without adjusting what is logged at the warn level, then every single support case will have the additional step "adjust the log level back to info for the affected agents".

This adds friction and will increase how long it takes to resolve support escalations. I don't think this is a good thing.

@nimarezainia (Contributor) commented Apr 11, 2024:

Thanks @cmacknz. Then I suggest we rearrange the deck chairs, since the other two issues are currently slated for completion in sp27. This issue shouldn't have been marked as blocked by elastic/kibana#158861 - I understand why it was done, as we need a way to quickly go back to the old default value. But either way, as you said, this issue has to be completed first.

cmacknz changed the title from "Change default log level from Info to Warn" to "Reduce the amount the agent logs by default" on Apr 11, 2024
@cmacknz (Member) commented Apr 11, 2024:

Yes, I retitled this to "reduce the amount the agent logs" since I don't think it matters what the log level actually is. The point is that we can log less than we currently do by default, and we need to investigate how to do that.

Whether we are at info or warn doesn't matter; if we can get info to remain useful while logging only as much as warn does today, we don't need to change the default log level at all.

@strawgate (Contributor, author):

> the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics

Could the 30s output statistics messages just be captured by the internal agent Metricbeat monitoring? Or why is it a log line?

@cmacknz (Member) commented Apr 15, 2024:

> Or why is it a log line?

So it can be in the diagnostics captured locally from the agent. Often that is all we get when debugging user issues, and we need the data these summaries include; specifically, we often need the history they provide over time.

There are other ways to solve the problem the 30s metric summaries in the logs are solving, but we can't just get rid of them without some replacement for the data they provide.

We could stop shipping those metrics summaries to Fleet, though, by adding a drop processor in the monitoring Filebeat.
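
A minimal sketch of what that drop processor could look like in the monitoring Filebeat's configuration, assuming the summaries keep their current "Non-zero metrics in the last ..." message prefix (the matcher text and placement are assumptions, not a final implementation):

```yaml
# Hypothetical snippet for the monitoring Filebeat that ships agent logs:
# drop the periodic metrics summaries before they are sent to the
# monitoring output, while leaving them in the local log files.
processors:
  - drop_event:
      when:
        contains:
          message: "Non-zero metrics in the last"
```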

@nimarezainia (Contributor) commented Apr 23, 2024:

As per discussions, moving this issue to sprint 28 so it can be done in the 8.14 time frame.

@nimarezainia (Contributor):

@pchila thanks for working on the PR. Would it be possible to get the final benchmark on this and the metricset changes, to see what the percentage reduction is altogether for the default use cases? Thank you.

@pierrehilbert (Contributor):

@nimarezainia, @pchila asked the ES team for some advice on doing this kind of benchmark. For now, only the metricset change has been merged, so you won't get more than that one.

@cmacknz (Member) commented Apr 30, 2024:

The data stream API can tell us the size the logs data streams currently take on disk. https://www.elastic.co/guide/en/elasticsearch/reference/current/data-stream-stats-api.html
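
For reference, a sketch of that request in Dev Tools (the data stream pattern is illustrative; the metrics data streams can be added to the list as needed):

```
GET /_data_stream/logs-elastic_agent-*/_stats?human=true
```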

The main problem will be in creating properly comparable data sets. We need to make sure we cover the same time range for the same operations to avoid doing size calculations where one of the datasets naturally contains more logs because it did more work, or covers a longer time period.

I think we almost want something like the Filebeat --once or an --until command line parameter for the agent, so that it runs for exactly some amount of time and then stops. That might get us close enough; there is always the chance there are a couple of extra log lines from one of the agents, but if we run the test for long enough that hopefully won't have a large impact.

We could take a similar approach and run two agents on two Ubuntu VMs started at approximately the same time, pointed at two different ES clusters, then compare the data stream sizes after a period of time.

A third way would be to have the agent log to a file for some well-defined period of time, then truncate so the final timestamps align, and ingest the results into ES separately and compare the sizes.

@pchila (Member) commented Apr 30, 2024:

@cmacknz I am actually working on a fourth way (I call this the "salami slice index" method):

  • Clone the index template logs-elastic_agent onto a new template, without using it for a data stream
  • Reindex a subset of the logs we have in the logs-elastic_agent-* data stream onto a new index matching the new index template (we can select by agent id, agent version and/or timeframe, etc.) as a baseline (first slice; see the reindex sketch below)
  • Reindex another subset of logs for the updated version, for the same timeframe, onto another index (second slice)

At this point we can compare the indices and it should be a more apples-to-apples comparison.

Edit: I forgot to mention that I am preparing a small script for setting up the templates, indices, etc. (if useful, I could commit it to the elastic-agent repo). While testing the method, I noticed that removing the periodic "Non-zero metrics..." message reduces the index size by ~9.2% on a short agent run (~1 hour).
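
For illustration, a sketch of what building one slice could look like, assuming the cloned template matches an index name such as agent-logs-slice-* (all index names, IDs, and timestamps below are placeholders):

```
POST _reindex
{
  "source": {
    "index": "logs-elastic_agent-default",
    "query": {
      "bool": {
        "filter": [
          { "term": { "agent.id": "BASELINE-AGENT-ID" } },
          { "range": { "@timestamp": { "gte": "2024-04-30T10:00:00Z", "lt": "2024-04-30T11:00:00Z" } } }
        ]
      }
    }
  },
  "dest": { "index": "agent-logs-slice-baseline" }
}

GET _cat/indices/agent-logs-slice-*?v&h=index,docs.count,store.size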

@strawgate (Contributor, author):

Can we test enabling synthetic source in the index_template? That may be a ~50% win.
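
If we do test it, a sketch of the kind of change this would mean on a cloned template, assuming an 8.x cluster where synthetic source is enabled through the _source mapping mode (template name and pattern are illustrative):

```
PUT _index_template/agent-logs-synthetic-test
{
  "index_patterns": ["agent-logs-synthetic-test-*"],
  "template": {
    "mappings": {
      "_source": { "mode": "synthetic" }
    }
  }
}
```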

@pchila (Member) commented May 3, 2024:

I redid my measurement of what we gain by removing the "Non-zero metrics..." log in the related PR description: #4633 (comment)

ycombinator added the Team:Elastic-Agent-Control-Plane label on May 4, 2024
@elasticmachine (Contributor):

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila (Member) commented May 9, 2024:

> Can we test enabling synthetic source in the index_template? That may be a ~50% win.

I did a quick check on synthetic source (according to the documentation it's GA only on time_series indices), and the elastic-agent metrics indices in 8.15.0 already have this setting:
[screenshots: index settings showing synthetic source already enabled on the elastic-agent metrics indices]

So not much to gain there, unfortunately.

@pchila (Member) commented May 9, 2024:

@strawgate #2851 has a PR open and it should be merged during the current sprint...

I was wondering if the definition of done for this issue is still the one quoted below:

> What is the definition of done?
>
> Agent logs at Warn level by default or switch to warn level shortly after Agent startup.
>
> Start agent, notice no more Info level messages and an overall reduction in log volume.

According to @nimarezainia's comment on the Kibana issue (elastic/kibana#180778 (comment)), I think we want the initial log level to be INFO? Or do we still want to have warning as the initial log level?

Furthermore, since the last 8.14.0 BC has already started, I am not sure we can cram more code modifications in there, so this will likely be implemented in a future release.

@strawgate (Contributor, author):

The definition of done is probably something like: reduce agent monitoring log volume by 80% in normal usage.

Initially, we thought this would be done by switching the default log level, as doing that is easily reversible by customers. Later the decision was made to just trim the logs present at the Info level.

@cmacknz (Member) commented May 9, 2024:

Yes, I just adjusted the description and definition of done.

We can still target changes to 8.14.1 by backporting them after the 8.14.0 release.

@ycombinator (Contributor) commented May 23, 2024:

Next step, as discussed in today's team meeting:

  • Set up Agent with a typical/common scenario: system + defend + osquery integrations, with Agent logs and metrics monitoring enabled.
  • Using this setup, run Agent for a reasonable period of time (5 minutes?) so that we collect enough log data to see statistically significant patterns (see next point).
  • Figure out the statistical distribution of the logs and use it to identify the most frequently emitted messages, allowing us to focus on those towards the goal of reducing the amount of agent logs by default (a query sketch follows below).
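
As a starting point, a sketch of how that distribution could be pulled from the monitoring data streams with terms aggregations (field names assume the usual ECS mapping of agent logs; the time range should match the test run):

```
GET logs-elastic_agent*/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "by_dataset": { "terms": { "field": "data_stream.dataset", "size": 10 } },
    "by_logger": { "terms": { "field": "log.logger", "size": 20 } }
  }
}
```

Grouping by data_stream.dataset shows which Beat dominates the volume, and log.logger approximates which messages are emitted most often without aggregating on the (non-aggregatable) message text.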

@ycombinator (Contributor) commented May 28, 2024:

> Set up Agent with a typical/common scenario: system + defend + osquery integrations, with Agent logs and metrics monitoring enabled.

@strawgate Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?

@strawgate (Contributor, author) commented May 28, 2024:

For logs-elastic_agent* we should target reduction, in this order:

  1. logs-elastic_agent.filebeat-*
  2. logs-elastic_agent.metricbeat-*
  3. logs-elastic_agent-*

We can assume all other logs are a rounding error.

For a sense of scale, we should assume:

  • a 30% reduction in elastic_agent.filebeat logs is worth a 100% reduction in elastic_agent.metricbeat logs
  • a 50% reduction in elastic_agent.metricbeat is worth a 100% reduction in elastic_agent logs.
  • Improvements to any other data streams are functionally incidental

We should pick the framework that can best allow us to measure total volume and then reductions in the above data streams.
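
Whichever framework we pick, the sizes of exactly these three data streams can be read off the data stream stats API; a sketch:

```
GET /_data_stream/logs-elastic_agent.filebeat-*,logs-elastic_agent.metricbeat-*,logs-elastic_agent-*/_stats?human=true
```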

@ycombinator (Contributor) commented May 28, 2024:

Followed up with Bill off-issue re: my question about coming up with a standard Agent profile we should use for benchmarking log reduction improvements. Bringing our conversation back into this issue:

> @ycombinator: Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?
> @strawgate: I don't think we need defend or osquery, and I think we should find whatever integrations leverage Filebeat and Metricbeat and enable as many as we can reasonably enable.

@cmacknz IIRC, you'd suggested in last week's team meeting that we enable defend and osquery. Could you and @strawgate work out whether we should or should not enable these as part of a standard profile for benchmarking Agent? Similarly, we need a list of specific integrations that leverage Filebeat and Metricbeat, because "as many as we can" could be pretty much every integration, so we need some specific narrowing criteria implied by "reasonably enable". We will resume work on this issue once we have this specific Agent profile defined.

@cmacknz (Member) commented May 28, 2024:

Installing the system integration is all you need to get a single instance of Filebeat and Metricbeat running. System is also great because the data source is the machine the agent is running on.

Running more integrations would be great but we need them to do real work to find out if they log too much, and that is much more complicated than system.

Let's just start with a basic baseline of Elastic Agent with the System integration installed to keep things as simple as possible. Most of the improvements we have brainstormed so far are testable with just that.

@strawgate (Contributor, author) commented May 28, 2024:

> reasonably enable

I meant without really doing anything :)

I'll reach out to see if we have any customers that might be willing to run log data analysis and provide us a result like:
[screenshot: example log data analysis showing per-message volumes, including an nginx "file has no content" error]

I'm also wondering if we might want to set up some integrations in blank states (nothing configured) as a one-off, to see if there are any producing large volumes of errors (like the nginx "file has no content" error in the screenshot).
