Reduce the amount the agent logs by default #4252
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
I worry about doing this until elastic/kibana#158861 is implemented, in case it turns out we forgot to convert some critical information to the warn level and it is needed for debugging. I don't want users with support cases having to manually revert to the info level on an agent-by-agent basis. There are definitely a lot of low-value logs at the info level today, so what we could do is identify those and improve/remove/convert them to debug. I think the spirit of this issue is to reduce the cost of storing agent logs; we can likely make significant progress on that without having to actually change the default log level.
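For context, on a standalone agent that manual revert is an edit to elastic-agent.yml along these lines (a minimal sketch, assuming the documented agent.logging.* settings; Fleet-managed agents would need the per-policy support tracked in the linked Kibana issue):

```yaml
# Sketch: manually reverting a standalone agent's log level in elastic-agent.yml.
# Exact keys may vary by version; shown only to illustrate the support-case friction.
agent.logging:
  level: info     # revert from warning back to info for debugging
  to_files: true  # keep writing logs locally so diagnostics include them
```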
Thx for raising this @cmacknz.
@pierrehilbert We still have time to get all this in before FF. I think it's important for our users to have these changes. I will work with the Fleet team to make the UI changes that are required. But I suggest this work is not blocked and we move forward. This is not a big change (code-wise, not impact-wise). @cmacknz We have the policy override API available. Could we not use that in case there's an issue?
This was not part of the current sprint and the FF is on Tuesday.
No, I don't think the agent respects log level changes from the policy right now. I also think this issue is prescribing a solution instead of stating the problem. I think we want to send fewer logs by default at all levels. Changing the default level to warn is just a quick way to reduce log volume, but it does so by removing critical information. There is information at the info level that we use for debugging constantly that makes no sense as a warning message; for example, the 30s output statistic summaries would be lost, making performance problems impossible to see in diagnostics. We could put those at the warning level, but they aren't really warnings.
Possible solution for this is to introduce an …
@pierrehilbert the Fleet issue is scheduled for sp27. It is dependent on #2851, so we have a circular situation. #2851 is indeed sp27, so I'm asking for #2851 to be completed. @cmacknz The Fleet issue and #2851 will set the default to Warn for every policy, which will address the concern. We can use this issue to work on readjusting which category various logs belong to, which is a longer process.
This issue has to be done first, before we can change the default level to warn. We can add the ability to change the log level on a per-policy basis, and customers that care can opt in to that, but I don't want to change the default until we are sure the default has the information we want. If we change the default log level to warn without adjusting what is logged at the warn level, then every single support case will have the additional step "adjust the log level back to info for the affected agents". This adds friction and will increase how long it takes to resolve support escalations. I don't think this is a good thing.
Thanks @cmacknz. Then I suggest we rearrange the deck chairs, since the other two issues are currently slated for completion in sp27. This issue shouldn't have been marked as blocked by elastic/kibana#158861; I understand why it was done, as we need a way to quickly go back to the old default value. But either way, as you said, this issue has to be completed first.
Yes, I retitled this to "reduce the amount the agent logs" since I don't think it matters what the log level actually is. The point is that we can log less than we currently do by default, and we need to investigate how to do that. Whether we are at info or warn doesn't matter: if we can keep info useful while logging only as much as warn does today, we don't need to change the default log level at all.
Could the 30s output statistics messages just be captured by the internal agent Metricbeat monitoring? Or why are they log lines?
So they can be included in diagnostics captured locally from the agent. Often that is all we get when debugging user issues, and we need the data they include; specifically, we often need the history they provide over time. There are other ways to solve the problem the 30s metric summaries in the logs are solving, but we can't just get rid of them without some replacement for the data they provide. We could stop shipping those metric summaries to Fleet, though, by adding a drop processor in the monitoring Filebeat.
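For illustration, that drop processor could look roughly like this (a sketch, assuming the monitoring Filebeat accepts a standard processors section and that the summaries keep the current libbeat "Non-zero metrics in the last 30s" message text):

```yaml
# Sketch: drop the periodic metric summaries before they are shipped to Fleet,
# while keeping them available in local log files and diagnostics.
processors:
  - drop_event:
      when:
        contains:
          message: "Non-zero metrics in the last 30s"
```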
As per discussions, moving this issue to sprint 28 so it can be done in the 8.14 time frame.
@pchila thanks for working on the PR. Would it be possible to get a final benchmark on this and the metricset changes, to see what the percentage reduction is altogether for default use cases? Thank you.
@nimarezainia, @pchila asked the ES team for some advice on doing this kind of benchmark. For now, only the metricset change has been merged, so you won't get more than that one.
The data stream API can tell us the size the logs data streams currently take on disk: https://www.elastic.co/guide/en/elasticsearch/reference/current/data-stream-stats-api.html

The main problem will be in creating properly comparable data sets. We need to make sure we cover the same time range for the same operations, to avoid doing size calculations where one of the datasets naturally contains more logs because it did more work or covers a longer time period. I think we almost want something like the Filebeat …

We could take a similar approach and run two agents on two Ubuntu VMs started at approximately the same time, pointed at two different ES clusters, then compare the data stream sizes after a period of time.

A third way would be to have the agent log to a file for some well-defined period of time, then truncate to align the final timestamps, and ingest the results into ES separately and compare the sizes.
@cmacknz I am actually working on a fourth way (I call this the "salami slice index" method): …
Edit: I forgot to mention, I am preparing a small script for setting up the templates, indices, etc. (if useful I could commit it in the elastic-agent repo), but while testing the method I noticed that removing the periodic …
Can we test enabling synthetic source in the index_template? That may be a ~50% win.
I redid my measurement of what we gain by removing the …
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Did a quick check on the synthetic source (according to the documentation it's GA only on …), so not much to gain there, unfortunately.
@strawgate #2851 has a PR open and it should be merged during the current sprint... I was wondering if the definition of done for this issue is still the one quoted below
According to @nimarezainia's comment on the Kibana issue elastic/kibana#180778 (comment), I think we want the initial log level to be INFO? Or do we still want to have …? Furthermore, since the last 8.14.0 BC has already started, I am not sure we can cram more code modifications in there, so this will likely be implemented in a future release.
The definition of done is probably something like: reduce agent monitoring log volume by 80% in normal usage. Initially, we thought this would be done by switching the default log level, as doing that is easily reversible by customers. Later, the decision was made to just trim the logs present at Info.
Yes, I just adjusted the description and definition of done. We can still target changes to 8.14.1 by backporting them after the 8.14.0 release.
Next step, as discussed in today's team meeting: …
@strawgate Are you good with using this particular setup/profile from here on out for measuring any improvements from reducing the amount of Agent logs?
For …
We can assume all other logs are a rounding error. For a sense of scale, we should assume: …
We should pick the framework that can best allow us to measure total volume, and then reductions, in the above data streams.
Followed up with Bill off-issue re: my question about coming up with a standard Agent profile we should use for benchmarking log reduction improvements. Bringing our conversation back into this issue: …
@cmacknz IIRC, you'd suggested in last week's team meeting that we enable Defend and Osquery. Could you and @strawgate work out whether we should or should not enable these as part of a standard profile for benchmarking Agent …?

Similarly, we need a list of specific integrations that leverage Filebeat and Metricbeat, because "as many as we can" could be pretty much every integration, so we need some specific narrowing criteria implied by "reasonably enable".

We will resume work on this issue once we have this specific Agent profile defined.
Installing the System integration is all you need to get a single instance of Filebeat and Metricbeat running. System is also great because the data source is the machine the agent is running on. Running more integrations would be great, but we need them to do real work to find out if they log too much, and that is much more complicated than System.

Let's just start with a basic baseline of Elastic Agent with the System integration installed, to keep things as simple as possible. Most of the improvements we have brainstormed so far are testable with just that.
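As a rough sketch of that baseline in standalone form (the output host and credentials are placeholders; the Fleet-managed equivalent is simply a policy with only the System integration added):

```yaml
# Sketch: minimal baseline of Elastic Agent + System integration (standalone form).
outputs:
  default:
    type: elasticsearch
    hosts: ["https://localhost:9200"]
    api_key: "<api-key placeholder>"

inputs:
  - type: system/metrics
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
```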
Describe the enhancement:
We'd like an audit of the currently logged messages to make sure that the information at the default INFO log level is useful and valuable when shipped to Fleet.
Describe a specific use case for the enhancement or feature:
10-15% of the data produced by Agent is internal logs and metrics. We'd like to reduce the volume of Agent internal monitoring, as excessive production of this data consumes significant disk space within customer environments.
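For reference, the self-monitoring data in question is controlled by the agent's monitoring settings, roughly as follows (a sketch of the standalone form; exact keys may differ by version):

```yaml
# Sketch: the settings that control how much internal monitoring data the agent ships.
agent.monitoring:
  enabled: true
  logs: true        # ship the agent's own logs
  metrics: true     # ship the agent's own metrics
  use_output: default
```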
What is the definition of done?
We have measurably reduced the amount of logs the agent generates by default. Ideally, we will have reduced agent monitoring log volume by 80% in normal usage and have run out of obvious ideas or superfluous messages to remove.