filter: aws metadata group retries #7245
Conversation
@PettitWesley Hey, I wonder if you could find some time to check out this PR. Absolutely no pressure though.
@mwarzynski Sorry, I think since it is a draft I never got notified for this.
plugins/filter_aws/aws.c
if (ctx->tag_is_enabled[i] == FLB_TRUE) {
    ctx->new_keys++;
flb_plg_error(ctx->ins, "Failed to get instance EC2 Tags, will not retry");
if (ret == -3) {
From looking at the code myself quickly, it looks like this is only returned if the config is invalid?
So that's the one case we don't retry?
I think I can agree with that.
For other cases, I can think of three options:
- Just retry forever on each flush and keep emitting error messages. This ensures that if a user enables tagging or fixes IMDS, Fluent Bit will start working without restart. The downside is that you get a spam of errors which might increase your log cost.
- Retry some set number of times that we track with a counter. This could be user configurable.
- Retry at some fixed interval which the user can configure. So maybe by default it's 5 minutes or something, so you won't get more than one error message per 5 minutes. You track the last retry time in the code. If the user sets the time to -1, then maybe that means no retries, just fail right away.
I like #3 and I think we want to give users the option to Require tags which could be supported with some zero or negative value which means no retries.
#3 ensures we can't spam their logs with errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial version of this PR is a 'proposal': minimal effort to handle 500s. In reality it is just meant to start the discussion about the desired solution, but it is most likely not the solution itself.
From looking at the code myself quickly, it looks like this is only returned if the config is invalid?
So that's the one case we don't retry?
-3 means that the configuration is invalid, and Fluent Bit will exit with an error code if that happens.
Yes, that's the only case we don't retry -- we simply exit the application if that happens. For all other cases, if fetching tags fails for any reason other than invalid configuration, we won't retry fetching them (and all flushed logs won't have the tags injected).
if a user enables tagging or fixes IMDS, Fluent Bit will start working without restart
Changes to the EC2 instance metadata options and tag values are only refreshed with an EC2 restart, so the user would have to restart the instance anyway for the change to take effect (this doesn't apply to Nitro instances though). I don't have as much knowledge about the IMDS, so it might be one of the cases that can be fixed without a restart.
- Just retry forever on each flush and keep emitting error messages. This ensures that if a user enables tagging or fixes IMDS, Fluent Bit will start working without restart. The downside is that you get a spam of errors which might increase your log cost.
Retrying on each flush might be overkill, especially since these errors may not change. I admit it would be the simplest approach, so that's a plus.
In my opinion, we should allow the user to configure the maximum retry rate (with sane defaults provided). I wouldn't worry about the spam of errors, as there is really no reasonable alternative. If the user can define the maximum retry rate, that indirectly controls the amount of error logging. In the end, at best someone will notice these errors and fix the underlying issue.
- Retry some set number of times
It is an interesting option, but if an issue were fixed only after the defined maximum number of retries, our filter would never load the tags. I would rather expect the application to retry indefinitely, but maybe at a lower rate if the error is persistent. That way it at least has a chance to fix itself.
- Retry at some fixed interval which the user can configure. So maybe by default it's 5 minutes or something
Yeah, I also like 'Option 3'. Let's call this 'Fixed Interval Retries'.
@PettitWesley I will try to implement this option as the next version of this PR.
I may have a few more ideas/suggestions.
NIT: Abstraction Layer
First of all, I don't think we should treat EC2 tags in a special way. EC2 tags are special in only one respect: EC2 must have the instance-metadata-options enabled in order for this functionality to work. In other words, we shouldn't retry 404s for EC2 tags. All other errors should be treated the same as errors when fetching other pieces of metadata information (like the instance ID). If we intend to implement a new retries mechanism for EC2 tags, then I would keep in mind that in the future it might be reused for other metadata information groups. That's just to say we should try to provide a decent abstraction layer for the retries, but we will see how it plays out in practice.
Retries with Exponential Backoff
In terms of the retries and the proposal for 'Fixed Interval Retries': personally, I usually consider exponential backoff whenever I attempt to add retries to a component. I wonder if it would provide some value in this case as well. The main advantage of exponential backoff is that Fluent Bit would fix itself faster if the underlying issue is short-lived/temporary. It might start with retries at every flush and, if it fails repeatedly, adjust the retry duration to longer periods -- possibly even longer than we would specify for the Fixed Interval Retries, as a sufficient number of failed retries suggests the issue is likely still occurring.
Example where exp backoff would be better:
- [t+0s] flush, ec2 tags failed => logs without tags
- [t+1s] flush, retry ec2 tags failed => logs without tags
- [t+2s] flush, retry ec2 tags success => logs with tags
- [t+3s] flush => logs with tags
- [t+4s] flush => logs with tags
- [t+5s] flush => logs with tags
Whereas in the 'Fixed Interval Retries', assuming a default retry duration set to 5s, the logs pushed at t+{2,3,4}s would have no tags.
Of course, if the issue were persistent, then the retry duration would be adjusted accordingly. It depends on the configuration, but assuming: 0) flush interval = 1s, 1) start_retry_duration = 100ms, 2) exp = 2, 3) max_retry_duration = 15m:
[0.1, // => 1s [min allowed flush duration is 1s, so we get to do something every 1s -- can't retry sooner]
0.2, // => 1s
0.4, // => 1s
0.8, // => 1s
1.6, // => 2s
3.2, // => 4s
6.4, // => 7s
12.8, // => 13s
25.6, // => 26s
51.2, // => 52s
102.4, // => 103s = 1m43s
204.8, // => 205s = 3m25s
409.6, // => 410s = 6m50s
819.2, // => 820s = 13m40s
900, // => 900s = 15m
900, // => 900s = 15m [max duration]
...
In this example, the first 4 retries would happen every 1s. Then we would wait increasingly longer for the next retries (doubling each time). At the end, we would wait 15m between retries. Of course, the values are configurable -- this is only an example.
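To make the idea concrete, here is a minimal C sketch of the backoff bookkeeping described above. All names (struct, fields, helpers) are illustrative assumptions, not code from this PR; start/factor/max_interval correspond to start_retry_duration, exp and max_retry_duration.

```c
#include <time.h>

/* Illustrative backoff state; names are assumptions for this example. */
struct backoff_state {
    double interval;       /* current retry interval in seconds        */
    double start;          /* initial interval, e.g. 0.1 (100ms)       */
    double factor;         /* growth factor, e.g. 2.0                  */
    double max_interval;   /* cap, e.g. 900 (15m)                      */
    time_t last_attempt;   /* timestamp of the last fetch attempt      */
};

/* Returns 1 if enough time has passed since the last attempt to retry. */
static int backoff_should_retry(struct backoff_state *b, time_t now)
{
    return (double) (now - b->last_attempt) >= b->interval;
}

/* After a failed attempt: record the time and grow the interval up to the cap. */
static void backoff_on_failure(struct backoff_state *b, time_t now)
{
    b->last_attempt = now;
    b->interval *= b->factor;
    if (b->interval > b->max_interval) {
        b->interval = b->max_interval;
    }
}

/* After a successful attempt: reset to the initial interval. */
static void backoff_on_success(struct backoff_state *b, time_t now)
{
    b->last_attempt = now;
    b->interval = b->start;
}
```

Since the filter only gets a chance to run at flush time, any interval below the flush period effectively means "retry at every flush", which matches the 1s entries in the table above.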
What do you think? Do you see some value in introducing more complexity and exponential backoff with the potential advantage of more logs with injected tags?
Allow blocking log pushes if EC2 tags weren't fetched
For some use cases (for transparency, mine included), it might be useful to block pushing logs until Filter AWS successfully fetches all information groups. I mean that during the plugin initialization phase, it would have to fetch every piece of information enabled in the configuration -- not only EC2 tags, but for example also the instance ID. If Fluent Bit couldn't correctly fetch all configured data points and this special configuration option were enabled, the init function would fail and Fluent Bit should fail too. This would defer the retries to upper levels like systemd.
The idea is that the more instances we have, the less useful logs without tags are. At the moment, if Fluent Bit flushes logs without tags, I am not really able to attribute them to particular components and debug issues accordingly. It might be an edge case and a specialised use case, but I would propose exposing an option in the configuration that fails the Filter AWS initialization if fetching one of the groups fails.
Let me know what you think about my suggestions. As I said, I will try to implement the 'Fixed Interval Retries' without any additional concepts (i.e. without any of my suggestions above) as the next version of the PR. However, if you agree with some of my suggestions, then we might think about adding more complexity to the PR.
(Also, sorry for the late reply, I somehow missed the notifications from this PR too...)
Changes to the EC2 instance metadata options and tag values are only refreshed with an EC2 restart, so the user would have to restart the instance anyway for the change to take effect (this doesn't apply to Nitro instances though). I don't have as much knowledge about the IMDS, so it might be one of the cases that can be fixed without a restart.
Ah I see... I was testing on a Nitro-based instance before... I thought this was how it always works.
I like both your "fixed interval retries" (simpler user config and experience) and exponential backoff retries ideas. Backoff has the downside that as your retry time grows, if you do fix the issue, it takes a while for the fix to take effect. Also, configuring a backoff formula is probably more complicated for users? What would be your proposed config options? For FLB backoff, we have scheduler.cap and scheduler.base to configure the max retry time and the backoff formula; we could do the same here with sane defaults. https://docs.fluentbit.io/manual/administration/scheduling-and-retries
For some use cases (for transparency, mine included), it might be useful to block pushing logs until Filter AWS fetches successfully all information groups. I mean it would have to fetch every enabled information in the configuration -- not only EC2 tags, but for example also instance id.
Maybe we handle this with a special zero value for the timeout?
it might be useful to block pushing logs until Filter AWS fetches successfully all information groups
Maybe we handle this with a special zero value for the timeout?
OK, this is a good idea. Do you want me to include this behavior of retries=0 in this PR?
Force-pushed from f5f365f to 89990d3.
@PettitWesley I pushed the first version which should satisfy the minimal requirements for 'Fixed Interval Retries' and handle 500s for EC2 tag requests. I initially set the default value for the retry interval to 5s. We can make it longer (e.g. 5m). I tested my changes locally. In the local run, Fluent Bit got 2 logs per second on the input, while the settings had a 1s flush interval. Here are logs from the EC2 Metadata Server 'mock':
As you can see, Fluent Bit made requests for EC2 tags every 5s even though flushes were happening every 1s. If no logs were flushed, the interval would be even longer (until the next flush). Please note that I introduced a concept of metadata_group, so that we can separate the concerns and define how critical each of the metadata groups is in the future. At the moment, if anything from the 'base' group fails, then we don't include any successfully fetched information in the log. However, I believe the correct approach would be to separate the concerns and either push whatever we managed to fetch, or block on retries for the metadata groups specified as required in the configuration. For this reason, I added another struct definition. Let me know what you think about the configuration design for the retry interval in seconds.
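For readers skimming the thread, a rough sketch of what such per-group bookkeeping could look like. The field names here are illustrative assumptions, not the exact fields from the PR:

```c
#include <time.h>

/* Illustrative per-group state: each metadata group ('base' fields such as
 * the instance id, and EC2 tags) tracks its own fetch status, so a failure
 * in one group does not prevent injecting the others. */
struct metadata_group_state {
    int    enabled;             /* group requested in the filter configuration */
    int    done;                /* fetched successfully (or known to be empty) */
    time_t last_fetch_attempt;  /* when we last tried to fetch this group      */
};
```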
Yeah, it is. I would stick with a simpler solution for the first iteration. I can propose the exponential backoff in a follow-up PR (if we think it's worth it).
I think we should make the configuration more explicit. In case someone wants to require EC2 tags in the logs, we can add another configuration option for that. What do you think?
Force-pushed from 89990d3 to 1f9e707.
@mwarzynski I like the idea of creating a "required" tag group which causes the filter to fail to send logs unless certain metadata can be fetched.
Would you be okay with losing the logs instead? That sentence should read "unless", not "until", because that's what will happen if the filter won't pass through any logs unless a value can be fetched. I think that's probably a very niche use case?
@PettitWesley In my personal opinion, I would rather submit logs without the necessary information than lose them altogether. Of course, this decision may vary depending on the specific use case. However, I likely would not be comfortable with losing the logs completely. It is preferable to have the logs without certain metadata than to lose them entirely. Isn't it dependent on the type of input? In other words, inputs can vary in terms of source persistence: (1) non-persistent sources, where dropped logs are gone for good, and (2) persistent sources, such as files read from disk, where the data still exists at the source.
I would assume that in the scenario where required metadata is not fetched, the persistent input logs (2) would not be lost. At the same time, the filter probably has no reliable way to distinguish source persistence, so given that we are not comfortable with losing the non-persistent logs, it might not be productive to introduce the 'required' option at all.
It most likely is. Furthermore, I would like to emphasize that it would address a specific scenario that is not expected to occur frequently, if at all.
So the tail input keeps a file offset position. Once it reads the logs, it advances the offset and considers the data read. If a filter then drops it all, the tail input still considers it read. Take a look at the code in the tail input plugin.
plugins/filter_aws/aws.c
{
    int ret;

    ctx->metadata_groups[FLB_FILTER_AWS_METADATA_GROUP_TAGS].last_execution = time(NULL);
nit: last_fetch or last_attempt I feel makes more sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I renamed this field to last_fetch_attempt.
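As a rough illustration of how this timestamp gates the retries at flush time (a sketch under assumed names: the helper, the done flag and the retry interval parameter are assumptions; the metadata_groups array and FLB_FILTER_AWS_METADATA_GROUP_TAGS appear in the diff above):

```c
#include <time.h>

/* Sketch: returns 1 if we should attempt (re)fetching this group now.
 * 'group' uses the illustrative struct sketched earlier in this thread;
 * retry_interval_s is the configured interval in seconds. */
static int group_should_fetch(struct metadata_group_state *group,
                              int retry_interval_s, time_t now)
{
    if (group->done != 0) {
        return 0;                        /* already fetched, nothing to do   */
    }
    if (now - group->last_fetch_attempt < retry_interval_s) {
        return 0;                        /* too soon, wait for the interval  */
    }
    group->last_fetch_attempt = now;     /* record the attempt */
    return 1;
}
```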
plugins/filter_aws/aws.c
@@ -1048,6 +1114,12 @@ static struct flb_config_map config_map[] = {
    "if both tags_include and tags_exclude are specified, configuration is invalid"
    " and plugin fails"
},
{
    FLB_CONFIG_MAP_INT, "tags_retry_interval_s", "5",
I think we should pick a default that's higher than 5? That's 12 calls per minute, which is a lot.
I've changed the default value to 300 (5m).
Commits:
- group_tag and group_base instead of array of groups
- add test cases for injecting the following information: instance_id, instance_type, private_ip, vpc_id, ami_id, account_id, hostname, az
- 'retry_interval_s' is specified in the filter aws configuration
Force-pushed from cce345d to 73bd05c.
I rebased my changes on the latest origin:master. It compiled successfully on my machine.
I used the config that you have in the description and got a single attempt to request tags, which fails since it's not enabled; then it proceeded to add other metadata to logs:
From what I can tell, it never retried fetching the tags and never got another failed response. I think I'm not understanding something here. I configured the retry interval option.
plugins/filter_aws/aws.c
metadata_fetched = FLB_FALSE;
}

if (metadata_fetched) {
nit: Eduardo says to always use if (metadata_fetched == FLB_TRUE) as FLB code style
I think I understand. IMDS returns 404 when tagging is not enabled, which counts as a non-retryable error. I think this PR and code is fine as is. However, remember the Nitro case you brought up, where a user can enable tags while FLB is running without restarting the instance. I think those users are the ones who might want this retry feature. AFAICT, there are two features here: retrying transient failures (like 500s), which this PR covers, and retrying the 404 case so that users who enable tags later eventually get them.
What do you think?
So if we ship this as is, the use case is that it will retry if IMDS returns a 500? I think that's very useful and thank you for adding it. I am not sure if it justifies a new config option. I suppose configurability is better than nothing, but I am struggling to imagine how users pick their desired value. What impacts the value you want? Why did you choose 5s as the default?
@PettitWesley Hey! Thank you for checking it out.
Yes, correct. Fluent Bit also produced a warning about this case (taken from your logs):
Yes. From the perspective of Fluent Bit's behaviour, there are two main goals/deliverables with this PR: (1) if fetching one metadata group (e.g. EC2 tags) fails, still inject the groups that were fetched successfully instead of dropping all AWS metadata, and (2) retry fetching the failed groups at an interval not smaller than the configured retry_interval_s.
One of the value propositions here, as I perceive it, would be to handle errors related to fetching AWS metadata groups. I had experienced issues in the past with our 'in-house component' due to EC2 instances not being able to connect to the 'aws metadata server'. This was difficult to debug, because all errors concluded with a message that the AWS metadata server is 'not reachable'. Ultimately we didn't find the root cause of this issue. We do have a bug in the backlog and are monitoring whether it will happen again. (We didn't find anything helpful for troubleshooting in the AWS documentation, but maybe we missed something.) To be honest, this PR would probably not help us avoid that bug, as retries would still consistently hit the 'not reachable' error. That doesn't imply the PR won't help in other cases, though; our component might have automatically recovered from similar issues thanks to the 'retry strategy' already in place, which is why I might simply not know about such situations... In general, I believe there might be various technical issues while fetching an AWS metadata group. Therefore adding simple retries to mitigate these errors seems like a good solution.
I assume your questions were regarding the following topics. (If there are additional things you meant, feel free to ask.)
I believe retries should be enabled by default for all use cases (not only Nitro EC2 instances, but also regular EC2 instances). It's the same line of thinking: there might be an issue with fetching metadata groups => it would be nice to automatically retry fetching the metadata groups and ensure we do our best to push enriched logs. Retry attempts are made, in the worst case, at every flush, so even if retries fail at each attempt, there should be no noticeable performance effect overall. (Well, okay, it's a hypothesis; we might confirm this via a real-world test.) I struggle to see a good argument in favor of disabling the retries.
You can make that decision. Introducing complexity in a popular project isn't the same as doing so in a hobby project. You will have to maintain these config options in the future (and take care of the communication with users). I guess this means we want to add a new option only if we absolutely need to.
Note that the current PR does not support the use case of dynamic changes to the AWS metadata group values. Let's imagine we actually had such a solution for refreshing the AWS metadata groups (tags) in place for Nitro EC2 instances. Unfortunately, no hard assumption can be made about when Fluent Bit (FB) will refresh the log labels (as those refreshes may fail). My line of thinking is: in order to ensure logs always have the expected labels, the user's solution must make the components running inside the Nitro EC2 instance produce logs with those expected critical labels themselves. In other words, the design for adding labels to the logs would be such that the Nitro 'dynamic' tags are not 'critical' to the observability of the solution (those labels are only 'nice to have'). This implies that re-fetches of these dynamic values could be done on the order of minutes, so my proposal for this case would be an interval on the order of minutes as well.
Yes, it will retry for 500 errors, 'not reachable' errors, or anything that's not expected.
I picked the 5s default initially. So, what decision do we want to make? Do we remove the config option for retry_interval_s? Well, this comment is pretty long. If you prefer me to be more concise, just let me know. (After all, those LLMs might be useful for something.)
@mwarzynski I've merged this into master.
Thank you! @PettitWesley Do we want to adjust the documentation of the config parameters? fluent/fluent-bit-docs#1138
@mwarzynski we must wait to merge the docs until this is released. Please watch the release notes or tags for a new one to be cut.
This PR adds handling of errors received while fetching metadata groups in the Filter AWS functionality. It also allows defining the retry interval for such scenarios.
Our current behaviour is to not inject any AWS metadata if fetching just one piece of information fails. It might not be the best approach, as most likely we would like to inject as much as we can. In other words, it would be better to have some additional AWS labels, if we can fetch them, rather than nothing.
For example, on current master, if the instance metadata server misbehaves and our implementation fails to fetch EC2 tags, then output logs won't contain any AWS metadata labels, not even basic ones like the EC2 instance ID. Logs pushed in such a way might be effectively useless, as the system won't be able to attribute them to a particular instance.
Proposed changes in this PR affect the fetching of all metadata groups (and failures to do so). For example, if fetching EC2 tags fails, then other AWS metadata should still be injected into the logs, assuming Filter AWS is able to fetch it without errors. Retries for fetching the metadata groups will be done at an interval not smaller than the specified retry_interval_s.
The PR is related to the following comment: #7129 (comment).
Yes, we handle 404 errors gracefully. A 404 has no effect on other AWS metadata. In case of 404, we just mark tags as fetched and proceed (i.e. it is the same case as when the instance has 0 tags).
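Roughly, the 404 handling described here could look like the following sketch (the helper name and status handling are assumptions for illustration, not the literal PR code; the struct is the illustrative one sketched earlier in this thread):

```c
/* Sketch: classify the IMDS tags response. A 404 means tags are not
 * exposed (or the instance has none), so the group is considered done
 * with zero tags; any other non-200 status is left for a later retry. */
static void handle_tags_response(struct metadata_group_state *tags_group,
                                 int http_status)
{
    if (http_status == 200 || http_status == 404) {
        tags_group->done = 1;    /* fetched, possibly with zero tags       */
    }
    else {
        tags_group->done = 0;    /* e.g. 500 or unreachable: retry later   */
    }
}
```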
Example configuration
Debug log output
Valgrind run
Enter [N/A] in the box if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets; use the ok-package-test label to test for all targets (requires a maintainer to do).

Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.