Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK to provide canonical retriable given error #60

Open
agriffin208 opened this issue Nov 13, 2020 · 3 comments
Open

SDK to provide canonical retriable given error #60

agriffin208 opened this issue Nov 13, 2020 · 3 comments

Comments

@agriffin208
Copy link

agriffin208 commented Nov 13, 2020

See influxdata/telegraf#8404

The wavefront output plugin for telegraf was continuously treating all errors as retryable.

Recently added in the PR mentioned above was functionality for the wavefront output to not retry for invalid (null) metric names.

However, seeking the wavefront-sdk-go to provide canonical set of which errors are retryable and which are not.

@agriffin208 agriffin208 changed the title SDK to provide cannonical SDK to provide canonical retriable given error Nov 13, 2020
@LukeWinikates
Copy link
Contributor

I'd like to close this issue out in the near future.

Questions on my mind:

  1. Because the wavefront sender buffers errors that it thinks are retryable (always?), maybe the SDK only bubbles up non-retryable errors at this point. Put another way, I think it would be worthwhile for us to take a deeper look at the different error cases from SendMetric and see what we conclusions we can draw about how this should work.

I guess it was just the one question 🤣

@LukeWinikates
Copy link
Contributor

LukeWinikates commented Sep 1, 2023

Quick observations:

SendMetric can return an error if:

  • the metric name is empty (via metric.Line)
  • the channel buffer is full (via trySendWith and *RealLineHandler.HandleLine)

Case 1: empty metric names

✅ We know that @agriffin208's PR to telegraf is designed to exclude the first case

Case 2: Buffer errors

With the second case, we'll have to dig deeper.

My intuition is that if the WF buffer is full, then telegraf starts filling up its buffer instead. I don't think the SDK expects a second layer of buffering to happen above it, but since Telegraf is doing this, that seems fine - we could pull @agriffin208's solution into our package and try to give it a reasonable name.

We may not ever add additional error scenarios to SendMetric, but we should have an implementation that won't lead to odd behavior in telegraf if we do. Maybe this means that our logic should make buffering errors retryable, but future errors will be non-retryable by default.

Case 3: Flush errors

note: http and auth errors happen in Flush, and in Telegraf usage:

  • Flush by default happens quietly in the background
  • but some proactive flushing can happen in Write, to try to unblock the SDK's buffer
  • and a blocking flush call happens if ImmediateFlush = true

3a: partially flushed batches

It looks to me like in these cases, Telegraf might re-send in full a partially-sent batch:

  • Write is called with a batch of size N
  • Write loops over batch, calling SendMetric, and fails on metric A
  • if the error is retryable (a buffering error) , Write calls sender.Flush
  • if Flush succeeds, Write continues with the rest of the batch.
  • but if Flush returns an error, Telegraf retries the entire batch - which may be large, and which has probably been partially sent already

3b: non-retryable auth errors treated as retryable

Some of these are non-retryable auth errors. If Telegraf re-attempts these, that could lead to an infinite retry loop.

3c: potentially retriable, non-auth flush errors from *RealLineHandler.report - but which the SDK is already retrying

Flush errors can also happen for network timeouts.
I think this leads to double-buffering in Telegraf. report will make a best-effort attempt to re-queue the metric, but will move on if the buffer is full. However, the returned error will probably make Telegraf retry the entire batch, resulting in double-reporting of metrics.

@LukeWinikates
Copy link
Contributor

@astp-okta I saw that you made some changes in this area of the Telegraf Wavefront output plugin recently. No pressure, but if you have thoughts on the above, I'd love to hear your perspective.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants