Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance PagerDuty notifier with retry configuration #241

Open
StephenWeber opened this issue Nov 27, 2018 · 6 comments
Open

Enhance PagerDuty notifier with retry configuration #241

StephenWeber opened this issue Nov 27, 2018 · 6 comments

Comments

@StephenWeber
Copy link
Contributor

Current behavior: when there is an error making requests of the PagerDuty API, the client by default does not retry the request. consul-alerts logs the error.

Preferred behavior: The underlying library providing PagerDuty support has logic built-in to retry requests when there is an error. By default, it does not perform retries, but through configuration values it can be set to do so. We would like to provide a way to optionally configure the client with these settings.

Primary benefit for consul-alerts users: PagerDuty returns an error when they rate-limit your service key. Their documentation says that the correct thing to do is to retry the request after a period of time.

The client fields in question are:

client.MaxRetry = 5 // set max retries to 5 before failing, Defaults to 0.
client.RetryBaseInterval = 5 // set first retry to 5s. Defaults to 10s.

We suggest two optional fields in the PagerDuty notifier configs that would look like this in Consul:

consul-alerts/config/notifiers/pagerduty/max-retry
consul-alerts/config/notifiers/pagerduty/retry-base-interval

Since existing customer configurations don't have those keys, the code's default should be to retain the existing behavior. That means that if those keys are not present, that the PagerDuty client should not retry on errors.

The basic components of our suggested change are:

  1. adding the stated keys as optional parts of the PagerDutyNotifier struct and its Unmarshal-er
  2. setting either of those values on the client if their key is present
  3. adding a test for PagerDuty configuration with and without those values present.

I have some example code that I could also paste here to illustrate these steps, or could supply as a PR.

@fusiondog
Copy link
Collaborator

Sounds like a good improvement. I would appreciate the PR and seems like it should be straight forward to review and approve.

@StephenWeber
Copy link
Contributor Author

StephenWeber commented Nov 29, 2018

@fusiondog comprehensive PR attached -- in addition to unit tests, I've also used a local instance to try out variations on set/unset values. I'm up for any adjustments to better match local source or config conventions.

@nh2
Copy link
Contributor

nh2 commented Dec 1, 2018

Is it possible to make it retry infinitely?

@StephenWeber
Copy link
Contributor Author

@nh2 the way the underlying gopherduty client is written, it cannot be retried indefinitely. However, you can set a high # for max-retries and a reasonable retry-base-interval that will exponentially back off.

For instance, max-retry=15 and retry-base-interval=10 will start with a 10-second delay for the first retry, and will end up with 3.8 days between the 14th and 15th retry.

My own experiments show that so long as a retry happens after 60 seconds, it's likely to succeed. It's hard to determine what an appropriate amount of retries would be, though the current rate limits are about 150 requests-per-minute - so as long as you will retry past number of requests / 150 minutes, you'll likely be successful.

@StephenWeber
Copy link
Contributor Author

@fusiondog @darkcrux there is also a PR against the gopherduty library that will allow 403 (Rate limit response) to be treated as a retry-able error: darkcrux/gopherduty#3

@nh2
Copy link
Contributor

nh2 commented Dec 12, 2018

@StephenWeber Thanks for the detail, that is good to know; infinite retry would still be best for simplicity so that we don't have to rely on specific timings or logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants