This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Bridge healthcheck stays in failed state until a successful event is received #1015

Open
amhuber opened this issue Aug 23, 2018 · 3 comments

@amhuber
amhuber commented Aug 23, 2018

The bridge healthcheck logic at https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/bridge/src/healthchecker.js sets isFailing any time a failure event is received, but that state is only ever cleared by a subsequent success event. If a failure occurs and no success event is received for a lengthy period (for example, because no apps have been stopped or started), the healthcheck stays in a failed state until the bridge is restarted.

It would make more sense to reset isFailing after the threshold has expired.
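
One way to read this suggestion is sketched below. It is a minimal illustration, not the actual cf-abacus healthchecker.js: the `createHealthChecker` factory, its `onFailure`/`onSuccess`/`healthy` methods, the injectable `now` clock, and the interpretation of the threshold as "milliseconds after which an unrepeated failure is considered stale" are all assumptions made for the example.

```javascript
// Hypothetical sketch: a failure expires once `threshold` milliseconds pass
// without a new failure, instead of latching until the next success event.
const createHealthChecker = (threshold, now = Date.now) => {
  // Timestamp of the most recent failure; undefined when nothing is outstanding.
  let lastFailure;

  return {
    // Record the time of the failure rather than setting a boolean flag.
    onFailure: () => {
      lastFailure = now();
    },
    // A success still clears the failure immediately.
    onSuccess: () => {
      lastFailure = undefined;
    },
    // Healthy if there is no outstanding failure or the last one has expired.
    healthy: () =>
      lastFailure === undefined || now() - lastFailure >= threshold
  };
};
```

With this shape, a bridge that keeps failing keeps refreshing `lastFailure` and stays unhealthy, while a single failure followed by silence stops being reported once the threshold has elapsed.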

@cf-gitbot
Collaborator

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160008906

The labels on this github issue will be updated when the story is started.

@denicaM
Contributor

denicaM commented Nov 15, 2018

Hello @amhuber,
Looking at the code, I can tell that a failure can occur only when 'usage.failure' is emitted. This happens when a particular event from Cloud Controller is not accepted by Abacus. In that case the bridge keeps retrying the same event, and no other events from Cloud Controller are taken into account. The healthcheck stays in a failed state and turns healthy once the event is successfully accepted. The bridge then reads any further events from Cloud Controller.
The code has been refactored since the issue was created. Can you please describe how you reproduced your scenario?

@amhuber
Author

amhuber commented Nov 15, 2018

The relevant code was just moved in the refactor but it doesn't appear to have changed significantly. As far as I can see, this is what is happening:

This is an issue in environments where we don't have any services in CF. If there is a problem with the CC, a failure event can be triggered in the abacus-services-bridge, but since there are no services there will never be an onSuccess event, so the bridge healthcheck reports as failed until the bridge is restarted. Our only workaround is to not monitor the abacus-services-bridge healthcheck in environments that don't have any services, but it still seems like the healthcheck logic could be improved.
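
The scenario can be modelled in a few lines. This is a simplified sketch of the latching behaviour described in this thread, not the real bridge code: `attachLatchingCheck` is a hypothetical helper, and `'usage.success'` is an assumed event name chosen to mirror the `'usage.failure'` event mentioned above.

```javascript
const EventEmitter = require('events');

// Simplified model: a failure sets the flag, and only a success clears it.
const attachLatchingCheck = (emitter) => {
  let isFailing = false;
  emitter.on('usage.failure', () => {
    isFailing = true;
  });
  // 'usage.success' is an assumed event name for this illustration.
  emitter.on('usage.success', () => {
    isFailing = false;
  });
  return { healthy: () => !isFailing };
};

const bridge = new EventEmitter();
const check = attachLatchingCheck(bridge);

bridge.emit('usage.failure'); // e.g. a transient Cloud Controller problem
console.log(check.healthy()); // false, and it stays false while no events arrive

bridge.emit('usage.success'); // only a success event clears the flag
console.log(check.healthy()); // true
```

In an environment with no services, nothing plays the role of the second `emit`, so the check never leaves the failed state on its own.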
