Bridge healthcheck stays in failed state until a successful event is received #1015

amhuber · 2018-08-23T23:50:28Z

The bridge healthcheck logic at https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/bridge/src/healthchecker.js sets isFailing any time a failure event is received, but that state is only ever cleared by a subsequent success event. If a failure has occurred and no success event is received for a lengthy period (for example, no apps have been stopped or started), the healthcheck stays in a failed state until the bridge is restarted.

It would make more sense to reset isFailing after the threshold has expired.
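A minimal sketch of one way to read this suggestion, not the cf-abacus implementation: a failure only counts against health while it is younger than the threshold. The names `createHealthMonitor`, `onFailure`, `onSuccess`, `healthy` and the injectable `now` clock are hypothetical and chosen for illustration only.

```javascript
'use strict';

// Hypothetical sketch, NOT the actual cf-abacus code.
// threshold: time in milliseconds after which a failure is treated as stale.
// now: injectable clock (assumption, added so the logic can be exercised
//      without waiting for real time to pass).
const createHealthMonitor = (threshold, now = Date.now) => {
  // Timestamp of the most recent failure, or undefined if there is none.
  let lastFailure;

  return {
    // Record the time of the failure instead of setting a sticky boolean.
    onFailure: () => {
      lastFailure = now();
    },
    // A success clears the failure immediately, as before.
    onSuccess: () => {
      lastFailure = undefined;
    },
    // Healthy if no failure is recorded, or the last failure has expired.
    healthy: () =>
      lastFailure === undefined || now() - lastFailure > threshold
  };
};

module.exports = createHealthMonitor;
```

With this shape, a bridge that sees one failure and then no further events reports unhealthy only for the duration of the threshold, instead of until it is restarted.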

cf-gitbot · 2018-08-23T23:50:29Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160008906

The labels on this github issue will be updated when the story is started.

denicaM · 2018-11-15T08:18:54Z

Hello @amhuber,
Looking at the code, I can tell that a failure can occur only when 'usage.failure' is emitted. This happens when a particular event from Cloud Controller is not accepted by Abacus. In this case the bridge keeps retrying that same event, and no other events from Cloud Controller are taken into account. The healthcheck stays in a failed state and turns healthy again once the event is successfully accepted. The bridge then reads the remaining events (if any) from Cloud Controller.
The code has been refactored since this issue was created. Can you please describe how you reproduced your scenario?

amhuber · 2018-11-15T16:07:09Z

The relevant code was just moved in the refactor but it doesn't appear to have changed significantly. As far as I can see, this is what is happening:

In https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L10-L17 any failure will set isFailing to true.
The only way to change isFailing to false is for the onSuccess event to fire (https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L19-L22).
The health check will report as failed if isFailing is true (https://github.com/cloudfoundry-incubator/cf-abacus/blob/master/lib/utils/healthmonitor/src/index.js#L29).
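The three points above can be modelled roughly as follows. This is a simplified reconstruction from the description in this thread, not a copy of the linked file: the 'usage.failure' event name comes from the earlier comment, while 'usage.success' and `createStickyMonitor` are assumed names used only for illustration.

```javascript
'use strict';

const EventEmitter = require('events');

// Simplified model of the behaviour described above (an assumption,
// not the real healthmonitor source).
const createStickyMonitor = (emitter) => {
  let isFailing = false;

  // Any failure event sets the flag.
  emitter.on('usage.failure', () => {
    isFailing = true;
  });

  // Only a success event clears it. 'usage.success' is a hypothetical name.
  emitter.on('usage.success', () => {
    isFailing = false;
  });

  // The health check reports failed whenever the flag is set,
  // no matter how long ago the failure happened.
  return { healthy: () => !isFailing };
};

module.exports = { EventEmitter, createStickyMonitor };
```

In this model, a single 'usage.failure' followed by no further events leaves `healthy()` returning false indefinitely.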

Where this is an issue is in environments where we don't have any services in CF. If there is an issue with the CC then a failure event can be triggered in the abacus-services-bridge, but since there are no services there will never be an onSuccess event, so the bridge healthcheck reports as failed forever until the bridge is restarted. The only resolution on our end is to just not monitor the abacus-services-bridge healthcheck in environments that don't have any services, but it still seems like the logic could be improved in the healthcheck.

cf-gitbot added the unscheduled label Aug 23, 2018

hsiliev added the enhancement label Aug 24, 2018
