A conflicting situation can arise in which a device is not reachable on the management IP but is still sending metrics successfully to the server.
Due to the recovery detection feature, this generates additional load on the server: as soon as metrics or checksum requests are received, the system schedules a ping because it believes it will be able to reach the device and set the status back to OK, but that never happens.
If many devices are in this situation, the monitoring queue can grow indefinitely until it consumes all the available memory, at which point the server will crash.
We need to devise a way to spot these situations and set the status to "PROBLEM".
In this case, the ping check should not set the status to CRITICAL even if it cannot ping, unless no metrics were received for more than 10 minutes.
The device recovery mechanism should not be triggered if the status of the device is not critical.
Maybe we could solve this by simply modifying the ping check so that it checks whether the device has recently been sending monitoring metrics before deciding whether to set the status to CRITICAL or PROBLEM.
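A minimal sketch of the decision logic proposed above, to make the rules concrete. It is not the project's actual code: `status_after_ping`, `should_trigger_recovery`, `METRICS_GRACE_PERIOD` and the way the last-metric timestamp is passed in are all hypothetical names chosen for illustration. The real ping check would need to obtain that timestamp from wherever metrics are stored.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical constant: the 10-minute window mentioned in this issue.
METRICS_GRACE_PERIOD = timedelta(minutes=10)


def status_after_ping(
    ping_ok: bool, last_metric_at: Optional[datetime], now: datetime
) -> str:
    """Decide the device status after a ping attempt (illustrative only)."""
    if ping_ok:
        return "OK"
    # Ping failed, but the device sent metrics within the grace period:
    # flag the inconsistency as PROBLEM instead of CRITICAL.
    if last_metric_at is not None and now - last_metric_at <= METRICS_GRACE_PERIOD:
        return "PROBLEM"
    # Ping failed and no metrics for more than 10 minutes (or never).
    return "CRITICAL"


def should_trigger_recovery(current_status: str) -> bool:
    """Schedule a recovery ping only for devices that are CRITICAL."""
    return current_status == "CRITICAL"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    status = status_after_ping(False, now - timedelta(minutes=3), now)
    print(status, should_trigger_recovery(status))
```

With this split, a device that keeps sending metrics but cannot be pinged stays in PROBLEM, so incoming metrics or checksum requests would no longer schedule recovery pings for it and the queue would not keep growing for that reason.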