-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delay ARP record closures for devices that have been down for a while #2913
Conversation
We're about to introduce cleaning operations that do not entail actual record deletion, just record updating. For that reason, it's best to change the "deleter" concept in the code to a more generic "cleaner" concept.
This new command can be used to explicitly close ARP records from netboxes that have been down for too long.
This rule has unintended and bad side-effects. It will cause ARP records to be immediately closed when NAV experience intermittent ICMP packet losses from a router, specifically the kind that are so short-lived that NAV never actually declares the device as down through a new `alerthist` record. `pping` will, however, use `netbox.up` as its persistent internal up/down status, which can cause ARP records to be closed as a result of non-serious flapping.
Lacks tests, and possibly documentation. These will be added. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 5.10.x #2913 +/- ##
==========================================
+ Coverage 60.21% 60.27% +0.06%
==========================================
Files 601 602 +1
Lines 43981 44102 +121
==========================================
+ Hits 26481 26581 +100
- Misses 17500 17521 +21 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this, only one question:
Why did you chose the variable name expiry_type
instead of table
? I agree that table
isn't ideal, but don't understand the new name.
And please have another look at the changelog fragment. I can't make a suggestion for improvement because I do not get what it's supposed to say 😁
Because naming things is hard, and
Gargh... comes down to usage of backticks as markdown markup inside |
Test coverage mostly achieved by 410fd80. News fragment fixed and existing docs about the |
Fair, I cannot think of a better name myself. Then it might be nice to have a comment next to the definition in the base class to clarify what it should be used for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, only my comment about documenting expiry_type
a bit better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya seems fine, definitely was easier to go through commit by commit with all the renaming cluttering things and making it difficult to discern the real changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now
This adds a default navclean job every five minutes to close ARP records of devices that have been down for more than 30 minutes. Needs to be a separate job line, since it doesn't use the built-in default of 6 month expiry interval.
Use the expiry type name without necessarily referring to it as a table. As discussed in code review.
As mentioned in code review, the double downsince name is slightly confusing, so this renames the intermediate lookup table to something hopefully more understandable.
fa115b4
to
d12fcbc
Compare
Instead of immediately closing the ARP records read from a netbox as soon as its pping up/down status flaps, we should wait for the netbox to stay down for longer. At this point, we cannot accomplish this usefully as a PostgreSQL rule or trigger, so this suggests removing the old database rule and replacing it with a new argument to the already-existing
navclean
command. In this way, the expiry interval is even configurable for end users (or can be disabled entirely).The terminology used in the
navclean
code was heavily centered around the concept on deletion, so this PR also introduces a renaming of this concept to just the generic "cleaning", since the new cleaner type doesn't actually delete records, it just sets a timestamp in them.Fixes #2910