Delay ARP record closures for devices that have been down for a while #2913

lunkwill42 · 2024-05-14T13:29:57Z

Instead of immediately closing the ARP records read from a netbox as soon as its pping up/down status flaps, we should wait for the netbox to stay down for longer. At this point, we cannot accomplish this usefully as a PostgreSQL rule or trigger, so this suggests removing the old database rule and replacing it with a new argument to the already-existing navclean command. In this way, the expiry interval is even configurable for end users (or can be disabled entirely).

The terminology used in the navclean code was heavily centered around the concept on deletion, so this PR also introduces a renaming of this concept to just the generic "cleaning", since the new cleaner type doesn't actually delete records, it just sets a timestamp in them.

Fixes #2910

We're about to introduce cleaning operations that do not entail actual record deletion, just record updating. For that reason, it's best to change the "deleter" concept in the code to a more generic "cleaner" concept.

This new command can be used to explicitly close ARP records from netboxes that have been down for too long.

This rule has unintended and bad side-effects. It will cause ARP records to be immediately closed when NAV experience intermittent ICMP packet losses from a router, specifically the kind that are so short-lived that NAV never actually declares the device as down through a new `alerthist` record. `pping` will, however, use `netbox.up` as its persistent internal up/down status, which can cause ARP records to be closed as a result of non-serious flapping.

lunkwill42 · 2024-05-14T13:30:45Z

Lacks tests, and possibly documentation. These will be added.

codecov · 2024-05-14T13:33:28Z

Codecov Report

Attention: Patch coverage is 97.67442% with 1 line in your changes missing coverage. Please review.

Project coverage is 60.27%. Comparing base (90ead68) to head (d12fcbc).
Report is 76 commits behind head on 5.10.x.

Files	Patch %	Lines
python/nav/bin/navclean.py	97.67%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           5.10.x    #2913      +/-   ##
==========================================
+ Coverage   60.21%   60.27%   +0.06%     
==========================================
  Files         601      602       +1     
  Lines       43981    44102     +121     
==========================================
+ Hits        26481    26581     +100     
- Misses      17500    17521      +21

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-05-14T13:34:42Z

Test results

    12 files     12 suites 11m 48s ⏱️
3 326 tests 3 326 ✔️ 0 💤 0 ❌
9 453 runs 9 453 ✔️ 0 💤 0 ❌

Results for commit d12fcbc.

♻️ This comment has been updated with latest results.

johannaengland

I'm happy with this, only one question:
Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

And please have another look at the changelog fragment. I can't make a suggestion for improvement because I do not get what it's supposed to say 😁

lunkwill42 · 2024-05-15T13:35:31Z

I'm happy with this, only one question: Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

Because naming things is hard, and table obviously became useless as a name once it no longer necessarily contains an actual table name. I might have preferred type as a nicely generic name, but I don't like shadowing Python built-ins. Any better suggestions?

And please have another look at the changelog fragment. I can't make a suggestion for improvement because I do not get what it's supposed to say 😁

Gargh... comes down to usage of backticks as markdown markup inside "" quotes in my shell. Some command got executed by the shell, produced a weird error and left that part of the news fragment empty - and I didn't even notice it (too busy trying to figure out why I got the weird error message!) 😃

lunkwill42 · 2024-05-15T13:46:13Z

Test coverage mostly achieved by 410fd80. News fragment fixed and existing docs about the dbclean cron job were updated.

johannaengland · 2024-05-16T06:28:41Z

I'm happy with this, only one question: Why did you chose the variable name expiry_type instead of table? I agree that table isn't ideal, but don't understand the new name.

Because naming things is hard, and table obviously became useless as a name once it no longer necessarily contains an actual table name. I might have preferred type as a nicely generic name, but I don't like shadowing Python built-ins. Any better suggestions?

Fair, I cannot think of a better name myself. Then it might be nice to have a comment next to the definition in the base class to clarify what it should be used for.

johannaengland

Looks good, only my comment about documenting expiry_type a bit better.

stveit

ya seems fine, definitely was easier to go through commit by commit with all the renaming cluttering things and making it difficult to discern the real changes

python/nav/bin/navclean.py

The base branch was changed.

stveit

Looks good to me now

This adds a default navclean job every five minutes to close ARP records of devices that have been down for more than 30 minutes. Needs to be a separate job line, since it doesn't use the built-in default of 6 month expiry interval.

Use the expiry type name without necessarily referring to it as a table. As discussed in code review.

As mentioned in code review, the double downsince name is slightly confusing, so this renames the intermediate lookup table to something hopefully more understandable.

lunkwill42 added 5 commits May 14, 2024 14:47

Remove obsolete import

a2c3568

Improve command line help and docstrings

7208968

Clean up navclean terminology!

2fe8a12

We're about to introduce cleaning operations that do not entail actual record deletion, just record updating. For that reason, it's best to change the "deleter" concept in the code to a more generic "cleaner" concept.

Introduce ARP record closer to navclean

3f4c7ad

This new command can be used to explicitly close ARP records from netboxes that have been down for too long.

lunkwill42 self-assigned this May 14, 2024

lunkwill42 requested review from hmpf, stveit and johannaengland May 14, 2024 13:30

lunkwill42 marked this pull request as ready for review May 15, 2024 07:24

lunkwill42 added the bug label May 15, 2024

johannaengland mentioned this pull request May 15, 2024

Remove Dict, List, Tuple and Type imports from Typing #2909

Merged

johannaengland reviewed May 15, 2024

View reviewed changes

johannaengland previously approved these changes May 16, 2024

View reviewed changes

stveit reviewed May 16, 2024

View reviewed changes

python/nav/bin/navclean.py Outdated Show resolved Hide resolved

python/nav/bin/navclean.py Outdated Show resolved Hide resolved

lunkwill42 changed the base branch from master to 5.10.x May 16, 2024 14:07

stveit approved these changes May 21, 2024

View reviewed changes

lunkwill42 added 6 commits May 21, 2024 15:21

Add news fragment for 2913

3b8fa48

Close netbox ARP records after 30min downtime

1b50d1a

This adds a default navclean job every five minutes to close ARP records of devices that have been down for more than 30 minutes. Needs to be a separate job line, since it doesn't use the built-in default of 6 month expiry interval.

Test run the --close-arp subcommand

8c8f2a3

Document the addition to dbclean job

40ef9e8

Stop referring to tables

d4ea5c1

Use the expiry type name without necessarily referring to it as a table. As discussed in code review.

Clarify intermediate table name

d12fcbc

As mentioned in code review, the double downsince name is slightly confusing, so this renames the intermediate lookup table to something hopefully more understandable.

lunkwill42 force-pushed the bugfix/arp-closure-on-netbox-packet-loss branch from fa115b4 to d12fcbc Compare May 21, 2024 13:21

johannaengland approved these changes May 22, 2024

View reviewed changes

lunkwill42 merged commit 0f81276 into 5.10.x May 22, 2024
19 checks passed

lunkwill42 deleted the bugfix/arp-closure-on-netbox-packet-loss branch May 22, 2024 08:44

This was referenced May 30, 2024

[BUG] All ARP records associated with a router are immediately closed when there is a short ICMP echo packet loss #2910

Closed

Reinstate database rule netbox_close_arp and delete netbox_status_close_arp instead #2928

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delay ARP record closures for devices that have been down for a while #2913

Delay ARP record closures for devices that have been down for a while #2913

lunkwill42 commented May 14, 2024

lunkwill42 commented May 14, 2024 •

edited

Loading

codecov bot commented May 14, 2024 •

edited

Loading

github-actions bot commented May 14, 2024 •

edited

Loading

johannaengland left a comment

lunkwill42 commented May 15, 2024

lunkwill42 commented May 15, 2024

johannaengland commented May 16, 2024

johannaengland left a comment

stveit left a comment

stveit left a comment

Delay ARP record closures for devices that have been down for a while #2913

Delay ARP record closures for devices that have been down for a while #2913

Conversation

lunkwill42 commented May 14, 2024

lunkwill42 commented May 14, 2024 • edited Loading

codecov bot commented May 14, 2024 • edited Loading

Codecov Report

github-actions bot commented May 14, 2024 • edited Loading

Test results

johannaengland left a comment

Choose a reason for hiding this comment

lunkwill42 commented May 15, 2024

lunkwill42 commented May 15, 2024

johannaengland commented May 16, 2024

johannaengland left a comment

Choose a reason for hiding this comment

stveit left a comment

Choose a reason for hiding this comment

stveit left a comment

Choose a reason for hiding this comment

lunkwill42 commented May 14, 2024 •

edited

Loading

codecov bot commented May 14, 2024 •

edited

Loading

github-actions bot commented May 14, 2024 •

edited

Loading