Some routes are not activated when they are advertised via OSPF point-to-multipoint #14709

Closed
adrianomarto opened this issue Nov 1, 2023 · 6 comments
Labels
triage Needs further investigation

Comments

@adrianomarto
Contributor

adrianomarto commented Nov 1, 2023

Not all routes are activated when they are received via OSPF point-to-multipoint. I created a topotest case that easily reproduces the error.

I noticed that:

  • The issue does not happen if we use the directive "ip ospf prefix-suppression".
  • The issue only happens if the IP address of the OSPF-active (point-to-multipoint) interface is higher than the IP address of the OSPF-passive interface. It seems to be related to the order in which the routes/prefixes are advertised.

Checklist:

  • [ x ] Did you check if this is a duplicate issue?
    I couldn't find an issue describing the same problem.
  • [ x ] Did you test it on the latest FRRouting/frr master branch?
    It is fixed in the master branch, but not in dev/9.1.

To Reproduce
The pull request implements a test case to reproduce the issue:
#14708

Expected behaviour
All shared networks must be reachable from all routers.

Screenshots
N/A

Versions

  • OS Version: Fedora 38 and Debian 12
  • Kernel: 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64
  • FRR Version: dev/9.1 and 9.0.1

Additional context
This issue is fixed in the master branch, but I don't know which change has fixed it.

@ton31337
Member

To be sure, which versions are affected by this bug? Because you said master is OK, but others including 9.1 are bad?

@adrianomarto
Contributor Author

Yes, that is correct. The bug affects version 9.1 but it is OK in the master branch. It seems that the commit 4a96ff2 broke it, but I still could not figure out the commit that fixed it after that.

@adrianomarto
Contributor Author

These are the versions I tested:

  • master: good
  • 9.1: bad
  • 9.0.1: bad
  • 8.5.3: good
  • 8.5.2: good
  • 8.5.1: good
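
The results above show two transitions: a regression somewhere between 8.5.3 and 9.0.1, and a fix somewhere between 9.1 and master. One way to narrow down the fixing commit is a bisect with the usual roles swapped, so that the search looks for the first commit where the bug is gone rather than the first where it appears. The following is only a minimal sketch of that idea in Python; the commit names and the `is_fixed` check are hypothetical placeholders, and it assumes the history is linear and that the fix, once present, stays present.

```python
def first_fixed(commits, is_fixed):
    """Return the first commit for which is_fixed(commit) is True.

    Assumes `commits` is ordered oldest to newest, that commits[0] is
    known to be broken, and that commits[-1] is known to be fixed.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is broken, commits[hi] is fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix landed at mid or earlier
        else:
            lo = mid  # the fix landed after mid
    return commits[hi]


# Hypothetical history: the fix lands in commit "c5".
commits = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
fixed_commit = first_fixed(commits, lambda c: commits.index(c) >= 5)
print(fixed_commit)
```

In a real run, `is_fixed` would build the given revision and run the reproducing topotest.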

@adrianomarto
Contributor Author

This is the commit that fixed the issue in the master branch. Perhaps we could backport it.

a272a2b is the first bad commit
commit a272a2b
Author: Donald Sharp [email protected]
Date: Thu Oct 19 16:38:12 2023 -0400

zebra: Allow longer prefix matches for nexthops

Zebra currently does a shortest prefix match for
resolving nexthops for a prefix.  This is typically
an ok thing to do but fails in several specific scenarios.
If a nexthop matches to a route that is not usable, nexthop
resolution just gives up and refuses to use that particular
route.  For example if zebra currently has a covering prefix
say a 10.0.0.0/8.  And about the same time it receives a
10.1.0.0/16 ( a more specific than the /8 ) and another
route A, who's nexthop is 10.1.1.1.  Imagine the 10.1.0.0/16
is processed enough to know we want to install it and the
prefix is sent to the dataplane for installation( it is queued )
and then route A is processed, nexthop resolution will fail
and the route A will be left in limbo as uninstallable.

Let's modify the nexthop resolution code in zebra such that
if a nexthop's most specific match is unusable, continue looking
up the table till we get to the 0.0.0.0/0 route( if it's even
installed ).  If we find a usable route for the nexthop accept
it and use it.

The bgp_default_originate topology test is frequently failing
with this exact problem:

B>* 0.0.0.0/0 [200/0] via 192.168.1.1, r2-r1-eth0, weight 1, 00:00:21
B   1.0.1.17/32 [200/0] via 192.168.0.1 inactive, weight 1, 00:00:21
B>* 1.0.2.17/32 [200/0] via 192.168.1.1, r2-r1-eth0, weight 1, 00:00:21
C>* 1.0.3.17/32 is directly connected, lo, 00:02:00
B>* 1.0.5.17/32 [20/0] via 192.168.2.2, r2-r3-eth1, weight 1, 00:00:32
B>* 192.168.0.0/24 [200/0] via 192.168.1.1, r2-r1-eth0, weight 1, 00:00:21
B   192.168.1.0/24 [200/0] via 192.168.1.1 inactive, weight 1, 00:00:21
C>* 192.168.1.0/24 is directly connected, r2-r1-eth0, 00:02:00
C>* 192.168.2.0/24 is directly connected, r2-r3-eth1, 00:02:00
B>* 192.168.3.0/24 [20/0] via 192.168.2.2, r2-r3-eth1, weight 1, 00:00:32
B   198.51.1.1/32 [200/0] via 192.168.0.1 inactive, weight 1, 00:00:21
B>* 198.51.1.2/32 [20/0] via 192.168.2.2, r2-r3-eth1, weight 1, 00:00:32

Notice that the 1.0.1.17/32 route is inactive but the nexthop
192.168.0.1 is covered by both the 192.168.0.0/24 prefix( shortest match )
*and* the 0.0.0.0/0 route ( longest match ).  When looking at the logs
the 1.0.1.17/32 route was not being installed because the matching
route was not in a usable state, which is because the 192.168.0.0/24
route was in the process of being installed.

Signed-off-by: Donald Sharp <[email protected]>
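
The behaviour change described in the commit message can be illustrated with a small model. This is a sketch only, not zebra's actual code: the `Route` class, the `usable` flag, and both resolver functions are assumptions made up for illustration. It models a nexthop 192.168.0.1 that is covered by both a 192.168.0.0/24 route (the most specific match, still queued for installation) and a usable 0.0.0.0/0 route.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Route:
    prefix: ipaddress.IPv4Network
    usable: bool  # False while the route is still queued for installation


def covering_routes(table, nexthop):
    """Return all routes that cover `nexthop`, most specific first."""
    addr = ipaddress.ip_address(nexthop)
    matches = [r for r in table if addr in r.prefix]
    return sorted(matches, key=lambda r: r.prefix.prefixlen, reverse=True)


def resolve_strict(table, nexthop):
    """Model of the old behaviour: give up if the most specific match is unusable."""
    matches = covering_routes(table, nexthop)
    if matches and matches[0].usable:
        return matches[0]
    return None


def resolve_with_fallback(table, nexthop):
    """Model of the new behaviour: keep walking towards 0.0.0.0/0 until a usable route is found."""
    for route in covering_routes(table, nexthop):
        if route.usable:
            return route
    return None


table = [
    Route(ipaddress.ip_network("0.0.0.0/0"), usable=True),
    Route(ipaddress.ip_network("192.168.0.0/24"), usable=False),  # queued, not yet installed
]

print(resolve_strict(table, "192.168.0.1"))         # no resolution: the /24 is not usable yet
print(resolve_with_fallback(table, "192.168.0.1"))  # falls back to the default route
```

In this model the outcome of the strict resolver depends on whether the covering route has finished installing when the dependent route is processed, which is consistent with the reporter's observation that the problem seems tied to the order in which prefixes are advertised.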


github-actions bot commented Aug 6, 2024

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

@frrbot

frrbot bot commented Aug 6, 2024

This issue will be automatically closed in the specified period unless there is further activity.

@frrbot frrbot bot closed this as completed Aug 13, 2024
@frrbot frrbot bot removed the autoclose label Aug 13, 2024