-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bgp nexthop group optimisations #15488
Draft
pguibert6WIND
wants to merge
69
commits into
FRRouting:master
Choose a base branch
from
pguibert6WIND:bgp_nhg_optimisations
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Bgp nexthop group optimisations #15488
pguibert6WIND
wants to merge
69
commits into
FRRouting:master
from
pguibert6WIND:bgp_nhg_optimisations
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
pguibert6WIND
force-pushed
the
bgp_nhg_optimisations
branch
from
March 5, 2024 15:58
9c19460
to
40c4f30
Compare
This was referenced Mar 5, 2024
pguibert6WIND
force-pushed
the
bgp_nhg_optimisations
branch
from
March 5, 2024 17:02
40c4f30
to
241ecc1
Compare
pguibert6WIND
force-pushed
the
bgp_nhg_optimisations
branch
5 times, most recently
from
March 7, 2024 15:25
06e873e
to
ce1577d
Compare
Nexthop tracking changes should be handled for each BGP nexthop-group, prior to evaluate each BGP path. Add a hook on nexthop tracking to parse the list of available bgp_nhg_cache entries matching the changed nexthop tracking context. The nexthop-group is either detached and removed from the BGP paths; or updated to take into account the new resolved path. Signed-off-by: Philippe Guibert <[email protected]>
Before using nexthop-groups, ZEBRA could cancel route installation: - if the prefix was not a host prefix and was resolving over itself - if the prefix was resolving over the default route whereas the resolution to the default route was unconfigured on ZEBRA. When using BGP nexthop-grups, at nexthop-group installation, ZEBRA does not know the prefix, and can not do this control. Add the control at BGP level to compare BGP prefixes and the nexthop resolution: the control will eventually install the route without using the nexthop-group. Signed-off-by: Philippe Guibert <[email protected]>
This topology uses an IGP network with BGP between r1, and r5/r6 on the one hand, and two CEs at the edge side, on the other hand. The bgp nexthop-group mode is used, meaning that BGP gathers prefixes which use a shared nexthop together, and use the NHG_ADD zapi messages to inform ZEBRA. Consequently, ZEBRA will pick up an identifier (either the one proposed by BGP or an other one chosen by ZEBRA), and will stick that identifier to the BGP nexthop. The test ensures that that identifier is correctly used in the ZEBRA RIB, and the Linux FIB, even when the IGP resolution changes (interfaces going down, or remote IGP label value change). Signed-off-by: Philippe Guibert <[email protected]>
Some basic path information is used by the 'show bgp nexthops detail' command. This information only deals with path information, and could be used by other show commands to list bgp_paths. Move the specific code from bgp_nexthop.c to bgpd.c - bgp_show_bgp_path_info_flags() is renamed to bgp_path_info_show_flags() - add a new function bgp_path_info_display() to display this information The AS of the bgp path is displayed based on the BGP instance, where the path is. Signed-off-by: Philippe Guibert <[email protected]>
Add a vty function that displays the nexthop groups used by BGP. > r1# show bgp nexthop-group detail > ID: 75757571, #paths 1 > Flags: 0x0000 > State: 0x0001 (Installed) > via 172.31.10.7 (vrf vrf1) inactive, weight 1 > Paths: > 1/1 192.0.2.7/32 VRF vrf1 flags 0x418 > ID: 75757572, #paths 1 > Flags: 0x0000 > State: 0x0001 (Installed) > via 172.31.11.8 (vrf vrf2) inactive, weight 1 > Paths: > 1/1 192.0.2.7/32 VRF vrf2 flags 0x418 > ID: 75757575, #paths 1 > Flags: 0x0003 (allowRecursion, internalBgp) > State: 0x0001 (Installed) > via 192.0.2.6 (vrf default) inactive, weight 1 > Paths: > 1/1 192.0.2.9/32 VRF default flags 0x418 > ID: 75757576, #paths 1 > Flags: 0x0003 (allowRecursion, internalBgp) > State: 0x0001 (Installed) > via 192.0.2.6 (vrf default) inactive, label 6000, weight 1 > Paths: > 1/1 192.0.2.9/32 VRF vrf1 flags 0x4018 Signed-off-by: Philippe Guibert <[email protected]>
When BGP addpath is used, BGP prefixes may use one or multiple nexthops. To implement multipath with BGP nexthop-groups, each nexthop should be handled separately for each single path, and at the same time, grouped for the multiple paths. Because of this requirement, the choice of having a two level nexthop group hierarchy with two kind of nexthop-groups is done: - child nexthop-group contain the original bgp nexthop-group definition. Each child can have only one nexthop. - parent nexthop-group contain the child nexthop-group identifiers A nhg_childs tree implements the connection of the parent to one or multiple childs (depending if it is a single or multiple path) A nhg_groups tree implements the connection of the child to the parents using that child (because one nexthop may be used by single or multiple paths). Signed-off-by: Philippe Guibert <[email protected]>
The bgp_nhg_cache entry needs to handle an hierarchy to either reference parent groups or child nexthops. Move the original nexthop information into a sub structure and use the new 'nexthops' instance into the original bgp_nhg_cache entry. Signed-off-by: Philippe Guibert <[email protected]>
The list of paths will be available for any kind of bgp nhg. To avoid code redundancy, a rework is done. Move the code of that parsing from the show_bgp_nhg_id_helper(), to a separte show_bgp_nhg_id_helper_detail(). Signed-off-by: Philippe Guibert <[email protected]>
Introducing a nexthop-group hierarchy requires to handle child and parent nexthop-groups in different ways. Reusing a same child nexthop-group by using hash entries based on nexthop attributes makes sense, and guarantees that each nexthop is unique. For instance, at installation time of an incoming BGP update, BGP will figure out an existing nexthop exists. For parent nexthop-groups, using key based on the number of childs, and the id of each child is not wishable: this implies that at failover events, the key of the parent nexthop-group would be recomputed, and may conflict with an already present parent nexthop-group. This would result in moving bgp paths to a new nexthop-group, whereas the goal of this implementation is to minimise the number of changes at failover events. This is why it is proposed to have a separate hash list for parent nexthop-groups, which will be ordered by id: each id is unique and will never conflict with a parent that shares the same childs. Signed-off-by: Philippe Guibert <[email protected]>
Extend the current BGP nexthop-group API to support an hierarchy between child nexthop-groups and parent nexthop-groups. - The original bgp_nhg_cache hash entry is used to store both kind of nexthop-groups. - Extend the debug API to dump parent nexthop groups. - Extend the BGP API to send NHG_CHILD_ADD command to zebra. - Extend the BGP NHG add and delete API by sorting out the NHG in the appropriate hash-list - Extend the BGP NHG show command to reach both hash lists. Signed-off-by: Philippe Guibert <[email protected]>
Add a new 'bgp_nhg_nexthop' pointer in each bgp path. This pointer is the child nexthop-group, is not used in this commit. Signed-off-by: Philippe Guibert <[email protected]>
Unlock the possibility to use nexthop groups for ECMP paths. For each path to install, two nexthop groups will be used: parent (bgp_nhg attribute) and child (bgp_nhg_nexthop attribute). Handle the ZEBRA installation order at protocol level: the nexthop first, the parent after. Unlock the ability to handle multiple paths with NHGs. At nexthop tracking event, when a nexthop is invalid, the child nexthop-groups are detached, and the parent nexthop-group is updated. Signed-off-by: Philippe Guibert <[email protected]>
When BGP peering is considered down due to BFD event, or when the operator performs a clear command on a given peer, the parent nexthop groups are never kept. Actually, upon each bgp path removal, the peer routing table paths are flushed, and the two bgp nexthop-groups of each path are detached, whereas only the child nexthop-group of each path of the peer should be detached. Prevent from detaching the parent nexthop-group by introducing a two step procedure: - for each peer failover event, the bgp tables are parsed: for each path of the peer, only the child nexthop-group is detached. - when all tables are parsed, a check is done on the unused nexthops. The same parent will eventually be updated with the updated child group number. Signed-off-by: Philippe Guibert <[email protected]>
The 'show bgp nexthop-group' function is extended to display the parent and child information. Display if a nexthop-group is a child or a parent. Display the nexthop-group interconnections between child list and parent list. > r1# show bgp nexthop-group detail > ID: 75757571, #paths 1 > Flags: 0x0003 (allowRecursion, internalBgp) > State: 0x0001 (Installed) > via 192.0.2.5 (vrf default) inactive > parent group 75757574 <--- added > Paths: <--- added > 1/1 192.0.2.9/32 VRF default flags 0xc10 > ID: 75757573, #paths 1 > Flags: 0x0003 (allowRecursion, internalBgp) > State: 0x0001 (Installed) > via 192.0.2.6 (vrf default) inactive > parent group 75757574 > ID: 75757574, #paths 2 > Flags: 0x000b (allowRecursion, internalBgp, TypeParent) > State: 0x0001 (Installed) > child list count 2 <--- added > childi(s) 75757571 75757573 <--- added > r1# show bgp nexthop-group json > [..] > { > "nhgId":75757575, > [..] > "parentList":[ <--- added > { > "Id":75757576 > } > ] > }, > { > "nhgId":75757576, > [..] > "pathCount":3, > "childListCount":3, <--- added > "childList":[ <--- added > { > "Id":75757571 > }, > { > "Id":75757573 > }, > { > "Id":75757575 > } > ], > "paths": [ <--- added > { > "afi": "IPv4", > "safi:" "unicast", > "prefix": "192.0.2.9/32", > [..] > } > ] > } >] Signed-off-by: Philippe Guibert <[email protected]>
Until now, only the PATH_SELECTED path had a link to the parent nexthop group. This may be a problem if we have a single path with no bgp_nhg, and a bgp_nexthop. Instead of parsing the whole path list for a given prefix, reuse the bgp_nhg pointer for all entries. Signed-off-by: Philippe Guibert <[email protected]>
When using BGP nexthop-groups with route, it is not needed to reinstall routes when only the nexthop needs to be updated for any reason (IGP, or same incoming BGP update received with nexthop modification). Fix this by introducing a flag at node level, to know if a route update is needed or not. This feature works if 'bgp suppress-fib-enabled' functionality is used. Without that, the BGP service does not know if a route has already been installed or not. Signed-off-by: Philippe Guibert <[email protected]>
When a BFD failure happens with multiple same nexthops belonging to different peers, the parent nexthop group is not refreshed, and at BGP refresh, a new parent NHGID is chosen. The below case shows the same prefix with the same nexthop learned from 3 different peers. A single nexthop group is created for all the three identical nexthops. > r1# show bgp nexthop-group 71428590 detail > ID: 71428590, #paths 6 > Flags: 0x000b (allowRecursion, internalBgp, TypeParent) > State: 0x0001 (Installed) > child list count 1 (peer count %u) > child(s) 71428577 > Paths: > 1/1 172.18.1.101/32 VRF default flags 0xc10 > 1/1 172.18.1.101/32 VRF default flags 0xc10 > 1/1 172.18.1.101/32 VRF default flags 0x418 > 1/1 172.18.1.100/32 VRF default flags 0xc10 > 1/1 172.18.1.100/32 VRF default flags 0xc10 > 1/1 172.18.1.100/32 VRF default flags 0x418 > r1# show bgp nexthop-group 71428577 detail > ID: 71428577, #paths 6 > Flags: 0x0003 (allowRecursion, internalBgp) > State: 0x0001 (Installed) > via 172.16.0.100 (vrf default) inactive > parent list count 1 > parent(s) 71428590 > Paths: > 1/1 172.18.1.101/32 VRF default flags 0x418 > 1/1 172.18.1.100/32 VRF default flags 0x418 > 1/1 172.18.1.100/32 VRF default flags 0xc10 > 1/1 172.18.1.101/32 VRF default flags 0xc10 > 1/1 172.18.1.101/32 VRF default flags 0xc10 > 1/1 172.18.1.100/32 VRF default flags 0xc10 > r1# show bgp ipv4 172.18.1.101/32 > BGP routing table entry for 172.18.1.101/32, version 26 > Paths: (3 available, best FRRouting#1, table default, vrf (null)) > Not advertised to any peer > Local > 172.16.0.100 from 192.0.2.5 (192.0.2.5) > Origin incomplete, metric 0, localpref 100, valid, internal, multipath, best (Router ID) > AddPath ID: RX 5, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0 > Last update: Fri Sep 13 14:47:14 2024 > Local > 172.16.0.100 from 192.0.2.3 (192.0.2.8) > Origin incomplete, metric 0, localpref 100, valid, internal, multipath > Originator: 192.0.2.8, Cluster list: 192.0.2.3 > AddPath ID: RX 18, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0 > Last update: Fri Sep 13 14:45:12 2024 > Local > 172.16.0.100 from 192.0.2.3 (192.0.2.6) > Origin incomplete, metric 0, localpref 100, valid, internal, multipath > Originator: 192.0.2.6, Cluster list: 192.0.2.3 > AddPath ID: RX 13, TX-All 0 TX-Best-Per-AS 0 TX-Best-Selected 0 > Last update: Fri Sep 13 14:45:12 2024 After failover, the parent nexthop group is replaced, and the failover takes time due to the changing nexthop associated to that nexthop, and all prefixes using that nexthop. Instead, the same parent NHGID should be kept. Create an hash list of peers for each child nexthop. Before failover, 3 peers are mentioned along with the exact path counter. At failover, the failed peer decrements, and the parent NHGID is refreshed > 2024/09/13 13:52:11.277173 BGP: [MNZ4S-8HW2T] NHG 71428578: peer count changed (3 -> 2) for nexthop (172.16.0.100 if 0 VRF 0 wt 0 ) Signed-off-by: Philippe Guibert <[email protected]>
Add BGP user documentation about nexthop-group (show, debug, config). Signed-off-by: Philippe Guibert <[email protected]>
Add technical information about BGP nexthop groups. Signed-off-by: Philippe Guibert <[email protected]>
When using bgp nexthop groups, it is possible to avoid route updates, and this test ensures that the number of zapi route messages from BGP to ZEBRA can be reduced, for instance when a failover happens. Signed-off-by: Philippe Guibert <[email protected]>
When not used by default, bgp nexthop group are not tested against the following route entry added: - route entry resolved over default route - route entry resolved over itself - route entry resolved over blackhole route - route entry connected (case mplsvpn) - route entry using ipv4 mapped ipv6 address This test suite covers the above use cases. Signed-off-by: Philippe Guibert <[email protected]>
When using BGP nexthop-groups, BGP does not add inactive nexthops to zebra routes. This is the case in the bgp_peer_type_multipath test, where step8 fails to match the 203.0.113.8/30 prefix. > # show bgp ipv4 > [..] > *= 203.0.113.8/30 198.51.100.10 0 64503 i > *> 10.0.3.2 0 64502 i With BGP nexthop-groups, BGP installs the valid nexthop. > # show ip route > [..] > B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:22:49 Without BGP nexthop-group, BGP adds 2 nexthops in zebra, including the inactive one. > # show ip route > [..] > B>* 203.0.113.8/30 [20/0] via 10.0.3.2, r1-eth1, weight 1, 00:00:21 > * via 198.51.100.10 inactive, weight 1, 00:00:21 The option to permit sending inactive nexthops has been tested, and leads to zebra error messages, which result in having non deterministic output: > 2024/07/03 16:22:20.137637 ZEBRA: [RTQGY-B7JCM] zapi_read_nexthop_group: nhgroup=75757592) > 2024/07/03 16:22:20.137639 ZEBRA: [P3R7K-NKGH6][EC 4043309131] zapi_read_nexthop_group: Nexthops Groups Specified: 3(75757592) not found > 2024/07/03 16:22:20.137688 ZEBRA: [NTQ7Y-ATGH0][EC 4043309131] zread_nhg_child_add: Nexthop Group Creation failed An other option to modify the test to ignore the invalid nexthop entry is preferred. This is what this commit will propose. Signed-off-by: Philippe Guibert <[email protected]>
Modify the current test, to add recursive BGP routes: those routes will resolve over BGP routes, similar to what happens when a full route is received. The IGP failure test checks that the NHIDs from those recursive routes is not changed, and that a nexthop group message update to ZEBRA is enough to change the routing to all incoming BGP routes. The BFD failover test is implemented, and warns that the nexthop group ID has changed. The check_route test is modified to ignore the duplicated nexthops in the FIB. Counting the number of duplicated nexthops is a pain with recursive BGP routes, as per the below output: > r1# show ip route 172.18.1.100 > Routing entry for 172.18.1.100/32 > Known via "bgp", distance 200, metric 0, best > Last update 00:00:13 ago > 172.16.0.200 (recursive), weight 1 > * 172.31.0.3, via r1-eth1, label 16055, weight 1 > * 172.31.2.4, via r1-eth2, label 16055, weight 1 > * 172.31.0.3, via r1-eth1, label 16006, weight 1 > * 172.31.2.4, via r1-eth2, label 16006, weight 1 > * 172.31.8.7, via r1-eth4, label 16008, weight 1 > 172.16.0.200 (duplicate nexthop removed) (recursive), weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1 > 172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1 > > r1# show ip route 172.22.1.100 > Routing entry for 172.22.1.100/32 > Known via "bgp", distance 200, metric 0, best > Last update 00:00:17 ago > 172.18.1.200 (recursive), weight 1 > * 172.31.0.3, via r1-eth1, label 16055, weight 1 > * 172.31.2.4, via r1-eth2, label 16055, weight 1 > * 172.31.0.3, via r1-eth1, label 16006, weight 1 > * 172.31.2.4, via r1-eth2, label 16006, weight 1 > * 172.31.8.7, via r1-eth4, label 16008, weight 1 > 172.18.1.200 (duplicate nexthop removed) (recursive), weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1 > 172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1 > 172.18.1.200 (duplicate nexthop removed) (recursive), weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16055, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16055, weight 1 > 172.31.0.3, via r1-eth1 (duplicate nexthop removed), label 16006, weight 1 > 172.31.2.4, via r1-eth2 (duplicate nexthop removed), label 16006, weight 1 > 172.31.8.7, via r1-eth4 (duplicate nexthop removed), label 16008, weight 1 > > r1# Signed-off-by: Philippe Guibert <[email protected]>
Move some common routines to lib common parts. Signed-off-by: Philippe Guibert <[email protected]>
The nexthop group test suite is extended with multipath support. The test ensures that on a given multipath configuration, when a failover happens, the new resulting paths use the same nexthop group without the failed nexthop. Signed-off-by: Philippe Guibert <[email protected]>
Compared to frr8, the 1001::/64 route is not recursive > r1# show ipv6 route 1001::/64 > Routing entry for 1001::/64 > Known via "bgp", distance 200, metric 0, best > Last update 00:05:06 ago > * ::ffff:192.0.2.4, via r1-eth0, weight 1 The reason of this change is unknown! Signed-off-by: Philippe Guibert <[email protected]>
Add a test similar to bgp_nhg_zapi_scalability ibgp test, except that a route server is put in place in r3 insted of a route reflector. ebgp configuration is done so that ebgp multihop is accepted and as paths list is ignored. In that way, ECMP paths are visible. It is however not possible to have ECMP paths with 2 different ebgp peers, as neighbor IP comparison will be applied. So a part of the test is removed. Signed-off-by: Philippe Guibert <[email protected]>
pguibert6WIND
force-pushed
the
bgp_nhg_optimisations
branch
from
October 21, 2024 16:49
9d4f006
to
af27463
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This patch set proposes to bring BGP changes. Among the reasons of those changes:
Link: https://github.com/pguibert6WIND/SONiC/blob/proposal_bgp/doc/pic/bgp_pic_arch_doc.md#9-frrouting-ipc-messaging-from-bgp-to-zebra