
EVPN L2VNI/L3VNI Optimize inline Global walk for remote route installations #17526

Conversation


@raja-rajasekar raja-rajasekar commented Nov 26, 2024

The following are cases where the backpressure is reversed (see the note below): BGP is busy processing the first ZAPI message from zebra, so the output buffer in zebra grows huge and memory spikes. A few example triggers:

Interface Up/Down events

  • a bulk of L2VNIs is flapped, i.e. a flap of the br_default interface, or
  • a bulk of L3VNIs is flapped, i.e. a flap of the br_l3vni interface

Any time BGP gets an L2 VNI ADD or an L3 VNI ADD/DEL from zebra:

  • Walking the entire global routing table per VNI is very expensive.
  • The next read (say, of another VNI ADD) from the socket cannot proceed until this walk is complete.

So for triggers where a bulk of L2VNIs/L3VNIs is flapped, the output buffer FIFO in zebra grows enormously and spikes zebra's memory, since bgpd is slow/busy processing the first message.

To avoid this, the idea is to (a minimal sketch follows this list):

  • hook the VPN off the bgp_master struct and maintain a VPN FIFO list that is processed later, walking a chunk of VPNs at a time and doing the remote route install.
  • hook the BGP-VRF off struct bgp_master and maintain a struct bgp FIFO list that is processed later, walking a chunk of BGP-VRFs at a time and doing the remote route install/uninstall.
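
A minimal, self-contained C sketch of the deferred processing described in the list above. Every name here (pending_vpn, bgp_master_sketch, schedule_vpn_walk, VPN_CHUNK) is a hypothetical stand-in rather than the actual FRR structures or APIs, and it uses the standard <sys/queue.h> macros instead of FRR's list implementation. The key point is that the ZAPI handler only enqueues and returns, and a deferred job walks a bounded chunk per run, so the socket read loop is never blocked by a full global-table walk.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-ins; not the actual FRR bgp_master/bgpevpn structures. */
struct pending_vpn {
	unsigned int vni;
	TAILQ_ENTRY(pending_vpn) entry;
};

struct bgp_master_sketch {
	TAILQ_HEAD(vpn_fifo_head, pending_vpn) vpn_fifo; /* VNIs awaiting the walk */
	bool walk_scheduled;                              /* one deferred job at a time */
};

static struct bgp_master_sketch bm = {
	.vpn_fifo = TAILQ_HEAD_INITIALIZER(bm.vpn_fifo),
};

/* Stand-in for scheduling the deferred job on the daemon's event loop. */
static void schedule_vpn_walk(void) { }

/* Stand-in for the expensive per-VNI global routing table walk. */
static void install_remote_routes_for_vni(unsigned int vni)
{
	printf("installing remote routes for VNI %u\n", vni);
}

/* ZAPI "L2VNI ADD" handler: enqueue only and return, so the next message
 * can be read from the socket immediately. */
static void on_l2vni_add(unsigned int vni)
{
	struct pending_vpn *p = calloc(1, sizeof(*p));

	if (!p)
		return;
	p->vni = vni;
	TAILQ_INSERT_TAIL(&bm.vpn_fifo, p, entry);

	if (!bm.walk_scheduled) {
		bm.walk_scheduled = true;
		schedule_vpn_walk();
	}
}

#define VPN_CHUNK 128 /* bounded work per event-loop run */

/* Deferred job: walk at most VPN_CHUNK queued VNIs, reschedule if more remain. */
static void vpn_walk_job(void)
{
	struct pending_vpn *p;
	int done = 0;

	while (done < VPN_CHUNK && (p = TAILQ_FIRST(&bm.vpn_fifo))) {
		TAILQ_REMOVE(&bm.vpn_fifo, p, entry);
		install_remote_routes_for_vni(p->vni);
		free(p);
		done++;
	}

	bm.walk_scheduled = !TAILQ_EMPTY(&bm.vpn_fifo);
	if (bm.walk_scheduled)
		schedule_vpn_walk();
}

int main(void)
{
	for (unsigned int vni = 1; vni <= 300; vni++)
		on_l2vni_add(vni); /* simulate a burst of L2VNI ADDs from zebra */

	do {                       /* each iteration models one event-loop run */
		vpn_walk_job();
	} while (bm.walk_scheduled);

	return 0;
}

Bounding the work per run is what keeps the zebra-to-bgpd socket drained, which in turn is what prevents the output-FIFO growth shown in the measurements later in this thread.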

Note: So far, in the L3 backpressure cases (#15524), we have considered the situation where zebra is slow and the buffer grows in BGP.

However, this is the reverse: BGP is busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes.

@raja-rajasekar (author) commented:

For "bgpd: Suppress redundant L3VNI delete processing"

Instrumented logs without fix: ifdown br_l3vni

    74 2024/11/26 22:28:00.324443 ZEBRA: [Z9WYD-4ERFV] RAJA DOWN zebra_vxlan_svi_down
    75 2024/11/26 22:28:00.324450 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

 17507 2024/11/26 22:28:00.876116 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 17508 2024/11/26 22:28:00.876197 ZEBRA: [WJRZ7-WE7M5] RTM_NEWLINK update for vxlan99(147) sl_type 0 master 0
 17509 2024/11/26 22:28:00.876203 ZEBRA: [PPSYY-6KJJP] Intf vxlan99(147) PTM up, notifying clients
 17510 2024/11/26 22:28:00.876314 ZEBRA: [W7XYW-5FTP2] RAJA999 in if_up
….
 17986 2024/11/26 22:28:00.886309 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4004 intf vxlan99(147) VLAN 2668 local IP 2.1.1.6 master 0 chg 0x2
 17987 2024/11/26 22:28:00.886311 ZEBRA: [XEW89-KXF0P] RAJA-DOWN 1 zebra_vxlan_if_update_vni for vni 4004
 17988 2024/11/26 22:28:00.886312 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

Instrumented logs without fix: ifup br_l3vni

 359 2024/11/26 22:29:25.427495 ZEBRA: [M37B1-HQHSP] RTM_NEWVLAN for ifindex 147 NS 0, enqueuing for zebra main

 362 2024/11/26 22:29:25.427423 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7fc4d0027e10, op INTF_INSTALL, ifindex (147), result QUEUED
 363 2024/11/26 22:29:25.427428 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 364 2024/11/26 22:29:25.427465 ZEBRA: [TQR2A-H2RFY] Vlan-Vni(671:671-4005:4005) update for VxLAN IF vxlan99(147)
 365 2024/11/26 22:29:25.427483 ZEBRA: [QZ4F6-8EX79] zebra_vxlan_if_add_update_vni vxlan vxlan99 vni (4005, 671) not present in bridge table
 366 2024/11/26 22:29:25.427486 ZEBRA: [PWSYZ-A537X] zebra_evpn_acc_vl_new access vlan 671 bridge br_l3vni add
 367 2024/11/26 22:29:25.427569 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4005 intf vxlan99(147) VLAN 671 local IP 2.1.1.6 master 670 chg 0x4
 368 2024/11/26 22:29:25.427573 ZEBRA: [Z0ADA-V8CT4] RAJA-DOWN 3 zebra_vxlan_if_update_vni
 369 2024/11/26 22:29:25.427580 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4005 VRF vrf5 to bgp
….
 386 2024/11/26 22:29:25.428382 ZEBRA: [WVRMN-YEC5Q] Del L3-VNI 4001 intf vxlan99(147)
 387 2024/11/26 22:29:25.428384 ZEBRA: [WDB17-CBPCZ] RAJA DOWNzebra_vxlan_if_del_vni
 388 2024/11/26 22:29:25.428387 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4001 VRF vrf1 to bgp

With Fix: ifdown br_l3vni

   668 2024/11/26 19:18:26.344063 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf5 VNI 4005
   669 2024/11/26 19:18:26.344069 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4005 is already deleted
   670 2024/11/26 19:18:26.344092 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
   671 2024/11/26 19:18:26.344093 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
   672 2024/11/26 19:18:26.344114 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
   673 2024/11/26 19:18:26.344115 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted
   674 2024/11/26 19:18:26.344135 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf3 VNI 4003
   675 2024/11/26 19:18:26.344136 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4003 is already deleted
   676 2024/11/26 19:18:26.344157 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
   677 2024/11/26 19:18:26.344158 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4002 is already deleted
….
   688 2024/11/26 19:18:26.344517 BGP: [XXJ7P-NWW2X] Rx L3VNI ADD VRF vrf3 VNI 4003 Originator-IP 2.1.1.6 RMAC svi-mac 1c:34:da:23:4f:fd vrr-mac 1c:34:da:23:4f:fd filter none svi-if 5517

With Fix: ifup br_l3vni

 8546 2024/11/26 19:26:23.400423 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
  8547 2024/11/26 19:26:23.400435 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf1 VNI 4001
  8548 2024/11/26 19:26:23.402722 BGP: [GFHWV-99P7C] Rx Intf down VRF vrf1 IF vlan2501_l3
  8549 2024/11/26 19:26:23.404025 BGP: [G49HN-S8M77] Rx Intf address del VRF vrf1 IF vlan2501_l3 addr fe80::1e34:daff:fe23:4ffd/64
  8550 2024/11/26 19:26:23.404397 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
  8551 2024/11/26 19:26:23.404401 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf2 VNI 4002

140642 2024/11/26 19:26:26.165399 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
140643 2024/11/26 19:26:26.165410 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
…..
145672 2024/11/26 19:26:26.236828 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
145673 2024/11/26 19:26:26.236836 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted
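
For reference, the early return exercised by the "already deleted" messages above has roughly the following shape. This is a simplified, self-contained sketch, not the actual code in bgpd/bgp_evpn.c: struct bgp_vrf_sketch and l3vni_del_sketch are stand-ins, and only the message text mirrors the real bgp_evpn_local_l3vni_del log.

#include <stdio.h>

/* Stand-in for the relevant piece of struct bgp; not the real definition. */
struct bgp_vrf_sketch {
	unsigned int l3vni; /* 0 means no L3VNI configured / already deleted */
	const char *name;
};

/* If zebra sends a second L3VNI DEL for a VNI that bgpd has already torn
 * down, skip the expensive inline global-table walk and return right away. */
static int l3vni_del_sketch(struct bgp_vrf_sketch *bgp_vrf, unsigned int vni)
{
	if (bgp_vrf->l3vni == 0) {
		printf("Returning from bgp_evpn_local_l3vni_del since VNI %u is already deleted\n",
		       vni);
		return 0;
	}

	/* ... otherwise: uninstall remote routes, clear RTs/RDs, etc. ... */
	bgp_vrf->l3vni = 0;
	return 0;
}

int main(void)
{
	struct bgp_vrf_sketch vrf1 = { .l3vni = 4001, .name = "vrf1" };

	l3vni_del_sketch(&vrf1, 4001); /* first DEL: real teardown */
	l3vni_del_sketch(&vrf1, 4001); /* redundant DEL: early return */
	return 0;
}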

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch 2 times, most recently from b186ca3 to 6b68753 Compare November 27, 2024 06:53
@ton31337 (Member) left a comment:

TBH, I don't understand the goal of 58e8563. Why not bundle together with the real changes?

Review comments (since resolved) were left on bgpd/bgp_evpn.c, lib/prefix.h, and lib/zclient.h.
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch from 6b68753 to 2cfe7bd Compare November 27, 2024 07:06
@raja-rajasekar raja-rajasekar marked this pull request as draft November 27, 2024 07:11
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch 7 times, most recently from 4accd1d to 2d1e5f4 Compare November 28, 2024 05:45
@raja-rajasekar (author) commented:

On a setup with 2501 L2VNIs, 5 L3VNIs, and 200K EVPN scale:
sudo vtysh -c "sh evpn vni" | grep L2 | wc -l
2501
sudo vtysh -c "sh evpn vni" | grep L3 | wc -l
5

Test Case:

  • Boot up the node
  • Wait until convergence
  • ifdown/ifup br_default (flaps all 2500 VNIs) at intervals of 12, 10, 8, and 6 seconds.
  • Wait until convergence and check the numbers

Without Fix: (Output Fifo rose from 2.7K to 120K with stream size 1.5MB to 639MB)

####################### Initial State (Without Fix) #######################
sudo vtysh -c "show bgp l2vpn evpn sum"
Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
superspine1(2.1.1.1) 4  744603565    106035     87045        0    0    0 00:07:56       214331   281614 N/A
superspine2(2.1.1.2) 4  744603565    105674     87045        0    0    0 00:07:56       214331   281614 N/A

sudo vtysh -c "sh zebra client" | grep "Client:\|Output"
Client: bgp
Input Fifo: 0:999 Output Fifo: 0:2647 >>>>>>>>>2.6K


sudo vtysh -c "sh mem zebra" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  126 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        7     24         168        7       168
Buffer data                   :        1   4120        4120       53    218360
Stream                        :        4 variable    185056     2653   1528184 >>>>>>> 1.5MB
Stream FIFO                   :        5     72         360        8       576

sudo vtysh -c "sh mem bgpd" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  810 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        5     24         120        5       120
Buffer data                   :        1   4120        4120        5     20600
Stream                        :        9 variable    555656    20128   3910944
Stream FIFO                   :       26     72        1872       30      2160

####################### Final State (Without Fix) #######################
sudo vtysh -c "show bgp l2vpn evpn sum"
Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
superspine1(2.1.1.1) 4  744603565    119236     89000        0    0    0 00:15:03       214331   216133 N/A
superspine2(2.1.1.2) 4  744603565    118869     89000        0    0    0 00:15:03       214331   216133 N/A

sudo vtysh -c "sh zebra client" | grep "Client:\|Output"
Client: bgp
Input Fifo: 23:999 Output Fifo: 119841:119840 >>>120K

sudo vtysh -c "sh mem zebra" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  700 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        7     24         168        9       216
Buffer data                   :      188   4120      774560     1227   5055240
Stream                        :       10 variable    185392   121346 639680968  >>>>>>>639MB
Stream FIFO                   :        5     72         360        8       576

sudo vtysh -c "sh mem bgpd" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  805 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        5     24         120        9       216
Buffer data                   :        1   4120        4120    11587  47738680
Stream                        :        3 variable    159128    20128   3910944
Stream FIFO                   :       26     72        1872       30      2160

With Fix: (Output Fifo rose from 2K to 11K with stream size 27MB to 81MB)

####################### Initial State (With Fix) #######################
sudo vtysh -c "show bgp l2vpn evpn sum"
Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
superspine1(2.1.1.1) 4  744603565     14578     15073        0    0    0 00:00:20       214331   218666 N/A
superspine2(2.1.1.2) 4  744603565     14578     15073        0    0    0 00:00:20       214331   218666 N/A

sudo vtysh -c "sh zebra client" | grep "Client:\|Output"
Client: bgp
Input Fifo: 0:999 Output Fifo: 0:1984 >>>>>>2K

sudo vtysh -c "sh mem zebra" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  103 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        7     24         168        7       168
Buffer data                   :        1   4120        4120       50    206000
Stream                        :        4 variable    185056     4935  27899712 >>>> 27MB
Stream FIFO                   :        5     72         360        8       576


sudo vtysh -c "sh mem bgpd" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  688 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        5     24         120        5       120
Buffer data                   :        1   4120        4120        5     20600
Stream                        :        9 variable    555656    19491  15738072
Stream FIFO                   :       26     72        1872       26      1872


####################### Final State (With Fix) #######################
sudo vtysh -c "show bgp l2vpn evpn sum"
Neighbor             V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
superspine1(2.1.1.1) 4  744603565     73472     32176        0    0    0 00:04:34       214331   218639 N/A
superspine2(2.1.1.2) 4  744603565     73420     32176        0    0    0 00:04:34       214331   218639 N/A


sudo vtysh -c "sh zebra client" | grep "Client:\|Output"
Client: bgp
Input Fifo: 0:999 Output Fifo: 0:11174

sudo vtysh -c "sh mem zebra" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  105 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        9     24         216        9       216
Buffer data                   :     1374   4120     5660880     1393   5739160
Stream                        :     1004 variable    257056    11228  81240968 >>>>>81MB
Stream FIFO                   :        5     72         360        8       576

sudo vtysh -c "sh mem bgpd" | grep "Buffer\|Stream\|Total"
  Total heap allocated:  688 MiB
Type                          : Current#   Size       Total     Max#  MaxBytes
Buffer                        :        5     24         120        9       216
Buffer data                   :        1   4120        4120    11651  48002120
Stream                        :        3 variable    159128    19491  15738072
Stream FIFO                   :       26     72        1872       26      1872

@raja-rajasekar raja-rajasekar marked this pull request as ready for review December 2, 2024 03:06
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch from 2d1e5f4 to 7515018 Compare December 4, 2024 18:28
@github-actions bot added the "rebase" label (PR needs rebase) Dec 4, 2024
Trey Aspelund and others added 4 commits December 9, 2024 08:46
Adds a msg list for getting strings mapping to enum bgp_evpn_route_type

Ticket: #3318830

Signed-off-by: Trey Aspelund <[email protected]>
Anytime BGP gets a L2 VNI ADD from zebra,
 - Walking the entire global routing table per L2VNI is very expensive.
 - The next read (say of another VNI ADD) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L2VNIs is flapped, this results in
huge output buffer FIFO growth spiking up the memory in zebra since bgp
is slow/busy processing the first message.

To avoid this, the idea is to hook up the VPN off the bgp_master struct and
maintain a VPN FIFO list which is processed later on, where we walk a
chunk of VPNs and do the remote route install.

Note: So far in the L3 backpressure cases(FRRouting#15524), we have considered
the fact that zebra is slow, and the buffer grows in the BGP.

However this is the reverse i.e. BGP is very busy processing the first
ZAPI message from zebra due to which the buffer grows huge in zebra
and memory spikes up.

Ticket :#3864372

Signed-off-by: Rajasekar Raja <[email protected]>
Anytime BGP gets a L3 VNI ADD/DEL from zebra,
 - Walking the entire global routing table per L3VNI is very expensive.
 - The next read (say of another VNI ADD/DEL) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L3VNIs is flapped, this results in
huge output buffer FIFO growth spiking up the memory in zebra since bgp
is slow/busy processing the first message.

To avoid this, the idea is to hook up the BGP-VRF off the struct bgp_master
and maintain a struct bgp FIFO list which is processed later on, where
we walk a chunk of BGP-VRFs and do the remote route install/uninstall.

Ticket :#3864372

Signed-off-by: Rajasekar Raja <[email protected]>
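
To make the struct bgp FIFO described in the commit message above a little more concrete, here is a small self-contained sketch. All names (pending_vrf, enqueue_vrf, process_vrf_fifo_chunk, VRF_CHUNK) are hypothetical, and it uses the standard <sys/queue.h> macros rather than FRR's typesafe lists: the L3VNI ADD/DEL handler only records the VRF and the desired action, and a deferred job later walks a bounded chunk of VRFs per run.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/queue.h>

enum vrf_action { VRF_ROUTES_INSTALL, VRF_ROUTES_UNINSTALL };

struct pending_vrf {
	int vrf_id;
	enum vrf_action action;
	TAILQ_ENTRY(pending_vrf) entry;
};

static TAILQ_HEAD(vrf_fifo_head, pending_vrf) vrf_fifo =
	TAILQ_HEAD_INITIALIZER(vrf_fifo);

/* L3VNI ADD/DEL handler: record the work and return to the socket loop. */
static void enqueue_vrf(int vrf_id, enum vrf_action action)
{
	struct pending_vrf *v = calloc(1, sizeof(*v));

	if (!v)
		return;
	v->vrf_id = vrf_id;
	v->action = action;
	TAILQ_INSERT_TAIL(&vrf_fifo, v, entry);
}

#define VRF_CHUNK 2 /* small chunk: each global-table walk is expensive */

/* Deferred job: handle at most VRF_CHUNK VRFs, return true if more remain. */
static bool process_vrf_fifo_chunk(void)
{
	struct pending_vrf *v;
	int done = 0;

	while (done < VRF_CHUNK && (v = TAILQ_FIRST(&vrf_fifo))) {
		TAILQ_REMOVE(&vrf_fifo, v, entry);
		printf("VRF %d: %s remote routes\n", v->vrf_id,
		       v->action == VRF_ROUTES_INSTALL ? "install" : "uninstall");
		free(v);
		done++;
	}
	return !TAILQ_EMPTY(&vrf_fifo);
}

int main(void)
{
	for (int vrf = 1; vrf <= 5; vrf++)
		enqueue_vrf(vrf, VRF_ROUTES_UNINSTALL); /* e.g. ifdown br_l3vni */

	while (process_vrf_fifo_chunk())
		; /* each iteration models one event-loop run */
	return 0;
}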
Consider a master bridge interface (br_l3vni) with a slave vxlan99
mapped to the VLANs used by 3 L3VNIs.

During ifdown of the br_l3vni interface, the L3VNI delete ZAPI that zebra
sends to bgp from zebra_vxlan_process_l3vni_oper_down() is sent twice:
 1) if_down -> zebra_vxlan_svi_down()
 2) VXLAN is unlinked from the bridge, i.e. vxlan99:
    zebra_if_dplane_ifp_handling() --> zebra_vxlan_if_update_vni()
    (since the ZEBRA_VXLIF_MASTER_CHANGE flag is set)

During ifup of the br_l3vni interface, zebra_vxlan_process_l3vni_oper_down()
is invoked because of the access-vlan change: process oper down, associate
with the new svi_if, and then process oper up again.

The problem here is that the redundant L3VNI delete ZAPI message results
in BGP doing an inline global-table walk for remote route installation
when the L3VNI is already removed/deleted. The bigger the scale, the
more CPU is wasted.

Given that a bridge flap is not a common trigger, the idea is to simply
return from BGP if the L3VNI is already set to 0, i.e. if the L3VNI is
already deleted, do nothing and return.

NOTE/TBD: An ideal fix is to make zebra not send the second L3VNI delete
ZAPI message. However, that is a much more involved change to day-1 code,
with corner cases to handle.

Ticket :#3864372

Signed-off-by: Rajasekar Raja <[email protected]>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch from 7515018 to bd32706 Compare December 9, 2024 16:46
@raja-rajasekar (author) commented:

ci:rerun

@ton31337 (Member) left a comment:

LGTM

@riw777 (Member) left a comment:

looks good

@riw777 riw777 merged commit a3e0e4e into FRRouting:master Dec 17, 2024
11 checks passed