Withdrawing/announcing BGP routes in a loop with timeout scenario behaves differently with BGP suppress FIB feature enabled #14797

Closed
1 of 2 tasks
vadymhlushko-mlnx opened this issue Nov 14, 2023 · 4 comments
Labels
autoclose triage Needs further investigation

Comments


vadymhlushko-mlnx commented Nov 14, 2023


Describe the bug

For the SONiC NOS, we have a BGP stress test that does the following:

  1. Withdraw all BGP routes, wait 120 seconds, and save the number of routes that are left (usually 32)
  2. Repeat 10 times:
    a. announce BGP routes
    b. wait 40 seconds
    c. withdraw BGP routes
    d. wait 40 seconds
  3. On the last (10th) iteration, withdraw BGP routes, wait 100 seconds, and save the number of routes that are left
  4. Compare the number from Step 3 to the number from Step 1
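
The loop above can be sketched as a small driver. This is only an illustration of the timing, not the actual SONiC test: `announce`, `withdraw`, and `count_routes` are hypothetical callbacks standing in for whatever the test harness uses, and the sketch assumes that the final withdraw of the 10th iteration is the one followed by the 100-second wait.

```python
import time
from typing import Callable, Tuple


def run_stress_test(
    announce: Callable[[], None],      # hypothetical: announce all BGP routes
    withdraw: Callable[[], None],      # hypothetical: withdraw all BGP routes
    count_routes: Callable[[], int],   # hypothetical: routes left on the switch
    iterations: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Return (baseline, final) route counts for the announce/withdraw loop."""
    # Step 1: withdraw everything, let the switch settle, record the baseline.
    withdraw()
    sleep(120)
    baseline = count_routes()

    # Step 2: announce/withdraw cycles with 40-second pauses.
    # Assumption: the withdraw of the last iteration is handled in step 3.
    for i in range(iterations):
        announce()
        sleep(40)
        if i < iterations - 1:
            withdraw()
            sleep(40)

    # Step 3: final withdraw with the longer 100-second settle time.
    withdraw()
    sleep(100)
    final = count_routes()

    # Step 4 is the caller's comparison of the two numbers.
    return baseline, final
```

The test passes when both returned counts are equal. Passing a fake `sleep` makes it possible to dry-run the sequence without waiting.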

On a SONiC image with FRR tag frr-8.5.1, kernel version 5.10.0-23-2-amd64, Debian version 5.10.179, and the BGP suppress FIB feature enabled, the switch logs show that the switch is still busy announcing and withdrawing routes even after the test has completed (5-6 announce/withdraw cycles can be seen after the test completes).

This does not happen if the BGP suppress FIB feature is completely disabled (in both SONiC and FRR).

  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

To Reproduce

To reproduce, we simulate a "busy" zebra and add some routes via an ECMP group, then disable one of the ECMP group members to simulate the add/update route scenario:

  1. Establish BGP sessions on the switch with 3 neighbors - A1, A2, A3
  2. Go to the A2 and A3 neighbors and withdraw all BGP routes
  3. Stop the zebra process on the switch (in order to simulate busy zebra)
    a. kill -SIGSTOP $(pidof zebra)
  4. Announce BGP routes from A2 neighbor
  5. 2-4 seconds delay
  6. Announce BGP routes from A3 neighbor
  7. Wait until A2 distributes routes and then withdraw routes
  8. Wait until A3 distributes routes and then withdraw routes
  9. Withdraw BGP routes from A3 neighbor
  10. Start the zebra process on the switch
    a. kill -SIGCONT $(pidof zebra)
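
When scripting these steps, the stop/continue pair can be wrapped so that zebra is always resumed, even if a step in between fails. The following is a minimal sketch, assuming a Linux host where `pidof` is available; the helper names `pids_of` and `paused` are made up for this example and are not part of FRR or SONiC.

```python
import os
import signal
import subprocess
from contextlib import contextmanager
from typing import Iterator, List


def pids_of(name: str) -> List[int]:
    """Return the PIDs printed by `pidof NAME` (empty list if none are found)."""
    result = subprocess.run(["pidof", name], capture_output=True, text=True)
    return [int(p) for p in result.stdout.split()]


@contextmanager
def paused(pids: List[int]) -> Iterator[None]:
    """Send SIGSTOP to each PID on entry and SIGCONT on exit."""
    for pid in pids:
        os.kill(pid, signal.SIGSTOP)
    try:
        yield
    finally:
        # Always resume, so a failed step does not leave the process frozen.
        for pid in pids:
            os.kill(pid, signal.SIGCONT)


# Usage (requires permission to signal zebra):
#
# with paused(pids_of("zebra")):
#     ...  # announce and withdraw routes from A2 and A3 here
```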

Expected behavior

The behavior should be the same as when BGP suppress FIB is disabled.

Screenshots

Versions

  • OS Version: Debian version - 5.10.179
  • Kernel: 5.10.0-23-2-amd64
  • FRR Version: tag frr-8.5.1
  • BGP suppress FIB enabled/disabled

Additional context

vadymhlushko-mlnx added the "triage (Needs further investigation)" label on Nov 14, 2023

vadymhlushko-mlnx commented Nov 15, 2023

Test scenario with more details:

Let’s say we have 3 BGP neighbors ARISTA01T1 (A1) – 10.0.0.57, ARISTA02T1 (A2) – 10.0.0.59, ARISTA03T1 (A3) – 10.0.0.61, and we work only with 1 route (for simplicity):

  1. When only ARISTA01T1 (A1) is up and A2 and A3 are down, we have the following route:
    a. 192.220.112.128/2 via 10.0.0.57(A1)
  2. We stop the zebra process on the switch (in order to simulate the “busy” zebra process)
    a. kill -SIGSTOP $(pidof zebra)
  3. Announce BGP routes on A2
  4. Announce BGP routes on A3
  5. Withdraw BGP routes on A2
  6. Withdraw BGP routes on A3
  7. Announce BGP routes on A2
  8. Announce BGP routes on A3
  9. Withdraw BGP routes on A3
  10. We let the zebra process continue and process requests in the queue
    a. kill -SIGCONT $(pidof zebra)

The routing table at the beginning will look like this:
192.220.112.128/2 via 10.0.0.57(A1)

The requests in the zebra queue will look like this:

  1. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2)
  2. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2), 10.0.0.61 (A3)
  3. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.61 (A3)
  4. 192.220.112.128/2 via 10.0.0.57(A1)
  5. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2)
  6. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2), 10.0.0.61 (A3)
  7. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2)
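
The seven queued requests follow directly from the announce/withdraw steps: each step adds or removes one neighbor from the ECMP nexthop set, and one request is queued per change. A small sketch of that bookkeeping (the function and variable names are made up for this example, and the model ignores everything except the nexthop set):

```python
from typing import Iterable, List, Tuple

# Steps 3-9 of the scenario, in order.
EVENTS = [
    ("announce", "A2"),
    ("announce", "A3"),
    ("withdraw", "A2"),
    ("withdraw", "A3"),
    ("announce", "A2"),
    ("announce", "A3"),
    ("withdraw", "A3"),
]


def queued_requests(
    initial: Iterable[str], events: Iterable[Tuple[str, str]]
) -> List[Tuple[str, ...]]:
    """Return the nexthop set after each event, one entry per queued request."""
    active = set(initial)
    requests = []
    for action, neighbor in events:
        if action == "announce":
            active.add(neighbor)
        else:
            active.discard(neighbor)
        requests.append(tuple(sorted(active)))
    return requests


for number, nexthops in enumerate(queued_requests(["A1"], EVENTS), start=1):
    print(number, "via", ", ".join(nexthops))
```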

FRR 8.5.1 (BGP suppress FIB disabled) will skip requests 1-6 and apply only the request number 7:

  1. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2)
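
The outcome with suppress FIB disabled matches a "last request per prefix wins" model. The sketch below only reproduces that observed result; it is not taken from the zebra source, and the `coalesce` helper is hypothetical.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, Tuple[str, ...]]  # (prefix, nexthop neighbors)

PREFIX = "192.220.112.128/2"  # copied as written in the report

# The seven requests queued while zebra was stopped.
QUEUE: List[Request] = [
    (PREFIX, ("A1", "A2")),
    (PREFIX, ("A1", "A2", "A3")),
    (PREFIX, ("A1", "A3")),
    (PREFIX, ("A1",)),
    (PREFIX, ("A1", "A2")),
    (PREFIX, ("A1", "A2", "A3")),
    (PREFIX, ("A1", "A2")),
]


def coalesce(queue: List[Request]) -> Dict[str, Tuple[str, ...]]:
    """Keep only the newest queued nexthop set for each prefix."""
    latest: Dict[str, Tuple[str, ...]] = {}
    for prefix, nexthops in queue:
        latest[prefix] = nexthops  # later requests overwrite earlier ones
    return latest


print(coalesce(QUEUE))
```

Under this model the routing table is touched once per prefix, which is why requests 1-6 leave no trace.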

FRR 8.5.1 (BGP suppress FIB enabled) will make the following changes to the routing table:

  1. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.61 (A3)
  2. 192.220.112.128/2 via 10.0.0.57(A1)
  3. 192.220.112.128/2 via 10.0.0.57(A1), 10.0.0.59 (A2)

ton31337 (Member) commented

Can't this be simulated with sharpd?

vadymhlushko-mlnx changed the title from "Withdrawing/announcing BGP routes in a loop with timeout scenario behaves differently in FRR 8.2 and FRR 8.5" to "Withdrawing/announcing BGP routes in a loop with timeout scenario behaves differently with BGP suppress FIB feature enabled" on Dec 6, 2023

vadymhlushko-mlnx commented Dec 6, 2023

Can't this be simulated with sharpd?

I'm not familiar with sharpd.

I found out that this issue is related to the BGP suppress FIB feature, so I updated the issue description.


github-actions bot commented Jun 4, 2024

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.
