
BMP Adj-RIB-Out (RFC8671), ECMP Support and Loc-RIB missing features #14847

Open. Wants to merge 11 commits into base: master.
Conversation

@mxyns mxyns commented Nov 21, 2023

Global:

  • added peer type for all monitored RIBs
  • added peer distinguisher for all monitored RIBs
  • include MPLS labels (when possible; not for adj-in pre-policy)
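
The peer distinguisher added above is carried in the 8-byte Peer Distinguisher field of the BMP per-peer header (RFC 7854). As a hedged illustration only, a type-0 route distinguisher (2-byte administrator ASN, 4-byte assigned number) could be packed into that field like this; encode_rd_type0 is a hypothetical helper, not FRR's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: encode an RFC 4364 type-0 route distinguisher
 * (2-byte type = 0, 2-byte administrator ASN, 4-byte assigned number)
 * into the 8-byte Peer Distinguisher field of the BMP per-peer header,
 * in network byte order. Names are placeholders, not FRR's API. */
static void encode_rd_type0(uint8_t out[8], uint16_t admin_asn, uint32_t assigned)
{
	out[0] = 0x00; /* RD type 0, high byte */
	out[1] = 0x00; /* RD type 0, low byte */
	out[2] = (uint8_t)(admin_asn >> 8);
	out[3] = (uint8_t)(admin_asn & 0xff);
	out[4] = (uint8_t)(assigned >> 24);
	out[5] = (uint8_t)(assigned >> 16);
	out[6] = (uint8_t)(assigned >> 8);
	out[7] = (uint8_t)(assigned & 0xff);
}
```

For example, RD 64512:100 would encode as the bytes 00 00 FC 00 00 00 00 64.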

Add support for Adj-RIB-Out monitoring in BMP:

  • pre- and post-policy
  • config knobs
  • sync
  • incremental updates
  • stats (when possible)
  • no timestamp (currently not recorded in adj-out)

Add ECMP support in BMP:

  • include the add-path ID in BMP messages for adj-in pre/post-policy, loc-rib and adj-out pre/post-policy
  • monitor all ECMP-selected paths (BGP_PATH_SELECTED | BGP_PATH_MULTIPATH)
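
The ECMP monitoring condition above (export a path if it is the selected best path or part of the ECMP multipath set) can be sketched as follows; the flag values here are placeholders, since FRR defines its own BGP_PATH_* bits:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag values for illustration; FRR's bgp_route.h defines
 * the real BGP_PATH_SELECTED / BGP_PATH_MULTIPATH bits. */
#define BGP_PATH_SELECTED  (1u << 0)
#define BGP_PATH_MULTIPATH (1u << 1)

/* A path is exported to BMP monitors if it is either the best path
 * or one of the ECMP multipath entries. */
static int bmp_path_is_monitored(uint32_t path_flags)
{
	return (path_flags & (BGP_PATH_SELECTED | BGP_PATH_MULTIPATH)) != 0;
}
```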

Add Loc-RIB missing features:

  • peer up/down messages on VRF state changes
  • missing stats

Missing:

  • ?

@frrbot frrbot bot added the bgp label Nov 21, 2023
@frrbot frrbot bot added the tests (Topotests, make check, etc) label Nov 24, 2023
@github-actions github-actions bot added the rebase (PR needs rebase) label Nov 24, 2023
@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 4 times, most recently from 98a978c to c4572c5 Compare November 27, 2023 16:46
@frrbot frrbot bot added the documentation label Nov 28, 2023
@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 5 times, most recently from 7ab2e3a to 4234476 Compare November 28, 2023 17:31
mxyns commented Nov 28, 2023

ci:rerun

github-actions bot:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 2 times, most recently from 2597d3f to 4e8dca8 Compare November 29, 2023 10:17
@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 3 times, most recently from 72ca5f7 to c0c8458 Compare December 1, 2023 15:35
@mxyns mxyns marked this pull request as ready for review December 2, 2023 10:58
@mxyns mxyns changed the title (WIP) BMP Adj-RIB-Out (RFC8671), ECMP Support and Loc-RIB missing features BMP Adj-RIB-Out (RFC8671), ECMP Support and Loc-RIB missing features Dec 2, 2023
@riw777 riw777 self-requested a review December 5, 2023 16:46
@ton31337 ton31337 self-requested a review December 5, 2023 16:52
@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 2 times, most recently from 77b979d to 5b95fbb Compare November 3, 2024 10:42
@frrbot frrbot bot added the bugfix label Nov 3, 2024
@mxyns mxyns force-pushed the pr-bmp-riboutmon branch 2 times, most recently from 9b939ce to 16a4d73 Compare November 8, 2024 23:50
mxyns and others added 9 commits November 9, 2024 05:39
add multipath test and rename ribs
add rd statement to vrf configuration
now BMP cannot export information about a VRF unless its RD is set (an RD is required for an RD Instance Peer to have a Peer Distinguisher)

change expected output for bgp_bmp tests
also add dashes in ADJ_[IN|OUT]_[PRE|POST]_POLICY constants to have spaceless file names

add add-path to nlri parsing
add bmpserver exception printing
add 3rd peer in topotest for ecmp

Signed-off-by: Maxence Younsi <[email protected]>
fixed nits & styling issues
fixed bad conditions in bmp_bpi_(un)lock
added bgp_peer_get_send_holdtime
added bgp_peer_get_local_as

Signed-off-by: Maxence Younsi <[email protected]>
Signed-off-by: Maxence Younsi <[email protected]>
check for null stream in bmp_send_peerup_vrf (which can happen when RD is not found)

use bmp_send_all_safe instead of bmp_send_all in bmp_vrf_itf_state_changed

Signed-off-by: Maxence Younsi <[email protected]>
Signed-off-by: Maxence Younsi <[email protected]>
Signed-off-by: Maxence Younsi <[email protected]>
mxyns commented Nov 11, 2024

Hi @riw777 @ton31337,
I recently had a bit of time to rebase and fix the tests and issues. The only test failing now is apparently unrelated (a missing Python package on cleanup).
There is a commit (9d795cb) in this PR that adds a safeguard to bgp_mpath_diff_insert, but I believe there is an underlying issue, which I filed as #17377, so I can remove that commit if the issue is addressed there.
I will apply part of the styling from frrbot now. I think after that this should be good to merge, right?

Signed-off-by: Maxence Younsi <[email protected]>
riw777 commented Nov 12, 2024

I don't think the lint issues are a problem ... we need to figure out the CI though. :-(

mxyns commented Nov 13, 2024

@riw777
OK for the lints.

For the CI, I sadly can't fix it myself: I can't run the tests with ASAN on my machine, and everything else passes locally. Do you know anybody who could look into it?

The BMP test fails only in the address-sanitizer run, not the normal one, so maybe the wait time is too low? There is also a leak found by ASAN that is related to rfapiBiStartWithdrawTimer, not BMP, so I don't know what to do.

BMP ASAN leak found:
=================================================================
==28602==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f1cff6b83b7 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1cff22f096 in qcalloc lib/memory.c:106
    #2 0x55fa6e7980c2 in rfapiBiStartWithdrawTimer bgpd/rfapi/rfapi_import.c:2753
    #3 0x55fa6e79e437 in rfapiProcessPeerDownRt bgpd/rfapi/rfapi_import.c:4070
    #4 0x55fa6e79e6fb in rfapiProcessPeerDown bgpd/rfapi/rfapi_import.c:4125
    #5 0x55fa6e64b331 in bgp_clear_route_all bgpd/bgp_route.c:6267
    #6 0x55fa6e5bdcfb in bgp_fsm_change_status bgpd/bgp_fsm.c:1252
    #7 0x55fa6e7505e6 in peer_delete bgpd/bgpd.c:2777
    #8 0x55fa6e756116 in bgp_delete bgpd/bgpd.c:4142
    #9 0x55fa6e510f5e in bgp_exit bgpd/bgp_main.c:204
    #10 0x55fa6e510f5e in sigint bgpd/bgp_main.c:162
    #11 0x7f1cff2b01d4 in frr_sigevent_process lib/sigevent.c:117
    #12 0x7f1cff2dbc42 in event_fetch lib/event.c:1767
    #13 0x7f1cff20ffe7 in frr_run lib/libfrr.c:1231
    #14 0x55fa6e5128f5 in main bgpd/bgp_main.c:555
    #15 0x7f1cfed0c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f1cff6b83b7 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x7f1cff22f096 in qcalloc lib/memory.c:106
    #2 0x55fa6e7980c2 in rfapiBiStartWithdrawTimer bgpd/rfapi/rfapi_import.c:2753
    #3 0x55fa6e79e437 in rfapiProcessPeerDownRt bgpd/rfapi/rfapi_import.c:4070
    #4 0x55fa6e79e6d2 in rfapiProcessPeerDown bgpd/rfapi/rfapi_import.c:4124
    #5 0x55fa6e64b331 in bgp_clear_route_all bgpd/bgp_route.c:6267
    #6 0x55fa6e5bdcfb in bgp_fsm_change_status bgpd/bgp_fsm.c:1252
    #7 0x55fa6e7505e6 in peer_delete bgpd/bgpd.c:2777
    #8 0x55fa6e756116 in bgp_delete bgpd/bgpd.c:4142
    #9 0x55fa6e510f5e in bgp_exit bgpd/bgp_main.c:204
    #10 0x55fa6e510f5e in sigint bgpd/bgp_main.c:162
    #11 0x7f1cff2b01d4 in frr_sigevent_process lib/sigevent.c:117
    #12 0x7f1cff2dbc42 in event_fetch lib/event.c:1767
    #13 0x7f1cff20ffe7 in frr_run lib/libfrr.c:1231
    #14 0x55fa6e5128f5 in main bgpd/bgp_main.c:555
    #15 0x7f1cfed0c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: 64 byte(s) leaked in 2 allocation(s).

The other failure is also rfapi-related, but this time it is a cleanup test that fails. I found no error log for this test, and it only fails on a platform I can't test on.

******************************************************************************
 Test  Target  Summary                                            Pass  Fail
******************************************************************************
FILE: scripts/add_routes.py
  1  r1  Opened RFAPI                                                1    0
  2  r1  Clean query                                                 1    0
  3  r1  Local registration +0.01 secs                               1    0
  4  r1  Query self                                                  1    0
  5  r3  Opened RFAPI                                                1    0
  6  r3  Local registration +0.02 secs                               1    0
  7  r3  Self excluded                                               1    0
  8  r3  Opened query only RFAPI                                     1    0
  9  r3  See local                                                   1    0
 10  r4  Opened RFAPI                                                1    0
 11  r4  Local registration +0.01 secs                               1    0
 12  r4  Query self                                                  1    0
 13  r4  Local registration +0.01 secs                               1    0
 14  r4  Query self MP                                               1    0
FILE: scripts/adjacencies.py
 15  r1  PE->P2 (loopback) ping +14.65 secs                          1    0
 16  r3  PE->P2 (loopback) ping +0.00 secs                           1    0
 17  r4  PE->P2 (loopback) ping +0.01 secs                           1    0
 18  r2  Core adjacencies up +0.02 secs                              1    0
 19  r1  All adjacencies up +0.02 secs                               1    0
 20  r3  All adjacencies up +0.02 secs                               1    0
 21  r4  All adjacencies up +0.02 secs                               1    0
 22  r1  PE->PE3 (loopback) ping +0.00 secs                          1    0
 23  r1  PE->PE4 (loopback) ping +0.00 secs                          1    0
FILE: scripts/check_routes.py
 24  r1  See all registrations +1.06 secs                            1    0
 25  r3  See all registrations +0.01 secs                            1    0
 26  r4  See all registrations +0.02 secs                            1    0
 27  r1  VPN SAFI okay                                               1    0
 28  r2  VPN SAFI okay                                               1    0
 29  r3  VPN SAFI okay                                               1    0
 30  r4  VPN SAFI okay                                               1    0
 31  r1  Query R2s info                                              1    0
 32  r1  Query R4s info                                              1    0
 33  r3  Query R1s+R4s info                                          1    0
 34  r3  Query R4s info                                              1    0
 35  r4  Query R1s+R4s info                                          1    0
 36  r4  Query R2s info                                              1    0
FILE: scripts/check_close.py
 37  r1  Opened RFAPI                                                1    0
 38  r1  Local registration +0.01 secs                               1    0
 39  r3  See registration +0.53 secs                                 1    0
 40  r4  See registration +0.02 secs                                 1    0
 41  r1  Closed RFAPI                                                1    0
 42  r1  See cleanup +0.02 secs                                      1    0
 43  r3  See cleanup +0.54 secs                                      1    0
 44  r4  See cleanup +0.02 secs                                      1    0
 45  r1  Out of holddown +0.02 secs                                  1    0
 46  r3  Out of holddown +0.01 secs                                  1    0
 47  r4  Out of holddown +0.01 secs                                  1    0
FILE: scripts/check_timeout.py
 48  r1  Holddown factor not set -- skipping test                    1    0
FILE: scripts/cleanup_all.py
 49  r1  Local registration removed +0.02 secs                       1    0
 50  r1  Closed RFAPI                                                1    0
 51  r3  Local registration removed +0.01 secs                       1    0
 52  r3  Closed RFAPI                                                1    0
 53  r4  Local registration removed +0.02 secs                       1    0
 54  r4  Cleared NVEs                                                1    0
 55  r1  All registrations cleared +0.01 secs                        1    0
 56  r3  All registrations cleared +0.01 secs                        1    0
 57  r4  All registrations cleared +0.01 secs                        1    0
 58  r1  VPN SAFI clear                                              0    1
 59  r2  VPN SAFI clear                                              1    0
 60  r3  VPN SAFI clear                                              1    0
 61  r4  VPN SAFI clear                                              1    0
 62  r1  No holddowns +0.01 secs                                     1    0
 63  r3  No holddowns +0.01 secs                                     1    0
 64  r4  No holddowns +0.01 secs                                     1    0
******************************************************************************
 Total                                                        64   63    1
******************************************************************************

if (afi == AFI_UNSPEC) {
zlog_warn(
"skipping bmp message for reason: can't get peer distinguisher (no RD configured)");
return 1;
Member:

Please, no log message when returning 1; it will be emitted too many times if the VRF has not yet come up.

@mxyns (Contributor Author) Nov 18, 2024

Okay, but how should we communicate this kind of issue then? It will be hard to troubleshoot without logs.

Member:

  1. About BMP messages: I suggest you gate the message behind a debug condition before displaying it (debug bgp bmp?).
  2. About the error case: actually, not having an RD may not be an issue; this is what happens when you configure BMP on a VRF instance.
    I hesitate between:
  • keeping the peer type as RD Instance Peer and falling back to RD value 0:<vrf_id>;
  • moving the peer type to Local Instance Peer and using RD value 0:<vrf_id>.

The only case where the message will not be sent is when the VRF_ID is unknown.
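
The debug-gated logging suggested here could look roughly like the following sketch; bgp_debug_bmp_enabled and the flag variable are hypothetical stand-ins for whatever "debug bgp bmp" knob ends up being used:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug flag, toggled by a "debug bgp bmp" CLI command
 * in a real implementation. */
static int debug_bmp_flag;

static int bgp_debug_bmp_enabled(void)
{
	return debug_bmp_flag;
}

/* Counter only so the effect is observable in this sketch. */
static int log_count;

/* Emit the "no RD configured" diagnostic only when BMP debugging is
 * enabled, so a VRF that has not yet come up does not flood the log. */
static void bmp_debug_no_rd(void)
{
	if (bgp_debug_bmp_enabled()) {
		log_count++;
		fprintf(stderr,
			"bmp: skipping message, no RD configured for vrf\n");
	}
}
```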

Contributor Author:

I would prefer the 2nd approach, since it would avoid clashes for the RD Instance Peer case where one VRF falls back to 0:<vrf_id> while another is actually configured with, say, RD 0:2.
Do you want to include that in this PR as well?
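
A minimal sketch of the fallback being discussed, under stated assumptions (bmp_pick_peer_info and the BMP_PEER_TYPE_* values are placeholders, not FRR's code): when no RD is configured, use the Local Instance Peer type with a distinguisher derived from the vrf_id as 0:<vrf_id>.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder peer-type codes for illustration only. */
#define BMP_PEER_TYPE_RD_INSTANCE    1
#define BMP_PEER_TYPE_LOCAL_INSTANCE 2

struct bmp_peer_hdr_info {
	uint8_t peer_type;
	uint64_t distinguisher; /* 8-byte Peer Distinguisher field */
};

/* has_rd: whether an RD is configured for the VRF.
 * rd: the configured RD as a 64-bit value, valid when has_rd != 0. */
static struct bmp_peer_hdr_info
bmp_pick_peer_info(int has_rd, uint64_t rd, uint32_t vrf_id)
{
	struct bmp_peer_hdr_info info;

	if (has_rd) {
		info.peer_type = BMP_PEER_TYPE_RD_INSTANCE;
		info.distinguisher = rd;
	} else {
		/* Fallback 0:<vrf_id>: type and administrator fields are
		 * zero, the assigned-number field carries the vrf_id. */
		info.peer_type = BMP_PEER_TYPE_LOCAL_INSTANCE;
		info.distinguisher = (uint64_t)vrf_id;
	}
	return info;
}
```

This keeps a configured RD of 0:2 distinct from a fallback 0:2, because the two use different peer types.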

/* sends a bmp monitoring message using the given information on the bmp session
*
* if uptime is (time_t)(-1L) then do not include the timestamp in the message
*/
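
The sentinel convention in the comment above can be illustrated as follows; the two 32-bit fields mirror the seconds/microseconds layout of the BMP per-peer header timestamp, and bmp_put_timestamp is a hypothetical helper, not the PR's actual function:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Sketch: when uptime is the sentinel (time_t)(-1L), write the per-peer
 * header timestamp fields as zero instead of a wall-clock time. */
static void bmp_put_timestamp(uint32_t out[2], time_t uptime)
{
	if (uptime == (time_t)-1L) {
		out[0] = 0; /* seconds: not recorded */
		out[1] = 0; /* microseconds: not recorded */
	} else {
		out[0] = (uint32_t)uptime;
		out[1] = 0; /* sub-second part not tracked in this sketch */
	}
}
```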
Member:

Actually, I saw the undefined timer issue in BMP mirroring messages.

Contributor Author:

You saw BMP mirror packets with zero-filled timestamp fields in the per-peer header? I did not affect mirroring, afaik; the timestamps in mirroring messages come from the BMP mirror queue item's timestamp.

Member:

OK. I mean that if you added the test for time == -1, then you know there are some cases where the time value is not set. I suppose this is something that should be done later in a separate PR, hence the need to write it down somewhere, in doc/developer/bmp.rst for instance.

Contributor Author:

Okay, that should already be specified in doc/developer/bmp.rst:

RFC8671: BMP Adj-RIB-Out

[...]

  • Per-Peer Header
    • Timestamp: not recorded, set to 0

Please tell me if you want me to be more specific there.


peer = QOBJ_GET_TYPESAFE(bqe->peerid, peer);
if (!peer) {
/* skipping queued item for deleted peer
*/
zlog_info("bmp: skipping queued item for deleted peer");
goto out;
Member:

Please, no trace; it will flood the console.


github-actions bot commented Dec 3, 2024

This pull request has conflicts, please resolve those before we can evaluate the pull request.

5 participants