BFD regression in 10.2 #17452

Closed
2 tasks done
stgraber opened this issue Nov 17, 2024 · 12 comments

@stgraber

Description

I just upgraded a bunch of routers from FRR 10.1 to 10.2 and noticed that several of my BGP peers were down. Looking closer, I found that they were all peers with BFD enabled.

Version

frr01# show version
FRRouting 10.2 (frr01) on Linux(6.11.7-zabbly+).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--sbindir=/usr/lib/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--enable-pim6d' '--disable-grpc' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'

How to reproduce

Hard to tell. It looks like upgrading from 10.1 to 10.2 when you have a bunch of peer-groups configured to use BFD will cause the issue. But I suspect it's somewhat racy, as I've found that messing with the config and reloading does eventually nudge BFD into sending packets again.

Expected behavior

FRR should send out BFD frames to its BFD peers and, once both sides are doing that, bring up the session.

Actual behavior

FRR doesn't appear to be sending out BFD frames at all, at least until something nudges that part of the config, at which point a reload kicks it back on and fixes the issue.

Additional context

I'm sorry this is a bit light on details. As most of my peers don't use BFD, I didn't immediately notice the issue, so I then had to rush a revert and therefore can't provide too many details on what happened.

The config is pretty straightforward though with a peer group like this one:

 neighbor DCMTL-V4 peer-group
 neighbor DCMTL-V4 remote-as 399760
 neighbor DCMTL-V4 bfd
 neighbor DCMTL-V6 peer-group
 neighbor DCMTL-V6 remote-as 399760
 neighbor DCMTL-V6 bfd

Each group then has a few peers, like these:

 neighbor 45.45.148.162 peer-group DCMTL-V4
 neighbor 45.45.148.162 description DCMTL-V4
 neighbor 45.45.148.163 peer-group DCMTL-V4
 neighbor 45.45.148.163 description DCMTL-V4
 neighbor 2602:fc62:a:200::162 peer-group DCMTL-V6
 neighbor 2602:fc62:a:200::162 description DCMTL-V6
 neighbor 2602:fc62:a:200::163 peer-group DCMTL-V6
 neighbor 2602:fc62:a:200::163 description DCMTL-V6

There is no other BFD config in frr.conf, all those peers are directly connected (no multi-hop), and there is no other config for those peers that could cause issues here (such as an alternative source IP).
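
For anyone comparing their own setup against this one, here is a minimal Python sketch that lists which neighbors pick up bfd through a peer-group. It assumes the flat "neighbor ..." syntax quoted above and nothing else; the names CONFIG and bfd_peers_by_group are made up for illustration and are not part of FRR.

```python
from collections import defaultdict

# Sample lines copied from the configuration quoted in this report.
CONFIG = """\
 neighbor DCMTL-V4 peer-group
 neighbor DCMTL-V4 remote-as 399760
 neighbor DCMTL-V4 bfd
 neighbor DCMTL-V6 peer-group
 neighbor DCMTL-V6 remote-as 399760
 neighbor DCMTL-V6 bfd
 neighbor 45.45.148.162 peer-group DCMTL-V4
 neighbor 45.45.148.162 description DCMTL-V4
 neighbor 45.45.148.163 peer-group DCMTL-V4
 neighbor 2602:fc62:a:200::162 peer-group DCMTL-V6
 neighbor 2602:fc62:a:200::163 peer-group DCMTL-V6
"""


def bfd_peers_by_group(config: str) -> dict[str, list[str]]:
    """Map each name that has a bare 'bfd' statement to the neighbors joining it.

    Simplification: a 'neighbor <address> bfd' line on an individual neighbor
    is treated like a group with no members.
    """
    bfd_names = set()
    members = defaultdict(list)
    for line in config.splitlines():
        words = line.split()
        if len(words) < 3 or words[0] != "neighbor":
            continue
        name, rest = words[1], words[2:]
        if rest == ["bfd"]:
            bfd_names.add(name)
        elif rest[0] == "peer-group" and len(rest) == 2:
            # 'neighbor <peer> peer-group <group>' assigns the peer to a group.
            members[rest[1]].append(name)
    return {name: members[name] for name in sorted(bfd_names)}


if __name__ == "__main__":
    for group, peers in bfd_peers_by_group(CONFIG).items():
        print(group, peers)
```

Every address it prints is a neighbor that should end up with a BFD session, which makes it easy to compare against what show bfd peers reports after the upgrade.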

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
stgraber added the triage (Needs further investigation) label on Nov 17, 2024
@ton31337
Member

ton31337 commented Nov 17, 2024

Can we get the debug logs (debug bgp bfd)? Also show bfd peers.

@draggeta

We're having the same issue. We just upgraded and BFD is flaky. I don't know whether the BGP peers need to be in a peer-group to trigger it, but sometimes when a BGP connection comes up, BFD stays down.

BFD Peers:
	peer fd33::3 vrf CORE
		ID: 2469380704
		Remote ID: 1992622085
		Active mode
		Status: down
		Downtime: 32 second(s)
		Diagnostics: neighbor signaled session down
		Remote diagnostics: ok
		Peer Type: configured
		RTT min/avg/max: 0/0/0 usec
		Local timers:
			Detect-multiplier: 3
			Receive interval: 300ms
			Transmission interval: 300ms
			Echo receive interval: 50ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 1000ms
			Transmission interval: 1000ms
			Echo receive interval: disabled

	peer fd33::1 vrf CORE
		ID: 4259795298
		Remote ID: 748027914
		Active mode
		Status: down
		Downtime: 32 second(s)
		Diagnostics: neighbor signaled session down
		Remote diagnostics: ok
		Peer Type: configured
		RTT min/avg/max: 0/0/0 usec
		Local timers:
			Detect-multiplier: 3
			Receive interval: 300ms
			Transmission interval: 300ms
			Echo receive interval: 50ms
			Echo transmission interval: disabled
		Remote timers:
			Detect-multiplier: 3
			Receive interval: 1000ms
			Transmission interval: 1000ms
			Echo receive interval: disabled
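
To pick out the affected sessions from a longer listing, the text output above can be scanned with a short script. This is only a sketch that assumes the layout shown here (a "peer ..." line followed by "Key: value" lines); parse_bfd_peers and SAMPLE are illustrative names. FRR also offers a JSON variant of the command (show bfd peers json), which is likely a more robust input if it is available on your build.

```python
# Trimmed copy of the 'show bfd peers' output quoted above.
SAMPLE = """\
BFD Peers:
    peer fd33::3 vrf CORE
        ID: 2469380704
        Remote ID: 1992622085
        Status: down
        Diagnostics: neighbor signaled session down
        Peer Type: configured

    peer fd33::1 vrf CORE
        ID: 4259795298
        Remote ID: 748027914
        Status: down
        Diagnostics: neighbor signaled session down
        Peer Type: configured
"""


def parse_bfd_peers(text):
    """Return one dict per peer with the 'Key: value' fields found under it.

    Caveat: nested sections (local/remote timers) are flattened, so keys that
    repeat in both sections keep only the last value seen.
    """
    peers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("peer "):
            words = line.split()
            current = {"peer": words[1], "vrf": None}
            if "vrf" in words[2:-1]:
                current["vrf"] = words[words.index("vrf", 2) + 1]
            peers.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return peers


def down_peers(text):
    """List (peer, vrf, diagnostics) for every session whose status is down."""
    return [
        (p["peer"], p["vrf"], p.get("Diagnostics"))
        for p in parse_bfd_peers(text)
        if p.get("Status") == "down"
    ]


if __name__ == "__main__":
    for entry in down_peers(SAMPLE):
        print(entry)
```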

My configuration is similar to @stgraber's. The debug commands don't seem to generate much output. I only got some when I disabled BFD for the BGP group:

2024/11/20 23:10:47 BGP: [JG0WZ-7X009][EC 33554504] fd33::3 unrecognized capability code: 128 - ignored
2024/11/20 23:10:47 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from fd33::3 in vrf CORE
2024/11/20 23:10:59 BGP: [JG0WZ-7X009][EC 33554504] fd33::1 unrecognized capability code: 128 - ignored
2024/11/20 23:10:59 BGP: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv6 Unicast from fd33::1 in vrf CORE

ton31337 self-assigned this on Nov 23, 2024
@ton31337
Member

Could you also send debug bfd peer output?

@ychamps

ychamps commented Dec 17, 2024

Hi,
I have a fresh FRR 10.2 install and am having issues with BFD session establishment.
The workaround is to remove all BFD configuration and add only:
peer 192.168.246.22 local-address 192.168.246.17

No BFD config in BGP section.

Regards.

@ne-vlezay80
Contributor

Please provide a tcpdump.

@ne-vlezay80
Contributor

Set local-address on the BFD peer, or set the local source address from a route-map.

@ne-vlezay80
Contributor

If you configure BFD on the BGP peers and restart FRR, the corresponding BFD peers are not available.

@ne-vlezay80
Contributor

neighbor 172.30.255.25 bfd

neigh# show bfd peer
BFD Peers:

@ne-vlezay80
Contributor

> Can we get the debug logs (debug bgp bfd)? Also show bfd peers.

Jan  3 05:17:45 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:46 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:47 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:48 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:48 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:49 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:50 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:51 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:52 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:53 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
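
Since the same bfdd message repeats about once per second, it can help to collapse the log into counts per peer before attaching it. The sketch below assumes only the line format shown above; the regular expression and the names LOG and summarize are illustrative, and it makes no attempt to interpret what the message means.

```python
import re
from collections import Counter

# Three of the bfdd lines quoted above, copied verbatim.
LOG = """\
Jan  3 05:17:45 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:46 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
Jan  3 05:17:47 neigh daemon.debug bfdd[407]: [YA0Q5-C0BPV] control-packet: 'remote discriminator' is zero, not overridden [mhop:no peer:172.30.255.24 local:172.30.255.25 port:14]
"""

# Matches the 'control-packet: <reason> [mhop:.. peer:.. local:.. port:..]' tail.
PATTERN = re.compile(
    r"control-packet: (?P<reason>.+?) "
    r"\[mhop:(?P<mhop>\S+) peer:(?P<peer>\S+) "
    r"local:(?P<local>\S+) port:(?P<port>\d+)\]"
)


def summarize(log):
    """Count control-packet messages per (peer, local, reason)."""
    counts = Counter()
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match["peer"], match["local"], match["reason"])] += 1
    return counts


if __name__ == "__main__":
    for (peer, local, reason), count in summarize(LOG).items():
        print(f"{count}x peer={peer} local={local}: {reason}")
```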

@ton31337
Member

ton31337 commented Jan 3, 2025

This does not look like a debug, do you have log syslog debug enabled?

@ne-vlezay80
Contributor

> This does not look like a debug, do you have log syslog debug enabled?

2025-01-03T07:56:48.584012+00:00 neigh zebra[1128]: [N5M5Y-J5BPG][EC 4043309121] Client 'bfd' (session id 0) encountered an error and is shutting down.
2025-01-03T07:56:48.588317+00:00 neigh zebra[1128]: [JPSA8-5KYEA] client 29 disconnected 0 bfd routes removed from the rib
2025-01-03T07:56:48.588349+00:00 neigh zebra[1128]: [S929C-NZR3N] client 29 disconnected 0 bfd nhgs removed from the rib
2025-01-03T07:56:49.530674+00:00 neigh watchfrr[1572]: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2025-01-03T07:56:49.747229+00:00 neigh bfdd[1590]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-01-03T07:56:49.764947+00:00 neigh watchfrr[1572]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
2025-01-03T07:58:32.043066+00:00 neigh /etc/init.d/fixbfdbug[1699]: start-stop-daemon: no matching processes found
2025-01-03T07:58:32.177015+00:00 neigh zebra[1585]: [N5M5Y-J5BPG][EC 4043309121] Client 'bfd' (session id 0) encountered an error and is shutting down.
2025-01-03T07:58:32.179758+00:00 neigh zebra[1585]: [JPSA8-5KYEA] client 29 disconnected 0 bfd routes removed from the rib
2025-01-03T07:58:32.179792+00:00 neigh zebra[1585]: [S929C-NZR3N] client 29 disconnected 0 bfd nhgs removed from the rib
2025-01-03T07:58:32.197491+00:00 neigh bgpd[1595]: [MNSF9-KVB43] _bfd_sess_send: BFD session 172.30.255.25 -> 172.30.255.24 interface qt-swep0 VRF default(0) was not uninstalled
2025-01-03T07:58:33.157461+00:00 neigh watchfrr[1857]: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2025-01-03T07:58:33.372598+00:00 neigh bfdd[1875]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-01-03T07:58:33.386196+00:00 neigh watchfrr[1857]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
2025-01-03T08:06:25.491750+00:00 neigh zebra[1870]: [N5M5Y-J5BPG][EC 4043309121] Client 'bfd' (session id 0) encountered an error and is shutting down.
2025-01-03T08:06:25.502881+00:00 neigh zebra[1870]: [JPSA8-5KYEA] client 29 disconnected 0 bfd routes removed from the rib
2025-01-03T08:06:25.502919+00:00 neigh zebra[1870]: [S929C-NZR3N] client 29 disconnected 0 bfd nhgs removed from the rib
2025-01-03T08:06:28.676170+00:00 neigh watchfrr[424]: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2025-01-03T08:06:28.880708+00:00 neigh bfdd[442]: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025-01-03T08:06:28.893631+00:00 neigh watchfrr[424]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
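
The watchfrr lines above show bfdd going down and coming back up around each restart. A rough way to pull those transitions out of a longer syslog, and to see how long each down period lasted, is sketched below. It assumes the ISO 8601 timestamp format used in this excerpt; LOG, transitions and down_up_gaps are illustrative names. It does not explain why zebra reported the bfd client error.

```python
import re
from datetime import datetime

# Two down/up pairs copied from the watchfrr lines quoted above.
LOG = """\
2025-01-03T07:56:49.530674+00:00 neigh watchfrr[1572]: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2025-01-03T07:56:49.764947+00:00 neigh watchfrr[1572]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
2025-01-03T07:58:33.157461+00:00 neigh watchfrr[1857]: [ZCJ3S-SPH5S] bfdd state -> down : initial connection attempt failed
2025-01-03T07:58:33.386196+00:00 neigh watchfrr[1857]: [QDG3Y-BY5TN] bfdd state -> up : connect succeeded
"""

STATE = re.compile(
    r"^(?P<ts>\S+) \S+ watchfrr\[\d+\]: \[[^\]]+\] "
    r"(?P<daemon>\S+) state -> (?P<state>\w+) : (?P<reason>.*)$"
)


def transitions(log):
    """Return (timestamp, daemon, state, reason) for each watchfrr state line."""
    events = []
    for line in log.splitlines():
        match = STATE.match(line)
        if match:
            events.append(
                (
                    datetime.fromisoformat(match["ts"]),
                    match["daemon"],
                    match["state"],
                    match["reason"],
                )
            )
    return events


def down_up_gaps(events):
    """Pair each 'down' with the next 'up' for the same daemon, in seconds."""
    gaps = []
    last_down = {}
    for ts, daemon, state, _reason in events:
        if state == "down":
            last_down[daemon] = ts
        elif state == "up" and daemon in last_down:
            gaps.append((daemon, (ts - last_down.pop(daemon)).total_seconds()))
    return gaps


if __name__ == "__main__":
    for daemon, seconds in down_up_gaps(transitions(LOG)):
        print(f"{daemon} was reported down for {seconds:.3f}s")
```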

ton31337 removed the triage (Needs further investigation) label on Jan 3, 2025
@ton31337
Member

ton31337 commented Jan 3, 2025

Duplicate of #17751, closing this one.

ton31337 closed this as completed on Jan 3, 2025