BGP session remains down, nexthop_set failed, bgp_getsockname() failed #12792

Closed
1 of 2 tasks
SwimGeek opened this issue Feb 11, 2023 · 11 comments
Labels
triage Needs further investigation

Comments

@SwimGeek

SwimGeek commented Feb 11, 2023

Describe the bug

I upgraded from FRR 8.4.1 to FRR 8.4.2 and rebooted the router. All v4 and v6 BGP sessions came back up, except for one v4 session on a vlan interface. I had to down and up the interface for the BGP session to come back up.

I remember running into a similar situation a while back. It was a netlink buffer size problem: issue #10404.

I suspect the VLAN interface name may have confused zebra in some way, or maybe there is a limit to the number of interfaces it is happy with: vtysh -c "show interface brief" | wc -l prints 56.

I noticed the following in the log file:

Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [VX6SM-8YE5W][EC 33554460] 196.250.236.nnn: nexthop_set failed, resetting connection - intf 0x0
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 196.250.236.nnn, fd 122
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [HZN6M-XRM1G] %NOTIFICATION: sent to neighbor 196.250.236.nnn 5/0 (Neighbor Events Error/Unspecific) 0 bytes
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [J7484-2SYXF][EC 33554465] 196.250.236.nnn [FSM] Failure handling event TCP_connection_open in state Connect, prior events BGP_Start, BGP_Start, fd 122
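To spot this failure signature in a larger log file, a small parser along these lines may help. It is only a sketch based on the log lines quoted in this thread: it assumes the bracketed log ID always has the form XXXXX-XXXXX and that the optional [EC n] tag follows it directly. It matches on the message text rather than the log ID, because the ID differs between FRR versions in this thread. The function name is made up for illustration.

```python
import re

# Bracketed FRR log ID, optional error code, then the message text.
# Pattern derived only from the lines quoted in this thread.
LOG_RE = re.compile(
    r"\[(?P<log_id>[A-Z0-9]{5}-[A-Z0-9]{5})\]"
    r"(?:\[EC (?P<ec>\d+)\])? (?P<msg>.*)$"
)

# The failure message itself: "<peer>: nexthop_set failed, ... - intf <value>"
NEXTHOP_RE = re.compile(
    r"^(?P<peer>\S+): nexthop_set failed, resetting connection - intf (?P<intf>.+)$"
)


def find_nexthop_failures(lines):
    """Return a list of (peer, intf) tuples, one per 'nexthop_set failed' line."""
    hits = []
    for line in lines:
        log = LOG_RE.search(line)
        if log is None:
            continue  # not an FRR log line in the expected shape
        failure = NEXTHOP_RE.match(log.group("msg"))
        if failure:
            hits.append((failure.group("peer"), failure.group("intf")))
    return hits
```

Feeding it the lines of a saved log file (for example via open(path) as an iterable) gives the list of affected peers.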

I could ping the neighbor v4 IP, and the IPv6 BGP session on the same interface was up.

I then used ifdown to bring down the vlan interface and brought it back up:
ifdown enp3s0.1576
ifup enp3s0.1576

After this both the v4 and v6 BGP sessions on the interface were up, with the following in the log:

Feb 11 03:34:24 cpt-ter-r1 bgpd[828]: [VCGF0-X62M1][EC 100663301] INTERFACE_STATE: Cannot find IF enp3s0.1576 in VRF 0
Feb 11 03:34:43 cpt-ter-r1 bgpd[828]: [N9HHH-F8H1M] %ADJCHANGE: neighbor 196.250.236.nnn(Unknown) in vrf default Up
Feb 11 03:34:45 cpt-ter-r1 bgpd[828]: [ZM2F8-MV4BJ][EC 33554509] Interface: enp3s0.1576 does not have a v6 LL address associated with it, waiting until one is created for it
Feb 11 03:34:54 cpt-ter-r1 zebra[739]: [WPPMZ-G9797] if_zebra_speed_update: enp3s0.1576 old speed: 0 new speed: 10000
  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

To Reproduce

Hard to say. I'm not sure what caused the problem.

Expected behavior

Start with all BGP sessions coming back up.

Versions

  • OS Version: Debian 11.6
  • Kernel: 5.10.162
  • FRR Version: 8.4.2
@SwimGeek SwimGeek added the triage Needs further investigation label Feb 11, 2023
@SwimGeek
Author

Also, why does it print '(Unknown)' in the '%ADJCHANGE: neighbor 196.250.236.nnn(Unknown)' log entry? It does not seem to be a DNS reverse lookup.

@SwimGeek
Author

Update: Rebooted the router - this time all the BGP sessions came up without any issues. Did not make any changes.

@ton31337
Member

Also, why does it print '(Unknown)' in the '%ADJCHANGE: neighbor 196.250.236.nnn(Unknown)' log entry? It does not seem to be a DNS reverse lookup.

'Unknown' means that the BGP FQDN capability was not received from the peer, so the peer's FQDN is unknown.

Regarding the "does not have a v6 LL address associated with it, waiting until one is created for it" message: it's misleading, and it's fixed in 8.5 (and master). 8.5 is going to be released soon.
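The '(Unknown)' answer can be illustrated with a few lines of Python. This is not FRR's code, only a sketch that reproduces the two %ADJCHANGE lines seen in this thread: the text in parentheses is the hostname learned through the FQDN capability, and 'Unknown' stands in when no hostname was received. The function name and parameters are made up for illustration.

```python
def adjchange_line(peer_addr, vrf, hostname=None):
    """Build an ADJCHANGE-style log message like the ones quoted in this thread.

    hostname is what the peer advertised via the BGP FQDN capability;
    None means the capability was not received.
    """
    label = hostname if hostname else "Unknown"
    return f"%ADJCHANGE: neighbor {peer_addr}({label}) in vrf {vrf} Up"
```

With hostname="ce2" this yields the line from the vrf-data log; with no hostname it yields the '(Unknown)' variant from the original report.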

@SwimGeek
Author

Hi, thanks for answering the 'Unknown' question.

BGP sessions all started after a 2nd reboot, but I still think something caused that one session to break, and only start after taking down and bringing up the interface. Do you have a test for larger numbers of interfaces with longer than usual names?

@jnicholsDLR

jnicholsDLR commented Feb 17, 2023

I just ran into this same thing today. Only a single interface was affected. The connected devices could ping each other just fine, and the IP address showed up in the kernel but not in FRR. It was resolved immediately by ifdown then ifup on the interface. Excluding my loopback, mgmt, and a vrf interface, I have four 'physical' interfaces, two of which are each split into two VLAN interfaces.
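The symptom described here (an address visible in the kernel but not in FRR) could be checked with a sketch like the following. It assumes the kernel side comes from the JSON output of ip -j addr show. The FRR side is passed in as a plain dict named frr_addrs, a hypothetical structure you would have to fill in yourself from vtysh output, since this thread does not show that format. Both function names are made up for illustration.

```python
import json


def kernel_ipv4_addresses(ip_json_text):
    """Map interface name -> set of 'addr/prefixlen' strings.

    ip_json_text is the text printed by `ip -j addr show`.
    Only IPv4 ("inet") entries are kept.
    """
    result = {}
    for link in json.loads(ip_json_text):
        result[link["ifname"]] = {
            f"{info['local']}/{info['prefixlen']}"
            for info in link.get("addr_info", [])
            if info.get("family") == "inet"
        }
    return result


def missing_in_frr(kernel_addrs, frr_addrs):
    """Return {ifname: addresses} that the kernel has but frr_addrs lacks.

    frr_addrs is an assumed dict of the same shape as kernel_addrs,
    built by the caller from whatever FRR reports.
    """
    missing = {}
    for ifname, addrs in kernel_addrs.items():
        gap = addrs - frr_addrs.get(ifname, set())
        if gap:
            missing[ifname] = gap
    return missing
```

An empty result means every IPv4 address the kernel reports is also present in the FRR-side dict.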

I have not yet found a way to reliably reproduce the issue, as I have only encountered it just this once after maybe 50 - 100 times configuring interfaces exactly like this.

FRR: 8.4-0
OS: ubuntu20.04.1
Kernel: 5.4.0-136-generic
Logs:

Feb 17 21:31:42 vr2 bgpd[12844]: [VX6SM-8YE5W][EC 33554460] 169.254.0.41: nexthop_set failed, resetting connection - intf 0x0
Feb 17 21:31:42 vr2 bgpd[12844]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 169.254.0.41, fd 25
Feb 17 21:31:42 vr2 bgpd[12844]: [HZN6M-XRM1G] %NOTIFICATION: sent to neighbor 169.254.0.41 5/0 (Neighbor Events Error/Unspecific) 0 bytes
Feb 17 21:31:43 vr2 bgpd[12844]: [VCGF0-X62M1][EC 100663301] INTERFACE_STATE: Cannot find IF enp2s0f3.400 in VRF 7
Feb 17 21:31:43 vr2 bgpd[12844]: [YNQCS-MR20J][EC 100663301] INTERFACE_VRF_UPDATE: Cannot find IF enp2s0f3.400 in VRF 7
## down interface ******************************************************
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Link DOWN
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Lost carrier
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Link UP
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Gained carrier
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Link DOWN
Feb 17 21:31:43 vr2 systemd-networkd[559]: enp2s0f3.400: Lost carrier
Feb 17 21:31:43 vr2 bgpd[12844]: [VCGF0-X62M1][EC 100663301] INTERFACE_STATE: Cannot find IF enp2s0f3.400 in VRF 0
Feb 17 21:31:50 vr2 networkd-dispatcher[608]: WARNING:Unknown index 12 seen, reloading interface list
## up interface ******************************************************
Feb 17 21:31:50 vr2 systemd-networkd[559]: enp2s0f3.400: Link UP
Feb 17 21:31:50 vr2 systemd-networkd[559]: enp2s0f3.400: Gained carrier
Feb 17 21:31:50 vr2 bgpd[12844]: [VCGF0-X62M1][EC 100663301] INTERFACE_STATE: Cannot find IF enp2s0f3.400 in VRF 0
Feb 17 21:31:50 vr2 bgpd[12844]: [YNQCS-MR20J][EC 100663301] INTERFACE_VRF_UPDATE: Cannot find IF enp2s0f3.400 in VRF 0
Feb 17 21:31:50 vr2 systemd-udevd[17122]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 17 21:31:50 vr2 systemd-udevd[17122]: Using default interface naming scheme 'v245'.
Feb 17 21:31:50 vr2 bgpd[12844]: [N9HHH-F8H1M] %ADJCHANGE: neighbor 169.254.0.41(ce2) in vrf vrf-data Up
Feb 17 21:31:50 vr2 bgpd[12844]: [M59KS-A3ZXZ] bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 169.254.0.41 in vrf vrf-data

@ongolaboy

Hi,

For the record,

[...]
I noticed the following in the log file:

Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [VX6SM-8YE5W][EC 33554460] 196.250.236.nnn: nexthop_set failed, resetting connection - intf 0x0
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 196.250.236.nnn, fd 122
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [HZN6M-XRM1G] %NOTIFICATION: sent to neighbor 196.250.236.nnn 5/0 (Neighbor Events Error/Unspecific) 0 bytes
Feb 11 03:30:43 cpt-ter-r1 bgpd[828]: [J7484-2SYXF][EC 33554465] 196.250.236.nnn [FSM] Failure handling event TCP_connection_open in state Connect, prior events BGP_Start, BGP_Start, fd 122

Mine was several lines like this

bgp_connect_success: bgp_getsockname(): failed for peer XXXXXXXXX, fd 28
XXXXXXXXX: nexthop_set failed, resetting connection - intf 0x0

Only with the IPv4 address of the peer. IPv6 was working fine.

I could ping the neighbor v4 IP, and the IPv6 BGP session on the same interface was up.

Same here.

I then used ifdown to bring down the vlan interface and brought it back up:
ifdown enp3s0.1576
ifup enp3s0.1576

I only removed my IPv4 address on that interface and I added it again.
ip addr del XXXXXXXXX dev eno3.2001
ip addr add XXXXXXXXX dev eno3.2001
And the BGP IPv4 AFI session went up.
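The remove-and-re-add workaround can be written down as a small helper that only builds the two commands, so they can be reviewed before anything is run. This is a sketch: the function name is made up, and the address and device in the usage note are placeholders. Keep in mind that deleting the address briefly interrupts IPv4 traffic on that interface.

```python
import shlex


def readd_address_commands(address, dev):
    """Return the two `ip` invocations (as argv lists) for the workaround:
    delete the IPv4 address from the interface, then add it back."""
    return [
        ["ip", "addr", "del", address, "dev", dev],
        ["ip", "addr", "add", address, "dev", dev],
    ]


def as_shell_lines(commands):
    """Render argv lists as shell-quoted strings for review or logging."""
    return [shlex.join(cmd) for cmd in commands]
```

For example, as_shell_lines(readd_address_commands("192.0.2.1/24", "eno3.2001")) gives the two lines to paste into a root shell. Running them directly would need something like subprocess.run(cmd, check=True) for each entry.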

Versions

* OS Version: Debian 11.6

same

* Kernel: 5.10.162

5.10.0-21-amd64

* FRR Version: 8.4.2

same

@SwimGeek
Author

SwimGeek commented Mar 2, 2023

Hi

Can I request that this ticket / issue be opened again?

@mehrdadrad

mehrdadrad commented May 8, 2023

I got the same error (nexthop_set failed, resetting connection - intf 0x0) with FRR 8.2.2 (Fedora 36, kernel 5.17.12-300), and it was fixed after a restart. Is this issue fixed in 8.5? @ton31337

BGP: [VX6SM-8YE5W][EC 33554460] 192.168.1.29: nexthop_set failed, resetting connection - intf 0x0
BGP: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 192.168.1.29, fd 27

@SwimGeek
Author

Hi

I just noticed this problem in v8.5.1 - not resolved yet.

@SwimGeek
Author

SwimGeek commented May 12, 2023

Hi

Pretty much exactly the same thing happened with the new FRR version (8.5.1). The BGP session did not come up, and I had to down and up the VLAN interface.

From the log:

May 12 05:04:48 cpt-ter-r1 bgpd[749]: [RZDXH-R2WAC][EC 33554461] 196.250.236.145: nexthop_set failed, resetting connection - intf (Unknown)
May 12 05:04:48 cpt-ter-r1 bgpd[749]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 196.250.236.145, fd 122
May 12 05:04:48 cpt-ter-r1 bgpd[749]: [HZN6M-XRM1G] %NOTIFICATION: sent to neighbor 196.250.236.145 5/0 (Neighbor Events Error/Unspecific) 0 bytes
May 12 05:04:48 cpt-ter-r1 bgpd[749]: [YEXRC-B595X][EC 33554466] 196.250.236.145 [FSM] Failure handling event TCP_connection_open in state Connect, prior events BGP_Start, BGP_Start, fd 122

@straussmarkus

I still have this problem in 8.5.4 on OPNsense 23.7.10-1
