Zebra process crashes intermittently during 'config reload' on the DUT line cards #14092

Closed
mlok-nokia opened this issue Jul 25, 2023 · 7 comments
Labels: triage (Needs further investigation)

@mlok-nokia

Describe the bug
On a T2 chassis line card, running 'sudo config reload -y' intermittently crashes the 'zebra' process, which generates a core file. The crash occurs roughly once in 30 attempts.
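
Because the crash shows up only about once in 30 reloads, a few clean reloads say little about whether a candidate fix works. As a rough sketch, assuming each reload is an independent trial with a fixed crash probability (an assumption; the 1-in-30 figure is only an estimate), the number of consecutive clean reloads needed to reach a given confidence can be estimated like this. The function name is illustrative.

```python
import math


def reloads_needed(crash_rate: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - crash_rate)**n >= confidence.

    Assumes independent reloads with a constant crash probability.
    """
    if not 0.0 < crash_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("crash_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))


# Crash rate taken from the report: about one crash per 30 reloads.
print(reloads_needed(1 / 30, 0.95))
```

Under these assumptions, a 1-in-30 crash rate needs 89 clean reloads before one can be 95% confident that the crash is gone.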

We started seeing the issue with the following sonic-buildimage-msft commit:
Azure/sonic-buildimage-msft@6f19e12

The following logs appear in the bgp docker when the crash happens:

2023-07-09 13:59:40,064 INFO exited: zebra (terminated by SIGSEGV (core dumped); not expected)
2023-07-11 19:39:22,156 INFO exited: zebra (terminated by SIGSEGV (core dumped); not expected)
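
To count crashes across many test runs, log lines like these can be scanned automatically. The sketch below is a minimal example, not an official tool: the regular expression is modeled only on the two lines quoted above, and other supervisord versions or configurations may format the line differently. The names `CRASH_RE` and `find_crashes` are illustrative.

```python
import re
from datetime import datetime

# Pattern modeled on the two log lines quoted in this report (assumption:
# the format is stable across the images under test).
CRASH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO exited: "
    r"(?P<proc>\S+) \(terminated by (?P<sig>SIG[A-Z0-9]+)"
)


def find_crashes(lines, process="zebra"):
    """Return (timestamp, signal) for each signal-terminated exit of `process`."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match and match.group("proc") == process:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            crashes.append((ts, match.group("sig")))
    return crashes


sample = [
    "2023-07-09 13:59:40,064 INFO exited: zebra (terminated by SIGSEGV (core dumped); not expected)",
    "2023-07-11 19:39:22,156 INFO exited: zebra (terminated by SIGSEGV (core dumped); not expected)",
]
for ts, sig in find_crashes(sample):
    print(ts.isoformat(), sig)
```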

To Reproduce
Steps to reproduce the behavior:
On any T2 chassis line card, run 'sudo config reload -y' repeatedly.
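
Since the crash is intermittent, the reproduction step can be automated. The following is a minimal sketch under stated assumptions: the core directory and the settle time are passed in by the caller because neither is given in this report, and the glob is modeled on the attached core file name (zebra.1689104360.44.0.core.gz). All function names are hypothetical.

```python
import subprocess
import time
from pathlib import Path


def run_config_reload():
    """Run the reload command from this report and return its exit code."""
    return subprocess.run(["sudo", "config", "reload", "-y"]).returncode


def zebra_cores(core_dir):
    """Names of zebra core files currently present in core_dir."""
    return {p.name for p in Path(core_dir).glob("zebra.*.core*")}


def reload_until_crash(core_dir, max_attempts, settle_seconds,
                       reload=run_config_reload, sleep=time.sleep):
    """Reload repeatedly until a new zebra core appears.

    Returns (attempt_number, sorted_new_core_names) on the first new core,
    or None if max_attempts reloads complete without one.
    """
    seen = zebra_cores(core_dir)  # ignore cores left over from earlier runs
    for attempt in range(1, max_attempts + 1):
        reload()
        sleep(settle_seconds)  # give the bgp docker time to come back up
        new = zebra_cores(core_dir) - seen
        if new:
            return attempt, sorted(new)
    return None
```

A call might look like `reload_until_crash("/var/core", max_attempts=90, settle_seconds=300)`, where both the path and the settle time are assumptions to adjust for the DUT. The `reload` and `sleep` parameters exist so the loop can be exercised without touching a real device.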

Expected behavior
'sudo config reload' on DUT line cards should not cause any issue. The line cards should come up with all BGP neighbors established and without any crashes or core files.

Actual behavior
The zebra process in the bgp docker crashes and a core file is generated.
We have already raised an issue under sonic-buildimage for this crash; please see:
sonic-net/sonic-buildimage#15803
frr.zip
zebra.1689104360.44.0.core.gz

Versions

admin@ixre-egl-board1:~$ show version

SONiC Software Version: SONiC.HEAD.489499-msft-2205-ndk-d963ac161
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: d963ac161
Build date: Fri Jul  7 18:18:51 UTC 2023
Built by: gitlab-runner@sonic-bld2

Platform: x86_64-nokia_ixr7250e_36x400g-r0
HwSKU: Nokia-IXR7250E-36x100G
ASIC: broadcom
ASIC Count: 2
Serial Number: EAG2-04-210
Model Number: N/A
Hardware Revision: 56
Uptime: 15:45:52 up 1 day, 12:15,  3 users,  load average: 1.56, 1.54, 1.59
Date: Wed 12 Jul 2023 15:45:52
  • FRR Version: frr_8.2.2-sonic-0_amd64.deb

@mlok-nokia mlok-nokia added the triage Needs further investigation label Jul 25, 2023
@mlok-nokia
Author

After checking the previous test history, we found that this crash was also seen in the April OC test run.

@donaldsharp
Member

8.2 is now considered very old and is typically not supported from a FRR community perspective at this point in time.

First thing is that I would try this fix:
commit 0eaa652
Author: Donald Sharp [email protected]
Date: Fri May 19 09:54:05 2023 -0400

zebra: Do not allow old FPM to access freed memory after shutdown

If that doesn't work I would recommend upgrading to a newer release and then trying that to see if the problem still exists

@eqvinox
Contributor

eqvinox commented Jul 25, 2023

@mlok-nokia I would recommend opening this as a bug with Sonic. Basically Sonic is trying to maintain an older stable version which is not really active in the FRRouting community anymore. We can look at this if the problem can be demonstrated in 8.5 or master, but the work to reproduce the issue on a newer version is with you / Sonic.

@abdosi

abdosi commented Aug 3, 2023

> 8.2 is now considered very old and is typically not supported from a FRR community perspective at this point in time.
>
> First thing is that I would try this fix:
> commit 0eaa652
> Author: Donald Sharp [email protected]
> Date: Fri May 19 09:54:05 2023 -0400
>
> zebra: Do not allow old FPM to access freed memory after shutdown
>
> If that doesn't work I would recommend upgrading to a newer release and then trying that to see if the problem still exists

@mlok-nokia, can you try this patch and see if it helps?

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

@frrbot

frrbot bot commented Jan 31, 2024

This issue will be automatically closed in the specified period unless there is further activity.

@frrbot frrbot bot closed this as completed Feb 7, 2024
@frrbot frrbot bot removed the autoclose label Feb 7, 2024