
mgmtd crashes occasionally #15262

Closed
donaldsharp opened this issue Jan 30, 2024 · 1 comment · Fixed by #15286
Labels
triage Needs further investigation

Comments

@donaldsharp
Member

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000794b8d442476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  0x0000794b8d93d6bd in core_handler (signo=6, siginfo=0x7fff9f7a7f70, context=0x7fff9f7a7e40) at lib/sigevent.c:248
#5  <signal handler called>
#6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#7  __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#8  __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9  0x0000794b8d442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x0000794b8d4287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x0000794b8d983bcf in _zlog_assert_failed (xref=0x576519664220 <_xref.81>, extra=0x0) at lib/zlog.c:692
#12 0x000057651962740e in mgmt_fe_session_handle_setcfg_req_msg (session=0x57651b7c80f0, setcfg_req=0x57651b7cd5a0)
    at mgmtd/mgmt_fe_adapter.c:727
#13 0x0000576519628191 in mgmt_fe_adapter_handle_msg (adapter=0x57651b7cd6d0, fe_msg=0x57651b7f7a90) at mgmtd/mgmt_fe_adapter.c:1010
#14 0x0000576519628e78 in mgmt_fe_adapter_process_msg (version=0 '\000', 
    data=0x57651b7d46c8 ":9\b\002\020\002\030\003\"-\n)\n'/frr-interface:lib/interface[name='lo']\020\001(", len=59, 
    conn=0x57651b494df0) at mgmtd/mgmt_fe_adapter.c:1285
#15 0x0000794b8d8f39b8 in mgmt_msg_procbufs (ms=0x57651b494df8, handle_msg=0x576519628ccb <mgmt_fe_adapter_process_msg>, 
    user=0x57651b494df0, debug=false) at lib/mgmt_msg.c:193
#16 0x0000794b8d8f48c2 in msg_conn_proc_msgs (thread=0x7fff9f7a9030) at lib/mgmt_msg.c:526
#17 0x0000794b8d958a0f in event_call (thread=0x7fff9f7a9030) at lib/event.c:2003
#18 0x0000794b8d8d9651 in frr_run (master=0x57651b3a7e10) at lib/libfrr.c:1214
#19 0x00005765196207e9 in main (argc=7, argv=0x7fff9f7a9268) at mgmtd/mgmt_main.c:271
(gdb) 

This routinely happens on my machine.

@donaldsharp donaldsharp added the triage Needs further investigation label Jan 30, 2024
@donaldsharp
Member Author

From talking with Chris and Igor:

So it turns out it’s pretty easy to fix the crash by simply removing the assert:
		/* as we have the lock no-one else should have a config txn */
		assert(!mgmt_config_txn_in_progress());
This assumption is just wrong, because there can actually be another config transaction in progress, initiated by mgmtd itself when a new daemon connects to it (Chris should know what I'm talking about).
But this actually uncovers another, probably more serious, problem. This kind of situation can easily happen during startup, because at that moment daemons are starting and connecting to mgmtd (first config transaction) while the config file is being read (second config transaction). If we remove the assert, the crash obviously disappears, but if this race condition occurs, the config reading would fail. And since you can reproduce it easily, it's clearly something that can happen in the real world.
I don't have a solution for this right now. Maybe we should reschedule the transaction when such a situation happens. But should we do that inside mgmtd, or in vty when an error is returned? How many times should we retry? It's 2AM here so I have to go sleep now, but maybe Chris can take a look and propose something. Any ideas are welcome.
BTW, just to mention: this is not something new; it has been there from the very beginning of mgmtd, but we didn't encounter the problem when only staticd was converted, like a lot of other issues in mgmtd that we identified and fixed recently while converting the other daemons.
