
mgmtd crashes occasionally #15262

Closed
donaldsharp opened this issue Jan 30, 2024 · 1 comment · Fixed by #15286
Labels
triage Needs further investigation

Comments

@donaldsharp
Member

(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000794b8d442476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4  0x0000794b8d93d6bd in core_handler (signo=6, siginfo=0x7fff9f7a7f70, context=0x7fff9f7a7e40) at lib/sigevent.c:248
#5  <signal handler called>
#6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#7  __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#8  __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9  0x0000794b8d442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x0000794b8d4287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x0000794b8d983bcf in _zlog_assert_failed (xref=0x576519664220 <_xref.81>, extra=0x0) at lib/zlog.c:692
#12 0x000057651962740e in mgmt_fe_session_handle_setcfg_req_msg (session=0x57651b7c80f0, setcfg_req=0x57651b7cd5a0)
    at mgmtd/mgmt_fe_adapter.c:727
#13 0x0000576519628191 in mgmt_fe_adapter_handle_msg (adapter=0x57651b7cd6d0, fe_msg=0x57651b7f7a90) at mgmtd/mgmt_fe_adapter.c:1010
#14 0x0000576519628e78 in mgmt_fe_adapter_process_msg (version=0 '\000', 
    data=0x57651b7d46c8 ":9\b\002\020\002\030\003\"-\n)\n'/frr-interface:lib/interface[name='lo']\020\001(", len=59, 
    conn=0x57651b494df0) at mgmtd/mgmt_fe_adapter.c:1285
#15 0x0000794b8d8f39b8 in mgmt_msg_procbufs (ms=0x57651b494df8, handle_msg=0x576519628ccb <mgmt_fe_adapter_process_msg>, 
    user=0x57651b494df0, debug=false) at lib/mgmt_msg.c:193
#16 0x0000794b8d8f48c2 in msg_conn_proc_msgs (thread=0x7fff9f7a9030) at lib/mgmt_msg.c:526
#17 0x0000794b8d958a0f in event_call (thread=0x7fff9f7a9030) at lib/event.c:2003
#18 0x0000794b8d8d9651 in frr_run (master=0x57651b3a7e10) at lib/libfrr.c:1214
#19 0x00005765196207e9 in main (argc=7, argv=0x7fff9f7a9268) at mgmtd/mgmt_main.c:271
(gdb) 

This routinely happens on my machine.

@donaldsharp donaldsharp added the triage Needs further investigation label Jan 30, 2024
@donaldsharp
Member Author

From talking with Chris and Igor:

So it turns out it’s pretty easy to fix the crash by simply removing the assert:
		/* as we have the lock no-one else should have a config txn */
		assert(!mgmt_config_txn_in_progress());
This assumption is just wrong, because there can actually be another config transaction in progress, initiated by mgmtd itself when a new daemon connects to it (Chris should know what I'm talking about).
But this actually uncovers another, probably more serious, problem. This kind of situation can easily happen during startup, because at that moment daemons are starting and connecting to mgmtd (first config transaction) while the config file is being read (second config transaction). If we remove the assert, the crash obviously disappears, but if this race condition occurs, the config reading would fail. And since you can reproduce it easily, it's clearly something that can happen in the real world.
I don't have a solution for this right now. Maybe we should reschedule the transaction when such a situation happens. But should we do that inside mgmtd, or in vty when an error is returned? How many times should we retry? It's 2AM here so I have to go sleep now, but maybe Chris can take a look and propose something. Any ideas are welcome.
BTW, just to mention: this is not something new; it has been there from the very beginning of mgmtd, but we didn't encounter the problem when only staticd was converted, like a lot of other issues in mgmtd that we identified and fixed recently while converting the other daemons.
