You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x0000794b8d442476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
#4 0x0000794b8d93d6bd in core_handler (signo=6, siginfo=0x7fff9f7a7f70, context=0x7fff9f7a7e40) at lib/sigevent.c:248
#5 <signal handler called>
#6 __pthread_kill_implementation (no_tid=0, signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:44
#7 __pthread_kill_internal (signo=6, threadid=133365406277568) at ./nptl/pthread_kill.c:78
#8 __GI___pthread_kill (threadid=133365406277568, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#9 0x0000794b8d442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#10 0x0000794b8d4287f3 in __GI_abort () at ./stdlib/abort.c:79
#11 0x0000794b8d983bcf in _zlog_assert_failed (xref=0x576519664220 <_xref.81>, extra=0x0) at lib/zlog.c:692
#12 0x000057651962740e in mgmt_fe_session_handle_setcfg_req_msg (session=0x57651b7c80f0, setcfg_req=0x57651b7cd5a0)
at mgmtd/mgmt_fe_adapter.c:727
#13 0x0000576519628191 in mgmt_fe_adapter_handle_msg (adapter=0x57651b7cd6d0, fe_msg=0x57651b7f7a90) at mgmtd/mgmt_fe_adapter.c:1010
#14 0x0000576519628e78 in mgmt_fe_adapter_process_msg (version=0 '\000',
data=0x57651b7d46c8 ":9\b\002\020\002\030\003\"-\n)\n'/frr-interface:lib/interface[name='lo']\020\001(", len=59,
conn=0x57651b494df0) at mgmtd/mgmt_fe_adapter.c:1285
#15 0x0000794b8d8f39b8 in mgmt_msg_procbufs (ms=0x57651b494df8, handle_msg=0x576519628ccb <mgmt_fe_adapter_process_msg>,
user=0x57651b494df0, debug=false) at lib/mgmt_msg.c:193
#16 0x0000794b8d8f48c2 in msg_conn_proc_msgs (thread=0x7fff9f7a9030) at lib/mgmt_msg.c:526
#17 0x0000794b8d958a0f in event_call (thread=0x7fff9f7a9030) at lib/event.c:2003
#18 0x0000794b8d8d9651 in frr_run (master=0x57651b3a7e10) at lib/libfrr.c:1214
--Type <RET> for more, q to quit, c to continue without paging--
#19 0x00005765196207e9 in main (argc=7, argv=0x7fff9f7a9268) at mgmtd/mgmt_main.c:271
(gdb)
This routinely happens on my machine.
The text was updated successfully, but these errors were encountered:
So it turns out it’s pretty easy to fix the crash by simply removing the assert:
/* as we have the lock no-one else should have a config txn */
assert(!mgmt_config_txn_in_progress());
This assumption is just wrong, because there can actually be another config transaction in-progress, initiated by mgmtd itself when a new daemon connects to it (Chris should now what I’m taking about).
But this actually uncovers another probably serious problem. This kind of situation can easily happen during startup, because at that moment we have daemons starting and connecting to mgmtd (first config transaction) and also the config file is being read (second config transaction). If we remove the assert, the crash would obviously disappear, but if this race condition happens, then the config reading would fail. And as you can reproduce it easily, it means that it’s not something that can’t happen in real world.
I don’t have a solution for this right now. Maybe we should reschedule the transaction if such a situation happens. But should we do that inside of mgmtd or in vty when an error is returned? How many times should we retry? It’s 2AM here so I have to go sleep now, but maybe Chris can take a look and propose something. Any ideas are welcome.
BTW just want to mention that it’s not something new, it was there from the very beginning of mgmtd, but we didn’t encounter this problem when only staticd was converted. Like a lot of other issues in mgmtd we identified and fixed recently during other daemons conversion.
This routinely happens on my machine.
The text was updated successfully, but these errors were encountered: