Fix conflict in mgmtd on startup #15286

idryzhov · 2024-02-02T23:28:48Z

Fixes #15262
Probably fixes #15267

donaldsharp

LGTM

vtysh/vtysh_config.c

vtysh/vtysh.c

donaldsharp · 2024-02-06T16:39:03Z

vtysh/vtysh_config.c

+	 * will be fully ignored.
+	 */
+	while (vtysh_execute_no_pager("conf term file-lock") == CMD_WARNING_CONFIG_FAILED)
+		usleep(100000);


Is it possible that vtysh will never get the lock?

Not during the startup. After the startup, it is possible if someone uses conf t file-lock in one vtysh instance and trying to use vtysh -b in another terminal at the same time. But I don't think we should cover this - it's up to the user to either exit from the first vtysh instance or kill the second one. I can also add some retry limit if you feel it's necessary.

choppsv1 · 2024-02-06T19:27:53Z

Igor Ryzhov ***@***.***> writes:

Not during the startup. After the startup, it is possible if someone uses conf t file-lock in one vtysh instance and trying to use vtysh -b in another terminal at the same time. But I don't think we should cover this - it's up to the user to either exit from the first vtysh instance or kill the second one. I can also add some retry limit if you feel it's necessary.

We could print something every so often that says "Another vtysh session has the lock, will retry".

idryzhov · 2024-02-06T19:47:25Z

We could print something every so often that says "Another vtysh session has the lock, will retry".

We already do:

% Can't enter config; candidate datastore locked by another session

choppsv1

In mgmt_vty_read_configs() via mgmt_config_read_in() we are counting on this getting called prior to anything else being able to run and lock the DBs. This code is very fragile code. We are depending on this event being handled in the event loop prior to any connect from the backend being handled and gaining the lock.

It seems to me that we could handle the lock failures easily enough by returning a false from mgmt_vty_read_configs() and having mgmt_config_read_in() rerun its timer like it does if vty_mgmt_fe_enabled() returns false.

This code /is/ along a deprecated path since we are moving to unified config, but we aren't totally removing reading of per-daemon configs yet so...

choppsv1 · 2024-02-07T03:42:15Z

vtysh/vtysh.c

-DEFUNSH(VTYSH_REALLYALL, vtysh_config_terminal_file_lock,
-	vtysh_config_terminal_file_lock_cmd,
-	"configure terminal file-lock",
+	"configure [terminal [file-lock]]",


We should make file-lock a hidden command, and just leave the configure [terminal] public.

Using file-lock is not only needed for config reading, but it's also the way to work with FRR in explicit-commit mode instead of implicit-commit-after-every-command mode. However, there are two problems:

this explicit-commit mode works only for mgmtd-converted daemons

file-lock naming is confusing and doesn't explain that it enables explicit-commit mode

So maybe you're right and we should hide it until all daemons are converted to mgmtd. But also there's an option to rename file-lock to something like locked or with-commit and leave it not hidden.

Anyway, it's not really related to this fix, so can be changed in a separate PR.

vtysh/vtysh_config.c

choppsv1 · 2024-02-07T03:46:26Z

mgmtd/mgmt_txn.c

+		 * are initializing the backend.
+		 */
+		if (mgmt_ds_lock(ds_ctx, 0) != 0) {
+			__log_err("Running datatore is locked");


Is this truly an error, or can it happen if multiple backends are trying to initialize at the same time? I'd think this could be common at system startup, especially with a large configuration.

I agree, it's not really an error, I just used the same logging as used in case of mgmt_txn_create_new error below. I'd like to keep this as-is in this PR to keep code consistent, and we can change it in both places in a separate PR.

choppsv1 · 2024-02-07T03:47:41Z

mgmtd/mgmt_txn.c

+		 * Lock the running datastore to prevent any changes while we
+		 * are initializing the backend.
+		 */
+		if (mgmt_ds_lock(ds_ctx, 0) != 0) {


Why not do this prior to getting changes so we don't have to waste time creating and deleting them if another backend is currently being sync'd? This could actually be pretty expensive for very large configs.

I agree, but I didn't want to rearrange the order of operations in this fix. We're already doing the mgmt_txn_create_new after getting the changes, and it can fail because of the same thing - another running transaction. So putting the locking here seems reasonable. We can rearrange the order in a separate PR.

choppsv1 · 2024-02-07T12:00:39Z

We could print something every so often that says "Another vtysh session has the lock, will retry".

We already do:
% Can't enter config; candidate datastore locked by another session

That isn't printed in the while loop so not sure how that applies here.

idryzhov · 2024-02-07T12:24:31Z

This is printed from conf term file-lock command on every iteration of the while loop. There's no need to have duplicated messages.

idryzhov · 2024-02-07T13:15:39Z

In mgmt_vty_read_configs() via mgmt_config_read_in() we are counting on this getting called prior to anything else being able to run and lock the DBs. This code is very fragile code. We are depending on this event being handled in the event loop prior to any connect from the backend being handled and gaining the lock.

It seems to me that we could handle the lock failures easily enough by returning a false from mgmt_vty_read_configs() and having mgmt_config_read_in() rerun its timer like it does if vty_mgmt_fe_enabled() returns false.

This code /is/ along a deprecated path since we are moving to unified config, but we aren't totally removing reading of per-daemon configs yet so...

This is correct, but again not directly connected to this PR. I am fixing failing vtysh -b here. BTW, I think when all daemons are converted to mgmtd, the correct way to read the unified config would be to read it directly from mgmtd in mgmt_vty_read_configs instead of calling vtysh -b.

When FRR starts, after mgmtd is initialized, backend clients connect to it and request their config. To supply the config, mgmtd creates a configuration transaction. At the same time, `vtysh -b` tries to read the startup config and configure mgmtd, which also creates a configuration transaction. If these two actions happen at the exact same time, there's a conflict between them, because only a single configuration translaction is allowed. Because of that, vtysh fails and the config is completely ignored. When starting the config reading, vtysh locks candidate and running datastores in mgmtd. This commit adds locking of running datastore when initializing the backend client. It allows to retry locking on the vtysh side and read the config only when the lock is aquired instead of failing. This change also prevents running datastore from being changed during initialization of backend clients. This could lead to a desynchronized state between mgmtd and backends. Signed-off-by: Igor Ryzhov <[email protected]>

When exiting from a level below the config node, like `router rip`, vtysh executes a resync by sending "end" and "conf term [file-lock]" commands to all the daemons. As statet in the description comment, it's done "in case one of the daemons is somewhere else". I don't think this actually ever happens, but even if it is, it is a bug in a daemon that needs to be fixed. This resync was okay before the introduction of mgmtd, but now it unlocks and locks back the datastores during the configuration reading process, which can lead to a failure which is explained in the previous commit. Signed-off-by: Igor Ryzhov <[email protected]>

frrbot bot added bugfix mgmt FRR Management Infra vtysh labels Feb 2, 2024

github-actions bot added master size/M labels Feb 2, 2024

idryzhov mentioned this pull request Feb 2, 2024

[9.2-dev] staticd/mgmtd doesn't load static routes on OS booting in a special case with more than 11 VRFs in system and wrong networking config #15267

Closed

2 tasks

idryzhov force-pushed the fix-mgmt-startup branch from ba1c3c2 to aa7bc9e Compare February 2, 2024 23:32

ton31337 added this to the 10.0 milestone Feb 3, 2024

donaldsharp approved these changes Feb 3, 2024

View reviewed changes

choppsv1 reviewed Feb 5, 2024

View reviewed changes

vtysh/vtysh_config.c Outdated Show resolved Hide resolved

choppsv1 reviewed Feb 5, 2024

View reviewed changes

vtysh/vtysh.c Show resolved Hide resolved

idryzhov force-pushed the fix-mgmt-startup branch from aa7bc9e to b9537f0 Compare February 6, 2024 15:54

donaldsharp reviewed Feb 6, 2024

View reviewed changes

choppsv1 requested changes Feb 7, 2024

View reviewed changes

idryzhov added 2 commits February 7, 2024 18:37

idryzhov force-pushed the fix-mgmt-startup branch from b9537f0 to 2574f03 Compare February 7, 2024 16:37

choppsv1 approved these changes Feb 7, 2024

View reviewed changes

choppsv1 merged commit ab3d084 into FRRouting:master Feb 7, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix conflict in mgmtd on startup #15286

Fix conflict in mgmtd on startup #15286

idryzhov commented Feb 2, 2024

donaldsharp left a comment

donaldsharp Feb 6, 2024

idryzhov Feb 6, 2024

choppsv1 commented Feb 6, 2024 via email

idryzhov commented Feb 6, 2024

choppsv1 left a comment

choppsv1 Feb 7, 2024

idryzhov Feb 7, 2024 •

edited

Loading

choppsv1 Feb 7, 2024

idryzhov Feb 7, 2024

choppsv1 Feb 7, 2024

idryzhov Feb 7, 2024

choppsv1 commented Feb 7, 2024

idryzhov commented Feb 7, 2024

idryzhov commented Feb 7, 2024

Fix conflict in mgmtd on startup #15286

Fix conflict in mgmtd on startup #15286

Conversation

idryzhov commented Feb 2, 2024

donaldsharp left a comment

Choose a reason for hiding this comment

donaldsharp Feb 6, 2024

Choose a reason for hiding this comment

idryzhov Feb 6, 2024

Choose a reason for hiding this comment

choppsv1 commented Feb 6, 2024 via email

idryzhov commented Feb 6, 2024

choppsv1 left a comment

Choose a reason for hiding this comment

choppsv1 Feb 7, 2024

Choose a reason for hiding this comment

idryzhov Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

choppsv1 Feb 7, 2024

Choose a reason for hiding this comment

idryzhov Feb 7, 2024

Choose a reason for hiding this comment

choppsv1 Feb 7, 2024

Choose a reason for hiding this comment

idryzhov Feb 7, 2024

Choose a reason for hiding this comment

choppsv1 commented Feb 7, 2024

idryzhov commented Feb 7, 2024

idryzhov commented Feb 7, 2024

idryzhov Feb 7, 2024 •

edited

Loading