Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conflict in mgmtd on startup #15286

Merged
merged 2 commits into from
Feb 7, 2024
Merged

Conversation

idryzhov
Copy link
Contributor

@idryzhov idryzhov commented Feb 2, 2024

Fixes #15262
Probably fixes #15267

Copy link
Member

@donaldsharp donaldsharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

vtysh/vtysh_config.c Outdated Show resolved Hide resolved
vtysh/vtysh.c Show resolved Hide resolved
* will be fully ignored.
*/
while (vtysh_execute_no_pager("conf term file-lock") == CMD_WARNING_CONFIG_FAILED)
usleep(100000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that vtysh will never get the lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not during the startup. After the startup, it is possible if someone uses conf t file-lock in one vtysh instance and trying to use vtysh -b in another terminal at the same time. But I don't think we should cover this - it's up to the user to either exit from the first vtysh instance or kill the second one. I can also add some retry limit if you feel it's necessary.

@choppsv1
Copy link
Contributor

choppsv1 commented Feb 6, 2024 via email

@idryzhov
Copy link
Contributor Author

idryzhov commented Feb 6, 2024

We could print something every so often that says "Another vtysh session has the lock, will retry".

We already do:

% Can't enter config; candidate datastore locked by another session

Copy link
Contributor

@choppsv1 choppsv1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In mgmt_vty_read_configs() via mgmt_config_read_in() we are counting on this getting called prior to anything else being able to run and lock the DBs. This code is very fragile code. We are depending on this event being handled in the event loop prior to any connect from the backend being handled and gaining the lock.

It seems to me that we could handle the lock failures easily enough by returning a false from mgmt_vty_read_configs() and having mgmt_config_read_in() rerun its timer like it does if vty_mgmt_fe_enabled() returns false.

This code /is/ along a deprecated path since we are moving to unified config, but we aren't totally removing reading of per-daemon configs yet so...

DEFUNSH(VTYSH_REALLYALL, vtysh_config_terminal_file_lock,
vtysh_config_terminal_file_lock_cmd,
"configure terminal file-lock",
"configure [terminal [file-lock]]",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make file-lock a hidden command, and just leave the configure [terminal] public.

Copy link
Contributor Author

@idryzhov idryzhov Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using file-lock is not only needed for config reading, but it's also the way to work with FRR in explicit-commit mode instead of implicit-commit-after-every-command mode. However, there are two problems:

  • this explicit-commit mode works only for mgmtd-converted daemons
  • file-lock naming is confusing and doesn't explain that it enables explicit-commit mode

So maybe you're right and we should hide it until all daemons are converted to mgmtd. But also there's an option to rename file-lock to something like locked or with-commit and leave it not hidden.

Anyway, it's not really related to this fix, so can be changed in a separate PR.

vtysh/vtysh_config.c Show resolved Hide resolved
mgmtd/mgmt_txn.c Outdated
* are initializing the backend.
*/
if (mgmt_ds_lock(ds_ctx, 0) != 0) {
__log_err("Running datatore is locked");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this truly an error, or can it happen if multiple backends are trying to initialize at the same time? I'd think this could be common at system startup, especially with a large configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's not really an error, I just used the same logging as used in case of mgmt_txn_create_new error below. I'd like to keep this as-is in this PR to keep code consistent, and we can change it in both places in a separate PR.

mgmtd/mgmt_txn.c Outdated
* Lock the running datastore to prevent any changes while we
* are initializing the backend.
*/
if (mgmt_ds_lock(ds_ctx, 0) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do this prior to getting changes so we don't have to waste time creating and deleting them if another backend is currently being sync'd? This could actually be pretty expensive for very large configs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I didn't want to rearrange the order of operations in this fix. We're already doing the mgmt_txn_create_new after getting the changes, and it can fail because of the same thing - another running transaction. So putting the locking here seems reasonable. We can rearrange the order in a separate PR.

@choppsv1
Copy link
Contributor

choppsv1 commented Feb 7, 2024

We could print something every so often that says "Another vtysh session has the lock, will retry".

We already do:

% Can't enter config; candidate datastore locked by another session

That isn't printed in the while loop so not sure how that applies here.

@idryzhov
Copy link
Contributor Author

idryzhov commented Feb 7, 2024

This is printed from conf term file-lock command on every iteration of the while loop. There's no need to have duplicated messages.

@idryzhov
Copy link
Contributor Author

idryzhov commented Feb 7, 2024

In mgmt_vty_read_configs() via mgmt_config_read_in() we are counting on this getting called prior to anything else being able to run and lock the DBs. This code is very fragile code. We are depending on this event being handled in the event loop prior to any connect from the backend being handled and gaining the lock.

It seems to me that we could handle the lock failures easily enough by returning a false from mgmt_vty_read_configs() and having mgmt_config_read_in() rerun its timer like it does if vty_mgmt_fe_enabled() returns false.

This code /is/ along a deprecated path since we are moving to unified config, but we aren't totally removing reading of per-daemon configs yet so...

This is correct, but again not directly connected to this PR. I am fixing failing vtysh -b here. BTW, I think when all daemons are converted to mgmtd, the correct way to read the unified config would be to read it directly from mgmtd in mgmt_vty_read_configs instead of calling vtysh -b.

When FRR starts, after mgmtd is initialized, backend clients connect to
it and request their config. To supply the config, mgmtd creates a
configuration transaction. At the same time, `vtysh -b` tries to read
the startup config and configure mgmtd, which also creates a
configuration transaction. If these two actions happen at the exact same
time, there's a conflict between them, because only a single
configuration translaction is allowed. Because of that, vtysh fails and
the config is completely ignored.

When starting the config reading, vtysh locks candidate and running
datastores in mgmtd. This commit adds locking of running datastore when
initializing the backend client. It allows to retry locking on the vtysh
side and read the config only when the lock is aquired instead of
failing.

This change also prevents running datastore from being changed during
initialization of backend clients. This could lead to a desynchronized
state between mgmtd and backends.

Signed-off-by: Igor Ryzhov <[email protected]>
When exiting from a level below the config node, like `router rip`,
vtysh executes a resync by sending "end" and "conf term [file-lock]"
commands to all the daemons. As statet in the description comment, it's
done "in case one of the daemons is somewhere else". I don't think this
actually ever happens, but even if it is, it is a bug in a daemon that
needs to be fixed. This resync was okay before the introduction of
mgmtd, but now it unlocks and locks back the datastores during the
configuration reading process, which can lead to a failure which is
explained in the previous commit.

Signed-off-by: Igor Ryzhov <[email protected]>
@choppsv1 choppsv1 merged commit ab3d084 into FRRouting:master Feb 7, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants