-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix conflict in mgmtd on startup #15286
Conversation
ba1c3c2
to
aa7bc9e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
aa7bc9e
to
b9537f0
Compare
* will be fully ignored. | ||
*/ | ||
while (vtysh_execute_no_pager("conf term file-lock") == CMD_WARNING_CONFIG_FAILED) | ||
usleep(100000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that vtysh will never get the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not during the startup. After the startup, it is possible if someone uses conf t file-lock
in one vtysh
instance and trying to use vtysh -b
in another terminal at the same time. But I don't think we should cover this - it's up to the user to either exit from the first vtysh instance or kill the second one. I can also add some retry limit if you feel it's necessary.
Igor Ryzhov ***@***.***> writes:
Not during the startup. After the startup, it is possible if someone
uses conf t file-lock in one vtysh instance and trying to use vtysh
-b in another terminal at the same time. But I don't think we should
cover this - it's up to the user to either exit from the first vtysh
instance or kill the second one. I can also add some retry limit if
you feel it's necessary.
We could print something every so often that says "Another vtysh session has the lock, will retry".
|
We already do:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In mgmt_vty_read_configs()
via mgmt_config_read_in()
we are counting on this getting called prior to anything else being able to run and lock the DBs. This code is very fragile code. We are depending on this event being handled in the event loop prior to any connect from the backend being handled and gaining the lock.
It seems to me that we could handle the lock failures easily enough by returning a false from mgmt_vty_read_configs()
and having mgmt_config_read_in()
rerun its timer like it does if vty_mgmt_fe_enabled()
returns false.
This code /is/ along a deprecated path since we are moving to unified config, but we aren't totally removing reading of per-daemon configs yet so...
DEFUNSH(VTYSH_REALLYALL, vtysh_config_terminal_file_lock, | ||
vtysh_config_terminal_file_lock_cmd, | ||
"configure terminal file-lock", | ||
"configure [terminal [file-lock]]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make file-lock
a hidden command, and just leave the configure [terminal]
public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using file-lock
is not only needed for config reading, but it's also the way to work with FRR in explicit-commit mode instead of implicit-commit-after-every-command mode. However, there are two problems:
- this explicit-commit mode works only for mgmtd-converted daemons
file-lock
naming is confusing and doesn't explain that it enables explicit-commit mode
So maybe you're right and we should hide it until all daemons are converted to mgmtd. But also there's an option to rename file-lock
to something like locked
or with-commit
and leave it not hidden.
Anyway, it's not really related to this fix, so can be changed in a separate PR.
mgmtd/mgmt_txn.c
Outdated
* are initializing the backend. | ||
*/ | ||
if (mgmt_ds_lock(ds_ctx, 0) != 0) { | ||
__log_err("Running datatore is locked"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this truly an error, or can it happen if multiple backends are trying to initialize at the same time? I'd think this could be common at system startup, especially with a large configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it's not really an error, I just used the same logging as used in case of mgmt_txn_create_new
error below. I'd like to keep this as-is in this PR to keep code consistent, and we can change it in both places in a separate PR.
mgmtd/mgmt_txn.c
Outdated
* Lock the running datastore to prevent any changes while we | ||
* are initializing the backend. | ||
*/ | ||
if (mgmt_ds_lock(ds_ctx, 0) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do this prior to getting changes so we don't have to waste time creating and deleting them if another backend is currently being sync'd? This could actually be pretty expensive for very large configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I didn't want to rearrange the order of operations in this fix. We're already doing the mgmt_txn_create_new
after getting the changes, and it can fail because of the same thing - another running transaction. So putting the locking here seems reasonable. We can rearrange the order in a separate PR.
That isn't printed in the while loop so not sure how that applies here. |
This is printed from |
This is correct, but again not directly connected to this PR. I am fixing failing |
When FRR starts, after mgmtd is initialized, backend clients connect to it and request their config. To supply the config, mgmtd creates a configuration transaction. At the same time, `vtysh -b` tries to read the startup config and configure mgmtd, which also creates a configuration transaction. If these two actions happen at the exact same time, there's a conflict between them, because only a single configuration translaction is allowed. Because of that, vtysh fails and the config is completely ignored. When starting the config reading, vtysh locks candidate and running datastores in mgmtd. This commit adds locking of running datastore when initializing the backend client. It allows to retry locking on the vtysh side and read the config only when the lock is aquired instead of failing. This change also prevents running datastore from being changed during initialization of backend clients. This could lead to a desynchronized state between mgmtd and backends. Signed-off-by: Igor Ryzhov <[email protected]>
When exiting from a level below the config node, like `router rip`, vtysh executes a resync by sending "end" and "conf term [file-lock]" commands to all the daemons. As statet in the description comment, it's done "in case one of the daemons is somewhere else". I don't think this actually ever happens, but even if it is, it is a bug in a daemon that needs to be fixed. This resync was okay before the introduction of mgmtd, but now it unlocks and locks back the datastores during the configuration reading process, which can lead to a failure which is explained in the previous commit. Signed-off-by: Igor Ryzhov <[email protected]>
b9537f0
to
2574f03
Compare
Fixes #15262
Probably fixes #15267