fix(bootstrap): wait core tables are ready before copying #183

Merged
merged 3 commits into from
Oct 21, 2024

keynslug
Contributor

@keynslug keynslug commented Oct 18, 2024

In specific circumstances mria_mnesia:copy_table/2 may fail with a {system_limit, '$mria_rlog_sync', {Node, none_active}} error, which crashes the node.

Consider the following scenario:

  1. Node N1 starts up and bootstraps Mria.
  2. Node N2 starts up and bootstraps Mria.
  3. Node N2 joins cluster consisting of node N1.
  4. Node N2 runs mria_mnesia:join_cluster/1 and starts Mria again.
  5. At the exact same time node N1 decides to restart for some reason.
  6. During bootstrap, node N2 tries to copy $mria_rlog_sync table.
  7. Mnesia sees there's nowhere to copy from and aborts the operation.
  8. Mria fails to start.

While unlikely, this can happen in practice when the operator performs unusual maintenance operations, e.g. simultaneously requests a version upgrade and scales the cluster up.


Fixes EMQX-13309.

thalesmg
thalesmg previously approved these changes Oct 18, 2024
, config = Opts
},
%% Create (or copy) the mnesia table and wait for it:
ok = create_table(MetaSpec),
ok = mria_mnesia:copy_table(?schema, Storage),
%% Ensure replicas are available before starting copy:
ok = mria_mnesia:wait_for_tables([?schema]),
Member

I assume this function does not return anything other than ok?

If that's the case, maybe add a comment here documenting what it means for the node boot sequence if this wait takes a very long time (or never returns ok).

Contributor Author

It can return {error, ...} when mnesia is stopped while the bootstrap is in progress.

This calls mria_mnesia:wait_for_tables/1, which (unlike mnesia:wait_for_tables/1) logs quite extensive diagnostic information every 30 seconds if the wait is taking too long, so this should be enough? A comment detailing why this was needed won't hurt, in addition to the commit message, I guess.
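
For illustration only, the wait-then-copy idea discussed above can be sketched against plain Mnesia. This is not Mria's implementation: the module name, `classify_wait/1`, `wait_then/3`, the 30-second timeout and the retry count are all hypothetical. The only library calls used are `mnesia:wait_for_tables/2`, which returns `ok`, `{timeout, BadTabList}` or `{error, Reason}`, and `logger:warning/2`.

```erlang
%% Hypothetical sketch, not taken from the Mria code base.
-module(wait_then_copy_sketch).
-export([classify_wait/1, wait_then/3]).

%% Map the three documented return values of mnesia:wait_for_tables/2
%% onto a decision for the boot sequence.
classify_wait(ok) -> proceed;
classify_wait({timeout, Tables}) -> {retry, Tables};
classify_wait({error, Reason}) -> {abort, Reason}.

%% Wait until Tables are accessible, then run Fun (for example the
%% table copy). On timeout, log the pending tables and retry up to
%% Retries more times; on error (e.g. Mnesia was stopped) give up.
wait_then(Tables, Retries, Fun)
  when is_list(Tables), is_integer(Retries), Retries >= 0,
       is_function(Fun, 0) ->
    %% 30000 ms is an assumed value chosen for this sketch.
    case classify_wait(mnesia:wait_for_tables(Tables, 30000)) of
        proceed ->
            Fun();
        {retry, Pending} when Retries > 0 ->
            logger:warning("still waiting for tables: ~p", [Pending]),
            wait_then(Tables, Retries - 1, Fun);
        {retry, Pending} ->
            {error, {timeout, Pending}};
        {abort, Reason} ->
            {error, Reason}
    end.
```

A caller would pass the copy step as the fun, e.g. `wait_then_copy_sketch:wait_then([some_table], 10, fun() -> ok end)`, where `some_table` is a placeholder. Returning `{error, _}` instead of crashing leaves it to the caller to decide whether a failed wait should abort the node's start-up.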

Silence "expression updates a literal" compiler lint recently introduced
in erlang/otp#8069.
@keynslug keynslug merged commit 4cab8da into emqx:main Oct 21, 2024
1 check passed
@keynslug keynslug deleted the fix/race-system-limit branch October 21, 2024 11:32