fix(bootstrap): wait core tables are ready before copying #183

keynslug · 2024-10-18T19:56:06Z

In specific circumstances, `mria_mnesia:copy_table/2` may fail with a `{system_limit, '$mria_rlog_sync', {Node, none_active}}` error, which crashes the node.

Consider the following scenario:

1. Node `N1` starts up and bootstraps Mria.
2. Node `N2` starts up and bootstraps Mria.
3. Node `N2` joins the cluster consisting of node `N1`.
4. Node `N2` runs `mria_mnesia:join_cluster/1` and starts Mria again.
5. At the exact same time, node `N1` decides to restart for some reason.
6. During bootstrap, node `N2` tries to copy the `$mria_rlog_sync` table.
7. Mnesia sees there's nowhere to copy from and aborts the operation.
8. Mria fails to start.

Although unlikely, this can happen in practice when an operator performs unusual maintenance operations, e.g. simultaneously requesting a version upgrade and scaling the cluster up.

Fixes EMQX-13309.

zmstone · 2024-10-19T06:49:45Z

src/mria_schema.erl

                       , config = Opts
                       },
    %% Create (or copy) the mnesia table and wait for it:
    ok = create_table(MetaSpec),
-    ok = mria_mnesia:copy_table(?schema, Storage),
+    %% Ensure replicas are available before starting copy:
+    ok = mria_mnesia:wait_for_tables([?schema]),


I assume this function does not return anything other than ok?

If that's the case, maybe add a comment here documenting what it means for the node boot sequence if this wait takes a very long time (or never returns ok).

It can return {error, ...} when Mnesia is stopped while the bootstrap is in progress.

This calls mria_mnesia:wait_for_tables/1, which (unlike mnesia:wait_for_tables/1) logs fairly extensive diagnostic information every 30 seconds if the wait is taking too long, so this should be enough? A comment detailing why this was needed won't hurt, in addition to the commit message, I guess.
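
For illustration, the "wait and log periodically" behaviour discussed in this thread could look roughly like the sketch below. This is not Mria's actual implementation: the module name `wait_sketch` and the `LOG_INTERVAL` constant are assumptions made for the example, and only the standard `mnesia:wait_for_tables/2` and `logger:warning/2` calls are used.

```erlang
-module(wait_sketch).
-export([wait_for_tables/1]).

%% Hypothetical interval between diagnostic log messages, in milliseconds.
-define(LOG_INTERVAL, 30000).

%% Block until all tables are loaded. Each time the interval elapses
%% without success, log the tables still missing and keep waiting.
-spec wait_for_tables([atom()]) -> ok | {error, term()}.
wait_for_tables(Tables) ->
    case mnesia:wait_for_tables(Tables, ?LOG_INTERVAL) of
        ok ->
            ok;
        {timeout, Missing} ->
            logger:warning("Still waiting for tables: ~p", [Missing]),
            wait_for_tables(Missing);
        {error, Reason} ->
            %% E.g. Mnesia was stopped while the bootstrap was in progress.
            {error, Reason}
    end.
```

With a helper of this shape, a caller that matches `ok = wait_for_tables(...)` blocks indefinitely while replicas are unavailable and fails with a `badmatch` only if an `{error, ...}` tuple comes back.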

thalesmg previously approved these changes Oct 18, 2024

zmstone reviewed Oct 19, 2024

keynslug added 3 commits October 21, 2024 12:00

chore: ensure Erlang/OTP 27 compat

25c7fb1

Silence "expression updates a literal" compiler lint recently introduced in erlang/otp#8069.

test(bootstrap): test mria starts if nodes unavailable in bootstrap

6084346

keynslug dismissed thalesmg’s stale review via 6084346 October 21, 2024 10:01

keynslug force-pushed the fix/race-system-limit branch from 13f9607 to 6084346 October 21, 2024 10:01

keynslug requested a review from zmstone October 21, 2024 10:09

zmstone approved these changes Oct 21, 2024

keynslug merged commit 4cab8da into emqx:main Oct 21, 2024
1 check passed

keynslug deleted the fix/race-system-limit branch October 21, 2024 11:32

keynslug mentioned this pull request Oct 21, 2024

chore: upgrade to mria 0.8.10 emqx/ekka#239

Merged
