Don't restart generation reading from the beginning on generation table query failure #40

kbr- · 2021-04-21T13:12:53Z

Currently the Master process, whenever it gets an exception (any exception), restarts reading the last generation from the beginning.

But that doesn't make much sense from a UX point of view.

Suppose that the last generation was created a week ago.
Now a transient failure (for example, a network partition) causes the Master to fail to fetch the list of generations (which it does periodically to check if there are new generations). This causes an exception, which causes the Master to create completely new workers which read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example; this could be a month or longer!

If the Master successfully created workers for the last generation, it shouldn't recreate them on the first opportunity. Let them continue their work (but monitor them for crashes etc.). Probably the only situation where Master should recreate workers would be when it failed to create all workers in the first place; in this case it should destroy the existing workers and try again.
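A minimal sketch of the proposed policy, written in Python for brevity. Every name here (`Master`, `TransientError`, `poll_once`, and the three callables passed to the constructor) is hypothetical and is not the library's API. Switching to a newer generation is reduced to destroy-and-recreate, which is a simplification.

```python
class TransientError(Exception):
    """Stand-in for a temporary failure such as a network partition."""


class Master:
    def __init__(self, fetch_latest_generation, create_worker, destroy_worker):
        # All three callables are hypothetical hooks supplied by the caller.
        self.fetch_latest_generation = fetch_latest_generation
        self.create_worker = create_worker
        self.destroy_worker = destroy_worker
        self.generation_id = None
        self.workers = []

    def poll_once(self):
        try:
            generation_id, streams = self.fetch_latest_generation()
        except TransientError:
            # The generation table query failed: leave running workers alone
            # and simply try the query again on the next poll.
            return "kept"
        if generation_id == self.generation_id and self.workers:
            # Same generation, workers already running: nothing to do.
            return "kept"
        return self._start(generation_id, streams)

    def _start(self, generation_id, streams):
        # Tear down whatever was running before creating a fresh set.
        for worker in self.workers:
            self.destroy_worker(worker)
        self.workers = []
        self.generation_id = None
        created = []
        try:
            for stream in streams:
                created.append(self.create_worker(generation_id, stream))
        except Exception:
            # Not all workers could be created: destroy the partial set so
            # the next poll starts from a clean state.
            for worker in created:
                self.destroy_worker(worker)
            return "retry"
        self.workers = created
        self.generation_id = generation_id
        return "started"
```

With this shape, a failed generation query never touches running workers; only a failed worker creation leads to a destroy-and-retry.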

cc @haaawk @avelanarius

avelanarius · 2021-04-22T13:43:27Z

> This causes an exception, which causes Master to create completely new workers which read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example, this could be a month or longer!

That's not entirely correct. First of all, when starting workers we trim the generation according to the table's TTL. So with the default TTL value (24h), we will only re-read 24 hours of data.
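The trimming can be illustrated with a small Python sketch. The function name `trimmed_start` and the convention that `ttl=None` means "no TTL" are assumptions made for this example, not the library's code.

```python
from datetime import datetime, timedelta, timezone


def trimmed_start(generation_start, ttl, now):
    """Return the earliest timestamp worth reading from a generation.

    Rows older than now - ttl have already expired, so reading can begin
    at whichever is later: the generation start or now - ttl.
    """
    if ttl is None:  # assumption: None means the table has no TTL
        return generation_start
    return max(generation_start, now - ttl)


now = datetime(2021, 4, 22, 12, 0, tzinfo=timezone.utc)
week_old_generation = now - timedelta(days=7)

# With a 24h TTL, a week-old generation is read from 24 hours ago.
start = trimmed_start(week_old_generation, timedelta(hours=24), now)
print(now - start)
```

For the week-old generation from the original report, the re-read window under a 24h TTL is therefore one day, not seven.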

Second, even if this trimming did not happen, when workers start up, they read their previously saved offsets and start from there.

In your replicator run, you might not have seen workers starting from a previously saved offset because of the following bug: LocalTransport, which the replicator uses internally to store current offsets in memory, only stores offsets for the generation we are currently processing. Offsets from previous generations are removed, because we will never use them again. But after the Master failure, it restarted and erroneously went back to the first generation, for which no offsets were stored. I think that the current version of the library would avoid such a mistake, because it would restart from the generation we were reading before the crash, not the first one. Offsets for this generation would be present.
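The effect can be reproduced with a toy Python model. `InMemoryOffsets` and `resume_position` are hypothetical stand-ins written for this example; they are not LocalTransport's real interface.

```python
class InMemoryOffsets:
    """Toy offset store that keeps offsets for one generation only."""

    def __init__(self):
        self.generation = None
        self._offsets = {}

    def begin_generation(self, generation):
        # Moving to another generation discards the previous offsets.
        if generation != self.generation:
            self._offsets.clear()
            self.generation = generation

    def save(self, stream, offset):
        self._offsets[stream] = offset

    def load(self, generation, stream):
        # Offsets are only known for the generation currently tracked.
        if generation != self.generation:
            return None
        return self._offsets.get(stream)


def resume_position(store, generation, stream, generation_start):
    """Start from the saved offset if there is one, else from the start."""
    saved = store.load(generation, stream)
    return generation_start if saved is None else saved


store = InMemoryOffsets()
store.begin_generation("gen-1")
store.save("s0", 100)
store.begin_generation("gen-2")  # offsets for gen-1 are dropped here
store.save("s0", 500)

# Restarting on the generation being read before the crash resumes at 500.
good = resume_position(store, "gen-2", "s0", generation_start=0)
# Restarting on the first generation finds no offset and starts over.
bad = resume_position(store, "gen-1", "s0", generation_start=0)
```

In this model the saved offsets are intact; the restart simply asks for a generation whose offsets were already discarded.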

avelanarius added the base scylla-cdc-base package label Apr 22, 2021
