
Don't restart generation reading from the beginning on generation table query failure #40

Open
kbr- opened this issue Apr 21, 2021 · 1 comment
Labels
base scylla-cdc-base package


kbr- commented Apr 21, 2021

Currently the Master process, whenever it gets an exception (any exception), restarts reading the last generation from the beginning.

But that doesn't make much sense from a UX point of view.

Suppose that the last generation was created a week ago.
Now a transient failure (a network partition) causes the Master to fail to fetch the list of generations (which it does periodically to check if there are new generations). This causes an exception, which causes Master to create completely new workers which read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example, this could be a month or longer!

If the Master successfully created workers for the last generation, it shouldn't recreate them on the first opportunity. Let them continue their work (but monitor them for crashes etc.). Probably the only situation where Master should recreate workers would be when it failed to create all workers in the first place; in this case it should destroy the existing workers and try again.
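A minimal sketch of the behaviour proposed above, not the library's actual Master code. The names `MasterSketch`, `GenerationSource`, `WorkerPool`, and `tick` are hypothetical and exist only for illustration. The idea: a failed generation-table query leaves running workers alone, and workers are torn down and retried only when creating them failed in the first place.

```java
// Hypothetical sketch; none of these names come from scylla-cdc-java.
final class MasterSketch {
    /** Stand-in for the periodic generation table query; may fail transiently. */
    interface GenerationSource {
        long fetchLatestGenerationId() throws Exception;
    }

    /** Stand-in for whatever creates and destroys workers. */
    interface WorkerPool {
        void startWorkers(long generationId) throws Exception;
        void stopWorkers();
    }

    private final GenerationSource source;
    private final WorkerPool pool;
    // Generation whose workers were ALL created successfully, or null if none.
    private Long runningGeneration = null;

    MasterSketch(GenerationSource source, WorkerPool pool) {
        this.source = source;
        this.pool = pool;
    }

    /** One periodic check. Returns true only if workers were created in this call. */
    boolean tick() {
        long latest;
        try {
            latest = source.fetchLatestGenerationId();
        } catch (Exception e) {
            // Transient query failure: keep existing workers, retry on the next tick.
            return false;
        }
        if (runningGeneration != null) {
            // Workers already run. Moving on to a newer generation is out of
            // scope for this sketch, so nothing is recreated here.
            return false;
        }
        try {
            pool.startWorkers(latest);
            runningGeneration = latest;
            return true;
        } catch (Exception e) {
            // Creation failed part-way: destroy what exists and try again later.
            pool.stopWorkers();
            return false;
        }
    }
}
```

Monitoring workers for crashes, which the proposal also asks for, is left out to keep the sketch short.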

cc @haaawk @avelanarius

@avelanarius avelanarius added the base scylla-cdc-base package label Apr 22, 2021
Contributor

avelanarius commented Apr 22, 2021

> This causes an exception, which causes Master to create completely new workers which read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example, this could be a month or longer!

That's not entirely correct. First of all, when starting workers we trim the generation according to the table's TTL. So with the default TTL value (24h), we will only re-read 24 hours of data.
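To make the trimming concrete, here is a rough sketch of the arithmetic. It assumes the read start is simply the later of the generation's start and `now - TTL`. `GenerationTrim` and `effectiveStart` are made-up names, not the library's API.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; illustrates the TTL trimming idea only.
final class GenerationTrim {
    /**
     * Returns the timestamp reading should start from: the generation start,
     * or (now - tableTtl) when the generation started earlier than that.
     */
    static Instant effectiveStart(Instant generationStart, Instant now, Duration tableTtl) {
        Instant oldestLiveRow = now.minus(tableTtl);
        return generationStart.isBefore(oldestLiveRow) ? oldestLiveRow : generationStart;
    }
}
```

With a generation created a week ago and a 24h TTL, `effectiveStart` returns a point 24 hours before `now`, so only that last day is re-read.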

Second, even if this trimming did not happen, when workers start up, they read their previously saved offsets and start from there.

In your replicator run, you might not have seen workers start from their previously saved offsets because of the following bug: LocalTransport, which the replicator uses internally to keep the current offsets in memory, only stores offsets for the generation we are currently processing. Offsets from previous generations are removed, because we will never use them again. But after the Master failure, it restarted and erroneously went back to the first generation, for which no offsets were stored. I think that the current version of the library would avoid this mistake, because it would restart from the generation we were reading before the crash, not the first one. Offsets for that generation would be present.
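A toy model of that failure mode, assuming an in-memory store that drops offsets whenever it moves to a new generation. This is not the real LocalTransport. `InMemoryOffsets` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory offset store; keeps offsets for one generation only.
final class InMemoryOffsets {
    private long currentGeneration;
    private final Map<String, Long> offsetsByTask = new HashMap<>();

    InMemoryOffsets(long firstGeneration) {
        this.currentGeneration = firstGeneration;
    }

    /** Switching generations discards the offsets saved for the previous one. */
    void moveToGeneration(long generation) {
        if (generation != currentGeneration) {
            offsetsByTask.clear();
            currentGeneration = generation;
        }
    }

    /** Saves an offset; writes for any other generation are ignored. */
    void save(long generation, String taskId, long offset) {
        if (generation == currentGeneration) {
            offsetsByTask.put(taskId, offset);
        }
    }

    /** Empty when the generation is not the current one or nothing was saved. */
    Optional<Long> load(long generation, String taskId) {
        if (generation != currentGeneration) {
            return Optional.empty();
        }
        return Optional.ofNullable(offsetsByTask.get(taskId));
    }
}
```

In this model, a Master that restarts from the first generation gets an empty result from `load` and starts from the beginning, while restarting from the generation being read before the crash finds the saved offset.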
