Currently, whenever the Master process gets an exception (any exception), it restarts reading the last generation from the beginning.
But that doesn't make much sense from a UX point of view.
Suppose that the last generation was created a week ago.
Now a transient failure (e.g. a network partition) causes the Master to fail to fetch the list of generations (which it polls periodically to check whether new generations have appeared). This causes an exception, which makes the Master create completely new workers that read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example; this could be a month or longer!
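To make the failure mode concrete, here is a minimal sketch of the problematic control flow. All names (`fetchGenerations`, `createWorkers`, etc.) are illustrative stand-ins, not the actual Master code:

```java
import java.util.List;

// Hypothetical sketch of the problematic control flow, NOT the real Master code.
class NaiveMasterLoop {
    void run() {
        while (true) {
            try {
                // Periodic poll; a network partition makes this throw.
                List<String> generations = fetchGenerations();
                String last = generations.get(generations.size() - 1);
                // Catch-all recovery path: tear everything down and start over,
                // so workers re-read the whole generation from its beginning.
                destroyWorkers();
                createWorkers(last);
                Thread.sleep(10_000);
            } catch (Exception e) {
                // ANY exception lands here, and the loop restarts from scratch.
            }
        }
    }

    List<String> fetchGenerations() { return List.of("gen-1"); } // stub
    void destroyWorkers() {}                                     // stub
    void createWorkers(String generation) {}                     // stub
}
```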
If the Master successfully created workers for the last generation, it shouldn't recreate them at the first opportunity. Let them continue their work (but monitor them for crashes etc.). Probably the only situation in which the Master should recreate workers is when it failed to create all of them in the first place; in that case it should destroy the existing workers and try again.
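A minimal sketch of the proposed behavior, assuming hypothetical `Worker`/`Master` helpers (none of these types or method names come from the library):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed supervision logic (hypothetical names): keep healthy
// workers running; only tear down and retry when creation itself was incomplete.
class SupervisingMaster {
    private final List<Worker> workers = new ArrayList<>();
    private String currentGeneration;

    void onGenerationListFetched(String lastGeneration) {
        if (lastGeneration.equals(currentGeneration) && allWorkersHealthy()) {
            return; // workers already cover this generation; leave them alone
        }
        startWorkersFor(lastGeneration);
    }

    void startWorkersFor(String generation) {
        try {
            destroyWorkers(); // only reached when we actually need a fresh set
            for (int i = 0; i < 4; i++) {
                workers.add(Worker.spawn(generation, i));
            }
            currentGeneration = generation;
        } catch (Exception e) {
            // Creation was incomplete: the one case where recreating is justified.
            destroyWorkers();
            // ...then schedule a retry instead of looping back immediately.
        }
    }

    boolean allWorkersHealthy() {
        return workers.stream().allMatch(Worker::isAlive);
    }

    void destroyWorkers() {
        workers.forEach(Worker::stop);
        workers.clear();
    }

    static class Worker {
        static Worker spawn(String generation, int id) { return new Worker(); } // stub
        boolean isAlive() { return true; }                                      // stub
        void stop() {}                                                          // stub
    }
}
```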
> This causes an exception, which makes the Master create completely new workers that read that generation from the start! So we have to process a week of work right from the beginning! A week is just an example; this could be a month or longer!
That's not entirely correct. First of all, when starting workers we trim the generation according to the table's TTL. So with the default TTL value (24h), we will only re-read 24 hours of data.
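How such trimming might be computed, as a sketch (the library's actual helper names and types may differ):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of the trimming described above: rather than reading from
// the generation's start, read from max(generationStart, now - tableTtl),
// because rows older than the TTL have expired anyway.
class GenerationTrimming {
    static Instant trimmedReadStart(Instant generationStart, Duration tableTtl) {
        Instant oldestSurvivingData = Instant.now().minus(tableTtl);
        return generationStart.isAfter(oldestSurvivingData)
                ? generationStart
                : oldestSurvivingData;
    }

    public static void main(String[] args) {
        // A generation created a week ago, with the default 24h TTL:
        Instant generationStart = Instant.now().minus(Duration.ofDays(7));
        System.out.println(trimmedReadStart(generationStart, Duration.ofHours(24)));
        // -> roughly now - 24h, so at most 24 hours of data get re-read
    }
}
```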
Second, even if this trimming did not happen, workers read their previously saved offsets when they start up and resume from there.
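In sketch form (hypothetical names, not the library's API): on startup a worker asks its transport for the last saved offset of each stream, and falls back to the (trimmed) generation start only when no offset exists.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the resume-from-offset startup path.
class ResumingWorker {
    private final Map<String, Long> savedOffsets = new ConcurrentHashMap<>();

    // Resume from the saved offset if one exists, else from the trimmed start.
    long startPositionFor(String streamId, long trimmedGenerationStart) {
        return Optional.ofNullable(savedOffsets.get(streamId))
                .orElse(trimmedGenerationStart);
    }

    void saveOffset(String streamId, long offset) {
        savedOffsets.put(streamId, offset);
    }
}
```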
In your replicator run, you might not have witnessed workers starting from previously saved offsets because of the following bug: LocalTransport, used internally by the replicator to store current offsets in memory, only stores offsets for the generation we are currently processing. Offsets from previous generations are removed, because we will never use them again. But after the Master failure, it restarted and erroneously went back to the first generation, for which no offsets were stored. I think that the current version of the library would avoid such a mistake, because it would restart from the generation we were reading before the crash, not the first one, and offsets for that generation would be present.
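The bug pattern might look roughly like this (a reconstruction for illustration, not the actual LocalTransport source):

```java
import java.util.HashMap;
import java.util.Map;

// Rough reconstruction of the bug pattern described above: offsets are keyed
// per generation and older entries are dropped, so a Master that wrongly
// rewinds to the first generation finds no offsets there and starts from scratch.
class InMemoryOffsetStore {
    // generation -> (stream -> offset); only the current generation is retained
    private final Map<String, Map<String, Long>> offsets = new HashMap<>();

    void moveToGeneration(String generation) {
        offsets.keySet().removeIf(g -> !g.equals(generation)); // old offsets gone forever
        offsets.computeIfAbsent(generation, g -> new HashMap<>());
    }

    Long getOffset(String generation, String streamId) {
        Map<String, Long> perStream = offsets.get(generation);
        return perStream == null ? null : perStream.get(streamId); // null after a wrong rewind
    }
}
```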
cc @haaawk @avelanarius