Some of the in progress jobs cannot be restored after restarting #146

steven-zou · 2019-12-28T04:00:41Z

A restart happened while a large number of jobs were running. After that, some of the jobs held in the in-progress queue (whose key depends on the worker pool ID => return fmt.Sprintf("%s:%s:inprogress", redisKeyJobs(namespace, jobName), poolID)) were not restored.

A restart recreates the worker pools and generates new workers with new UUIDs. It seems that the dead pool reaper thread only checks the workers of the current worker pool, and the previous pool is discarded. However, some of the in-progress queues still depend on the workers of the previous pool. As a result, some of the in-progress jobs cannot be requeued.
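To make the key dependency concrete, here is a minimal sketch of how the in-progress key changes across a restart. The helper bodies are reconstructed from the format string quoted above and from my reading of the key helpers, so treat the exact prefix layout as an assumption. The namespace, job name, and pool IDs are made-up values.

```go
package main

import (
	"fmt"
	"strings"
)

// redisNamespacePrefix appends ":" to a non-empty namespace if it is missing.
// Assumption: this mirrors the library's helper; the real one may differ.
func redisNamespacePrefix(namespace string) string {
	if namespace != "" && !strings.HasSuffix(namespace, ":") {
		return namespace + ":"
	}
	return namespace
}

// redisKeyJobs builds the base key for a job's queue (assumed layout).
func redisKeyJobs(namespace, jobName string) string {
	return redisNamespacePrefix(namespace) + "jobs:" + jobName
}

// redisKeyJobsInProgress uses the format string quoted in this issue:
// the pool ID is embedded in the key of the in-progress queue.
func redisKeyJobsInProgress(namespace, poolID, jobName string) string {
	return fmt.Sprintf("%s:%s:inprogress", redisKeyJobs(namespace, jobName), poolID)
}

func main() {
	// Hypothetical pool IDs before and after a restart.
	oldKey := redisKeyJobsInProgress("myapp", "pool-before-restart", "send_email")
	newKey := redisKeyJobsInProgress("myapp", "pool-after-restart", "send_email")

	fmt.Println(oldKey)
	fmt.Println(newKey)
	fmt.Println("same queue:", oldKey == newKey)
}
```

Because the pool ID is part of the key, the new pool never reads the queue that the old pool left behind. Something else has to move those jobs back.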

The reap flow seems to be the following:

First, find the dead pools:

deadPoolIDs, err := r.findDeadPools()

In the dead pool finding process, only the pools listed under the worker pools key are read:

workerPoolsKey := redisKeyWorkerPools(r.namespace)

workerPoolIDs, err := redis.Strings(conn.Do("SMEMBERS", workerPoolsKey))
if err != nil {
	return nil, err
}

Most of the time, after the restart, the previous pool is gone, but some in-progress queues still hold in-progress jobs. Those jobs are no longer available.