Workers stuck #192

Open
d1slike opened this issue May 26, 2021 · 1 comment
d1slike commented May 26, 2021

Context

  • github.com/gocraft/work v0.5.1
  • github.com/gomodule/redigo v1.8.4
  • Several (5–6) workers in pods
  • Max concurrency = 200

Expected behavior

  • Workers process jobs; the job queue stays near zero.

Current behavior

  • All processes are alive, but none of them is processing jobs.
  • The job queue keeps growing.
  • The workers never start processing again on their own.
  • It can only be fixed by flushing the Redis DB.
  • At that moment the value of the lock is much greater than the max concurrency (workers:jobs:process:lock = 20519 > 200).

How to reproduce

  • Prepare a small app
package main

import (
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

var redisPool = &redis.Pool{
	MaxActive: 5,
	MaxIdle:   5,
	Wait:      true,
	Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	},
}

const (
	namespace   = "qwe"
	jobname     = "asd"
	concurrency = 10
)

var enqueuer = work.NewEnqueuer(namespace, redisPool)

type ctx struct {
}

// do prints a liveness message and re-enqueues the same job, so the queue never drains.
func (*ctx) do(_ *work.Job) error {
	fmt.Println("i'm alive", time.Now())
	time.Sleep(time.Millisecond * 5)
	_, err := enqueuer.Enqueue(jobname, nil)
	return err
}

func main() {
	// Seed the queue with a couple of batches of jobs.
	for i := 0; i < concurrency*2; i++ {
		_, err := enqueuer.Enqueue(jobname, nil)
		if err != nil {
			log.Fatal(err)
		}
	}
	// Start a worker pool; MaxConcurrency is tracked in Redis via the job's lock key.
	work.
		NewWorkerPool(ctx{}, concurrency, namespace, redisPool).
		JobWithOptions(jobname, work.JobOptions{MaxConcurrency: 200}, (*ctx).do).
		Start()
	// Block until interrupted (SIGKILL cannot be trapped, so only os.Interrupt is registered).
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, os.Interrupt)
	<-signalChan
}
  • go mod vendor
  • apply the following diff (so we don't have to wait long for the race between the dead pool reaper and an alive pool):
diff --git a/vendor/github.com/gocraft/work/dead_pool_reaper.go b/vendor/github.com/gocraft/work/dead_pool_reaper.go
index e930521e..4e1d4534 100644
--- a/vendor/github.com/gocraft/work/dead_pool_reaper.go
+++ b/vendor/github.com/gocraft/work/dead_pool_reaper.go
@@ -10,9 +10,9 @@ import (
 )
  
 const (
-       deadTime          = 10 * time.Second // 2 x heartbeat
-       reapPeriod        = 10 * time.Minute
-       reapJitterSecs    = 30
+       deadTime          = 1 * time.Second // 2 x heartbeat
+       reapPeriod        = 1 * time.Second
+       reapJitterSecs    = 1
        requeueKeysPerJob = 4
 )
  • run two or more instances of the program simultaneously
  • after some time, both workers stop working
  • check the lock value via redis-cli (get qwe:jobs:asd:lock)

Detailed description

Reapers

  1. A reaper goroutine checks heartbeats every 30 seconds to see whether the worker pools are still alive.
  2. If the heartbeat for a pool id is found and has not expired, the reaper moves on to the next pool id.
  3. If the heartbeat for a pool id has expired, the jobs from that pool's in-progress set are moved back to the queue.
  4. After the heartbeat is removed, or if no heartbeat was found at all, the pool id is removed from worker_pools.
  5. The lock is adjusted by the value stored in lock_info for that pool id, the entry is removed from lock_info, and if the lock ends up below zero it is reset to zero (see the sketch after this list).
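
For reference, here is a minimal Go sketch of what step 5 amounts to, written against redigo directly. In the library this logic runs inside a Lua script and is atomic; the function name and key parameters below are only illustrative.

package sketch

import "github.com/gomodule/redigo/redis"

// cleanLockForDeadPool is a rough, non-atomic rendering of the reaper's lock
// cleanup for one dead pool (step 5 above). lockKey is the <ns>:jobs:<name>:lock
// counter and lockInfoKey the corresponding lock_info hash.
func cleanLockForDeadPool(conn redis.Conn, lockKey, lockInfoKey, deadPoolID string) error {
	// How many job slots the dead pool was holding according to lock_info.
	held, err := redis.Int(conn.Do("HGET", lockInfoKey, deadPoolID))
	if err != nil && err != redis.ErrNil {
		return err
	}
	// Give those slots back by decrementing the lock...
	newLock, err := redis.Int(conn.Do("DECRBY", lockKey, held))
	if err != nil {
		return err
	}
	// ...and drop the dead pool's lock_info entry.
	if _, err := conn.Do("HDEL", lockInfoKey, deadPoolID); err != nil {
		return err
	}
	// The "crutch" discussed below: a negative lock is clamped to zero.
	if newLock < 0 {
		if _, err := conn.Do("SET", lockKey, 0); err != nil {
			return err
		}
	}
	return nil
}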

The problem

There is a race condition (a concrete walk-through follows this list):

  • race: two reapers working on two different dead pool ids
  • crutch: after adjusting by lock_info, a negative lock is reset to zero
    • the first reaper sets the lock to -N
    • the second one sets it to -N - M
    • the first reaper adjusts the lock to -M (by subtracting -N) and, since the lock is negative, sets it to 0
    • the second reaper then adjusts the lock to M
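
To make the arithmetic concrete, here is a small standalone simulation of that interleaving; the values N = 5 and M = 3 are made up for illustration.

package main

import "fmt"

// Simulates the interleaving described above. "lock" stands for the
// <ns>:jobs:<name>:lock counter in Redis, starting from 0 for simplicity.
func main() {
	const N, M = 5, 3

	lock := 0

	// The first reaper requeues its dead pool's jobs: lock becomes -N.
	lock = -N
	fmt.Println("after reaper 1 requeues:", lock) // -5

	// The second reaper requeues its dead pool's jobs: lock becomes -N - M.
	lock = -N - M
	fmt.Println("after reaper 2 requeues:", lock) // -8

	// The first reaper adjusts by subtracting -N, then clamps the negative result to zero.
	lock -= -N
	if lock < 0 {
		lock = 0
	}
	fmt.Println("after reaper 1 adjusts and clamps:", lock) // 0

	// The second reaper adjusts by subtracting -M.
	lock -= -M
	fmt.Println("after reaper 2 adjusts:", lock) // 3

	// The lock ends up at M with no jobs actually running. Each time the race
	// fires it leaks a little more, until the lock exceeds MaxConcurrency and
	// the workers stop picking up jobs (20519 > 200 in the report above).
}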

Possible Solution

  • Do not adjust the lock if it is already less than zero (see work/redis.go, line 214 at 5959e69: redis.call('set', lock, 0)). A sketch is below.
  • Remove the dead pool id from worker_pools only after adjusting the lock.
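
Here is a rough, non-atomic Go sketch of how those two points could fit together. The real change would belong in the Lua script referenced above; the function name, the key parameters, and the assumption that worker_pools is addressed as a Redis set are all illustrative.

package sketch

import "github.com/gomodule/redigo/redis"

// adjustLockForDeadPool sketches the proposed ordering: skip the adjustment
// when the lock is already negative, and remove the dead pool id from
// worker_pools only after the lock has been dealt with.
func adjustLockForDeadPool(conn redis.Conn, lockKey, lockInfoKey, workerPoolsKey, deadPoolID string) error {
	cur, err := redis.Int(conn.Do("GET", lockKey))
	if err != nil && err != redis.ErrNil {
		return err
	}
	// First point: leave an already-negative lock alone instead of
	// decrementing it further and then clamping it to zero.
	if cur >= 0 {
		held, err := redis.Int(conn.Do("HGET", lockInfoKey, deadPoolID))
		if err != nil && err != redis.ErrNil {
			return err
		}
		if _, err := conn.Do("DECRBY", lockKey, held); err != nil {
			return err
		}
	}
	if _, err := conn.Do("HDEL", lockInfoKey, deadPoolID); err != nil {
		return err
	}
	// Second point: drop the dead pool id from worker_pools only now,
	// after the lock adjustment is done.
	_, err = conn.Do("SREM", workerPoolsKey, deadPoolID)
	return err
}
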
d1slike mentioned this issue May 27, 2021
imKarthikeyanK commented:

Hi, this is very well reported. Thanks a lot.

I'm facing a similar issue as well.
I can see your PR was not merged either. Did you find any other workarounds for this?

gwthm-in added a commit to gs-engg/work that referenced this issue Dec 29, 2023
gokul1630 pushed a commit to gokul1630/work that referenced this issue Mar 29, 2024