Workers stuck #192

Open
d1slike opened this issue May 26, 2021 · 1 comment
d1slike commented May 26, 2021

Context

  • github.com/gocraft/work v0.5.1
  • github.com/gomodule/redigo v1.8.4
  • Several (5–6) workers in pods
  • Max concurrency = 200

Expected behavior

  • Workers process jobs; the job queue stays near zero.

Current behavior

  • All processes are alive, but none of them is processing jobs.
  • The job queue keeps growing.
  • The workers never start processing again on their own.
  • It can only be fixed by flushing the Redis DB.
  • At that moment the value of the lock is much greater than the max concurrency (workers:jobs:process:lock = 20519 > 200).

How to reproduce

  • Prepare a small app
package main

import (
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

var redisPool = &redis.Pool{
	MaxActive: 5,
	MaxIdle:   5,
	Wait:      true,
	Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", ":6379", redis.DialDatabase(3))
	},
}

const (
	namespace   = "qwe"
	jobname     = "asd"
	concurrency = 10
)

var enqueuer = work.NewEnqueuer(namespace, redisPool)

type ctx struct {
}

// do prints a liveness message and re-enqueues the same job, so the queue never drains.
func (*ctx) do(_ *work.Job) error {
	fmt.Println("i'm alive", time.Now())
	time.Sleep(time.Millisecond * 5)
	_, err := enqueuer.Enqueue(jobname, nil)
	return err
}

func main() {
	// Seed the queue with a couple of batches of jobs.
	for i := 0; i < concurrency*2; i++ {
		_, err := enqueuer.Enqueue(jobname, nil)
		if err != nil {
			log.Fatal(err)
		}
	}
	// Start a worker pool; MaxConcurrency is tracked in Redis via the job's lock key.
	work.
		NewWorkerPool(ctx{}, concurrency, namespace, redisPool).
		JobWithOptions(jobname, work.JobOptions{MaxConcurrency: 200}, (*ctx).do).
		Start()
	// Block until interrupted (SIGKILL cannot be trapped, so only os.Interrupt is registered).
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, os.Interrupt)
	<-signalChan
}
  • go mod vendor
  • apply the following diff (so we don't have to wait long for the race between the dead pool reaper and an alive pool):
diff --git a/vendor/github.com/gocraft/work/dead_pool_reaper.go b/vendor/github.com/gocraft/work/dead_pool_reaper.go
index e930521e..4e1d4534 100644
--- a/vendor/github.com/gocraft/work/dead_pool_reaper.go
+++ b/vendor/github.com/gocraft/work/dead_pool_reaper.go
@@ -10,9 +10,9 @@ import (
 )
  
 const (
-       deadTime          = 10 * time.Second // 2 x heartbeat
-       reapPeriod        = 10 * time.Minute
-       reapJitterSecs    = 30
+       deadTime          = 1 * time.Second // 2 x heartbeat
+       reapPeriod        = 1 * time.Second
+       reapJitterSecs    = 1
        requeueKeysPerJob = 4
 )
  • run two or more instances of the program simultaneously
  • after some time, both workers stop working
  • check the lock value via redis-cli (get qwe:jobs:asd:lock)

Detailed description

Reapers

  1. A reaper goroutine checks heartbeats every 30 seconds to see whether the worker pools are still alive.
  2. If the heartbeat for a pool id is found and has not expired, the reaper moves on to the next pool id.
  3. If the heartbeat for a pool id has expired, the jobs from that pool's in-progress set are moved back to the queue.
  4. After the heartbeat is removed, or if no heartbeat was found at all, the pool id is removed from worker_pools.
  5. The lock is adjusted by the value stored in lock_info for that pool id, the entry is removed from lock_info, and if the lock ends up below zero it is reset to zero (see the sketch after this list).
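
For reference, here is a minimal Go sketch of what step 5 amounts to, written against redigo directly. In the library this logic runs inside a Lua script and is atomic; the function name and key parameters below are only illustrative.

package sketch

import "github.com/gomodule/redigo/redis"

// cleanLockForDeadPool is a rough, non-atomic rendering of the reaper's lock
// cleanup for one dead pool (step 5 above). lockKey is the <ns>:jobs:<name>:lock
// counter and lockInfoKey the corresponding lock_info hash.
func cleanLockForDeadPool(conn redis.Conn, lockKey, lockInfoKey, deadPoolID string) error {
	// How many job slots the dead pool was holding according to lock_info.
	held, err := redis.Int(conn.Do("HGET", lockInfoKey, deadPoolID))
	if err != nil && err != redis.ErrNil {
		return err
	}
	// Give those slots back by decrementing the lock...
	newLock, err := redis.Int(conn.Do("DECRBY", lockKey, held))
	if err != nil {
		return err
	}
	// ...and drop the dead pool's lock_info entry.
	if _, err := conn.Do("HDEL", lockInfoKey, deadPoolID); err != nil {
		return err
	}
	// The "crutch" discussed below: a negative lock is clamped to zero.
	if newLock < 0 {
		if _, err := conn.Do("SET", lockKey, 0); err != nil {
			return err
		}
	}
	return nil
}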

The problem

There is a race condition (a concrete walk-through follows this list):

  • race: two reapers working on two different dead pool ids
  • crutch: after adjusting by lock_info, a negative lock is reset to zero
    • the first reaper sets the lock to -N
    • the second one sets it to -N - M
    • the first reaper adjusts the lock to -M (by subtracting -N) and, since the lock is negative, sets it to 0
    • the second reaper then adjusts the lock to M
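
To make the arithmetic concrete, here is a small standalone simulation of that interleaving; the values N = 5 and M = 3 are made up for illustration.

package main

import "fmt"

// Simulates the interleaving described above. "lock" stands for the
// <ns>:jobs:<name>:lock counter in Redis, starting from 0 for simplicity.
func main() {
	const N, M = 5, 3

	lock := 0

	// The first reaper requeues its dead pool's jobs: lock becomes -N.
	lock = -N
	fmt.Println("after reaper 1 requeues:", lock) // -5

	// The second reaper requeues its dead pool's jobs: lock becomes -N - M.
	lock = -N - M
	fmt.Println("after reaper 2 requeues:", lock) // -8

	// The first reaper adjusts by subtracting -N, then clamps the negative result to zero.
	lock -= -N
	if lock < 0 {
		lock = 0
	}
	fmt.Println("after reaper 1 adjusts and clamps:", lock) // 0

	// The second reaper adjusts by subtracting -M.
	lock -= -M
	fmt.Println("after reaper 2 adjusts:", lock) // 3

	// The lock ends up at M with no jobs actually running. Each time the race
	// fires it leaks a little more, until the lock exceeds MaxConcurrency and
	// the workers stop picking up jobs (20519 > 200 in the report above).
}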

Possible Solution

  • Do not adjust the lock if it is already less than zero (see work/redis.go, line 214 at 5959e69: redis.call('set', lock, 0)). A sketch is below.
  • Remove the dead pool id from worker_pools only after adjusting the lock.
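
Here is a rough, non-atomic Go sketch of how those two points could fit together. The real change would belong in the Lua script referenced above; the function name, the key parameters, and the assumption that worker_pools is addressed as a Redis set are all illustrative.

package sketch

import "github.com/gomodule/redigo/redis"

// adjustLockForDeadPool sketches the proposed ordering: skip the adjustment
// when the lock is already negative, and remove the dead pool id from
// worker_pools only after the lock has been dealt with.
func adjustLockForDeadPool(conn redis.Conn, lockKey, lockInfoKey, workerPoolsKey, deadPoolID string) error {
	cur, err := redis.Int(conn.Do("GET", lockKey))
	if err != nil && err != redis.ErrNil {
		return err
	}
	// First point: leave an already-negative lock alone instead of
	// decrementing it further and then clamping it to zero.
	if cur >= 0 {
		held, err := redis.Int(conn.Do("HGET", lockInfoKey, deadPoolID))
		if err != nil && err != redis.ErrNil {
			return err
		}
		if _, err := conn.Do("DECRBY", lockKey, held); err != nil {
			return err
		}
	}
	if _, err := conn.Do("HDEL", lockInfoKey, deadPoolID); err != nil {
		return err
	}
	// Second point: drop the dead pool id from worker_pools only now,
	// after the lock adjustment is done.
	_, err = conn.Do("SREM", workerPoolsKey, deadPoolID)
	return err
}
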
d1slike mentioned this issue May 27, 2021
imKarthikeyanK commented:

Hi, this is very well reported. Thanks a lot.

I'm facing a similar issue as well.
I can see your PR was not merged either. Did you find any other workarounds for this?

gwthm-in added a commit to gs-engg/work that referenced this issue Dec 29, 2023
gokul1630 pushed a commit to gokul1630/work that referenced this issue Mar 29, 2024