TestJobQueueSemaphoresFIFO fails occasionally #276

Open

CampGareth (Contributor) opened this issue Oct 10, 2018 · 1 comment

CampGareth commented Oct 10, 2018

I've noticed that TestJobQueueSemaphoresFIFO fails sometimes. Running it through stress results in the following:

$ \time -v stress ./fake-batch.test -test.run JobQueueSemaphoresFIFO
<snip>
/tmp/go-stress907143447
--- FAIL: TestJobQueueSemaphoresFIFO (0.00s)
    --- FAIL: TestJobQueueSemaphoresFIFO/paralleltest-0 (0.06s)
    	semaphore_test.go:57: i != iPrime: 19 != 18
    	semaphore_test.go:57: i != iPrime: 20 != 19
    	semaphore_test.go:57: i != iPrime: 21 != 19
    	semaphore_test.go:57: i != iPrime: 22 != 19
    	semaphore_test.go:57: i != iPrime: 23 != 19
    	semaphore_test.go:57: i != iPrime: 24 != 19
    	semaphore_test.go:57: i != iPrime: 25 != 19
    	semaphore_test.go:57: i != iPrime: 26 != 19
    	semaphore_test.go:57: i != iPrime: 27 != 19
    	semaphore_test.go:57: i != iPrime: 28 != 19
    	semaphore_test.go:57: i != iPrime: 29 != 19
    	semaphore_test.go:57: i != iPrime: 30 != 19
    	semaphore_test.go:57: i != iPrime: 31 != 19
    	semaphore_test.go:57: i != iPrime: 32 != 19
    	semaphore_test.go:57: i != iPrime: 33 != 19
    	semaphore_test.go:57: i != iPrime: 34 != 19
    	semaphore_test.go:57: i != iPrime: 35 != 19
    	semaphore_test.go:57: i != iPrime: 36 != 19
    	semaphore_test.go:57: i != iPrime: 37 != 19
    	semaphore_test.go:57: i != iPrime: 38 != 19
    	semaphore_test.go:57: i != iPrime: 39 != 19
    	semaphore_test.go:57: i != iPrime: 40 != 19
    	semaphore_test.go:57: i != iPrime: 41 != 19
    	semaphore_test.go:57: i != iPrime: 42 != 19
    	semaphore_test.go:57: i != iPrime: 43 != 19
    	semaphore_test.go:57: i != iPrime: 44 != 19
    	semaphore_test.go:57: i != iPrime: 45 != 19
    	semaphore_test.go:57: i != iPrime: 46 != 19
    	semaphore_test.go:57: i != iPrime: 47 != 19
    	semaphore_test.go:57: i != iPrime: 48 != 19
    	semaphore_test.go:57: i != iPrime: 49 != 19
    	semaphore_test.go:57: i != iPrime: 50 != 19
    	semaphore_test.go:57: i != iPrime: 51 != 19
    	semaphore_test.go:57: i != iPrime: 52 != 19
    	semaphore_test.go:57: i != iPrime: 53 != 19
    	semaphore_test.go:57: i != iPrime: 54 != 19
    	semaphore_test.go:57: i != iPrime: 55 != 19
    	semaphore_test.go:57: i != iPrime: 56 != 19
    	semaphore_test.go:57: i != iPrime: 57 != 19
    	semaphore_test.go:57: i
7730 runs so far, 455 failures

pwaller (Contributor) commented Oct 15, 2018

The spirit of the test is to check that jobs are run in the order they are scheduled.

The only relevant error message is the first one; all subsequent ones are knock-ons from the first (they might actually be fine, it's just that we can't count them any more).
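
For concreteness, a rough reconstruction of what a test in this spirit looks like is sketched below. This is an assumption, not the repository's actual test: the newJobQueueSemaphores() constructor and the queue name are invented, and it assumes the usual "sync" and "testing" imports alongside the package's own code; only the i/iPrime comparison mirrors the failure output above.

// Sketch only: newJobQueueSemaphores() is hypothetical and the queue name is
// made up; the real test in semaphore_test.go differs in detail.
func TestJobQueueSemaphoresFIFO_sketch(t *testing.T) {
	const n = 100
	s := newJobQueueSemaphores() // hypothetical constructor

	var (
		mu     sync.Mutex
		starts []int // job indices in the order the jobs actually ran
		wg     sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		i := i
		wg.Add(1)
		s.enqueue(Q("paralleltest-0"), func() {
			defer wg.Done()
			mu.Lock()
			starts = append(starts, i)
			mu.Unlock()
		})
	}
	wg.Wait()

	// i is the position a job should have run at; iPrime is the index of
	// the job observed at that position.
	for i, iPrime := range starts {
		if i != iPrime {
			t.Errorf("i != iPrime: %d != %d", i, iPrime)
		}
	}
}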

The crux of the issue is twofold:

  • This is an accelerated ageing load test, intended to tease out races and concurrency bugs.
  • The test really is racing, but the race is in the scheduler, and it should only bite under high load.

The race is here:

func (s *jobQueueSemaphores) enqueue(q Q, run func()) {
	runningChan, ok := s.running[q]
	if !ok {
		s.initQueue(q)
		runningChan = s.running[q]
	}
	go func() {
		runningChan <- struct{}{}
		defer func() { <-runningChan }()
		run()
	}()
}

The problem is that multiple calls to enqueue() can happen before the scheduler has had a chance to start the go func(). The job order is enforced by the writes to runningChan, which are FIFO, but the goroutines themselves are not started in FIFO order, so the ordering can be lost before the channel ever sees it.
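
To see why, here is a standalone illustration (unrelated to the project's code, assuming only the standard library): goroutines launched in sequence are not guaranteed to start running in that sequence.

// Standalone illustration: launch goroutines in order 0..9 and record the
// order in which they actually run. The printed slice is frequently not
// [0 1 2 ... 9], because goroutine start order is up to the scheduler.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu    sync.Mutex
		order []int
		wg    sync.WaitGroup
	)
	for i := 0; i < 10; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			order = append(order, i) // append order depends on the scheduler
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(order)
}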

A true fix would have to implement another queuing mechanism, one that doesn't rely on the scheduler for ordering in this way.

The failure rate can be pushed down by 100x with something like this:

	ready := make(chan struct{})
	go func() {
		close(ready)
		runningChan <- struct{}{}
		defer func() { <-runningChan }()

		run()
	}()
	<-ready

However, it doesn't totally fix the problem, because a goroutine can still be interrupted between the close and the runningChan <- struct{}{} write, so I still observe failures. I'm reluctant to make a change that just moves the problem around.

A flaky test sucks, so we should do something about this. However, I don't really consider this a true bug: if two 'requests to run a job' arrive close enough together that the scheduling delay matters, those events are simultaneous for all intents and purposes, so either order is valid.

It would matter if we were providing a semantic like "if you observe that a job entered the queue and then enqueue another job, the two will run in order". We could theoretically fail that guarantee under high load or in rare circumstances.

I would consider this a good intermediate exercise to fix at some point, to get your head around these concurrency issues. There are lots of potential ways of fixing it.
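
One possibility, as a hypothetical sketch (not the project's code: the field names, the drain worker, the slot capacity, and Q being a string are all assumptions made so the example is self-contained): pin the FIFO order synchronously inside enqueue() by appending to an explicit per-queue list under a mutex, and let a single worker goroutine per queue start jobs in that order, still limited by a semaphore channel.

// Hypothetical sketch, not the repository's code: an explicit per-queue FIFO
// whose order is fixed synchronously inside enqueue(), so it no longer depends
// on when the scheduler starts any goroutine. Only enqueue's signature comes
// from the snippet above; everything else is invented for illustration.
package fakebatch

import "sync"

type Q string // assumption: Q declared as a string for self-containment

type jobQueueSemaphores struct {
	mu      sync.Mutex
	pending map[Q][]func()      // jobs waiting to start, in enqueue order
	slots   map[Q]chan struct{} // per-queue semaphore limiting concurrency
	active  map[Q]bool          // whether a drain worker is running for q
}

func (s *jobQueueSemaphores) enqueue(q Q, run func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		s.pending = map[Q][]func(){}
		s.slots = map[Q]chan struct{}{}
		s.active = map[Q]bool{}
	}
	if _, ok := s.slots[q]; !ok {
		s.slots[q] = make(chan struct{}, 1) // capacity = max concurrent jobs per queue
	}
	// Appending under the lock pins the FIFO order at call time.
	s.pending[q] = append(s.pending[q], run)
	if !s.active[q] {
		s.active[q] = true
		go s.drain(q)
	}
}

// drain starts q's jobs strictly in enqueue order; jobs may still overlap,
// up to the slot capacity, once they have been started.
func (s *jobQueueSemaphores) drain(q Q) {
	for {
		s.mu.Lock()
		if len(s.pending[q]) == 0 {
			s.active[q] = false
			s.mu.Unlock()
			return
		}
		job := s.pending[q][0]
		s.pending[q] = s.pending[q][1:]
		slot := s.slots[q]
		s.mu.Unlock()

		slot <- struct{}{} // acquire a slot before starting the next job
		go func() {
			defer func() { <-slot }()
			job()
		}()
	}
}

Because the append happens before enqueue() returns, the ordering no longer depends on when the scheduler gets around to starting any goroutine, while the semaphore still limits how many jobs run at once.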

pwaller removed their assignment Oct 15, 2018