TestJobQueueSemaphoresFIFO fails occasionally #276

Open

CampGareth (Contributor) opened this issue Oct 10, 2018 · 1 comment

CampGareth commented Oct 10, 2018

I've noticed that TestJobQueueSemaphoresFIFO fails sometimes. Running it through stress results in the following:

$ \time -v stress ./fake-batch.test -test.run JobQueueSemaphoresFIFO
<snip>
/tmp/go-stress907143447
--- FAIL: TestJobQueueSemaphoresFIFO (0.00s)
    --- FAIL: TestJobQueueSemaphoresFIFO/paralleltest-0 (0.06s)
    	semaphore_test.go:57: i != iPrime: 19 != 18
    	semaphore_test.go:57: i != iPrime: 20 != 19
    	semaphore_test.go:57: i != iPrime: 21 != 19
    	semaphore_test.go:57: i != iPrime: 22 != 19
    	semaphore_test.go:57: i != iPrime: 23 != 19
    	semaphore_test.go:57: i != iPrime: 24 != 19
    	semaphore_test.go:57: i != iPrime: 25 != 19
    	semaphore_test.go:57: i != iPrime: 26 != 19
    	semaphore_test.go:57: i != iPrime: 27 != 19
    	semaphore_test.go:57: i != iPrime: 28 != 19
    	semaphore_test.go:57: i != iPrime: 29 != 19
    	semaphore_test.go:57: i != iPrime: 30 != 19
    	semaphore_test.go:57: i != iPrime: 31 != 19
    	semaphore_test.go:57: i != iPrime: 32 != 19
    	semaphore_test.go:57: i != iPrime: 33 != 19
    	semaphore_test.go:57: i != iPrime: 34 != 19
    	semaphore_test.go:57: i != iPrime: 35 != 19
    	semaphore_test.go:57: i != iPrime: 36 != 19
    	semaphore_test.go:57: i != iPrime: 37 != 19
    	semaphore_test.go:57: i != iPrime: 38 != 19
    	semaphore_test.go:57: i != iPrime: 39 != 19
    	semaphore_test.go:57: i != iPrime: 40 != 19
    	semaphore_test.go:57: i != iPrime: 41 != 19
    	semaphore_test.go:57: i != iPrime: 42 != 19
    	semaphore_test.go:57: i != iPrime: 43 != 19
    	semaphore_test.go:57: i != iPrime: 44 != 19
    	semaphore_test.go:57: i != iPrime: 45 != 19
    	semaphore_test.go:57: i != iPrime: 46 != 19
    	semaphore_test.go:57: i != iPrime: 47 != 19
    	semaphore_test.go:57: i != iPrime: 48 != 19
    	semaphore_test.go:57: i != iPrime: 49 != 19
    	semaphore_test.go:57: i != iPrime: 50 != 19
    	semaphore_test.go:57: i != iPrime: 51 != 19
    	semaphore_test.go:57: i != iPrime: 52 != 19
    	semaphore_test.go:57: i != iPrime: 53 != 19
    	semaphore_test.go:57: i != iPrime: 54 != 19
    	semaphore_test.go:57: i != iPrime: 55 != 19
    	semaphore_test.go:57: i != iPrime: 56 != 19
    	semaphore_test.go:57: i != iPrime: 57 != 19
    	semaphore_test.go:57: i
7730 runs so far, 455 failures

pwaller (Contributor) commented Oct 15, 2018

The spirit of the test is to check that jobs are run in the order they are scheduled.

The only relevant error message is the first one; all subsequent ones are knock-ons from the first (they might actually be fine, it's just that we can't count them any more).
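
For concreteness, a rough reconstruction of what a test in this spirit looks like is sketched below. This is an assumption, not the repository's actual test: the newJobQueueSemaphores() constructor and the queue name are invented, and it assumes the usual "sync" and "testing" imports alongside the package's own code; only the i/iPrime comparison mirrors the failure output above.

// Sketch only: newJobQueueSemaphores() is hypothetical and the queue name is
// made up; the real test in semaphore_test.go differs in detail.
func TestJobQueueSemaphoresFIFO_sketch(t *testing.T) {
	const n = 100
	s := newJobQueueSemaphores() // hypothetical constructor

	var (
		mu     sync.Mutex
		starts []int // job indices in the order the jobs actually ran
		wg     sync.WaitGroup
	)
	for i := 0; i < n; i++ {
		i := i
		wg.Add(1)
		s.enqueue(Q("paralleltest-0"), func() {
			defer wg.Done()
			mu.Lock()
			starts = append(starts, i)
			mu.Unlock()
		})
	}
	wg.Wait()

	// i is the position a job should have run at; iPrime is the index of
	// the job observed at that position.
	for i, iPrime := range starts {
		if i != iPrime {
			t.Errorf("i != iPrime: %d != %d", i, iPrime)
		}
	}
}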

The crux of the issue is twofold:

  • This is an accelerated ageing load test, intended to tease out races and concurrency bugs.
  • The test really is racing, but the race is in the scheduler, and it should only bite under high load.

The race is here:

func (s *jobQueueSemaphores) enqueue(q Q, run func()) {
	runningChan, ok := s.running[q]
	if !ok {
		s.initQueue(q)
		runningChan = s.running[q]
	}
	go func() {
		runningChan <- struct{}{}
		defer func() { <-runningChan }()
		run()
	}()
}

The problem is that multiple calls to enqueue() can happen before the scheduler has had a chance to start the go func(). The job order is enforced by the writes to runningChan, which are FIFO, but the goroutines themselves are not started in FIFO order, so the ordering can be lost before the channel ever sees it.
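
To see why, here is a standalone illustration (unrelated to the project's code, assuming only the standard library): goroutines launched in sequence are not guaranteed to start running in that sequence.

// Standalone illustration: launch goroutines in order 0..9 and record the
// order in which they actually run. The printed slice is frequently not
// [0 1 2 ... 9], because goroutine start order is up to the scheduler.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu    sync.Mutex
		order []int
		wg    sync.WaitGroup
	)
	for i := 0; i < 10; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			order = append(order, i) // append order depends on the scheduler
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(order)
}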

A true fix would have to implement another queuing mechanism, one that doesn't rely on the scheduler for ordering in this way.

The failure rate can be pushed down by 100x with something like this:

	ready := make(chan struct{})
	go func() {
		close(ready)
		runningChan <- struct{}{}
		defer func() { <-runningChan }()

		run()
	}()
	<-ready

However, it doesn't totally fix the problem, because a goroutine can still be interrupted between the close and the runningChan <- struct{}{} write, so I still observe failures. I'm reluctant to make a change that just moves the problem around.

A flaky test sucks, so we should do something about this. However, I don't really consider this a true bug: if two 'requests to run a job' arrive close enough together that the scheduling delay matters, those events are simultaneous for all intents and purposes, so either order is valid.

It would matter if we were providing a semantic like "if you observe that a job entered the queue and then enqueue another job, the two will run in order". We could theoretically fail that guarantee under high load or in rare circumstances.

I would consider this a good intermediate exercise to fix at some point, to get your head around these concurrency issues. There are lots of potential ways of fixing it.
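
One possibility, as a hypothetical sketch (not the project's code: the field names, the drain worker, the slot capacity, and Q being a string are all assumptions made so the example is self-contained): pin the FIFO order synchronously inside enqueue() by appending to an explicit per-queue list under a mutex, and let a single worker goroutine per queue start jobs in that order, still limited by a semaphore channel.

// Hypothetical sketch, not the repository's code: an explicit per-queue FIFO
// whose order is fixed synchronously inside enqueue(), so it no longer depends
// on when the scheduler starts any goroutine. Only enqueue's signature comes
// from the snippet above; everything else is invented for illustration.
package fakebatch

import "sync"

type Q string // assumption: Q declared as a string for self-containment

type jobQueueSemaphores struct {
	mu      sync.Mutex
	pending map[Q][]func()      // jobs waiting to start, in enqueue order
	slots   map[Q]chan struct{} // per-queue semaphore limiting concurrency
	active  map[Q]bool          // whether a drain worker is running for q
}

func (s *jobQueueSemaphores) enqueue(q Q, run func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending == nil {
		s.pending = map[Q][]func(){}
		s.slots = map[Q]chan struct{}{}
		s.active = map[Q]bool{}
	}
	if _, ok := s.slots[q]; !ok {
		s.slots[q] = make(chan struct{}, 1) // capacity = max concurrent jobs per queue
	}
	// Appending under the lock pins the FIFO order at call time.
	s.pending[q] = append(s.pending[q], run)
	if !s.active[q] {
		s.active[q] = true
		go s.drain(q)
	}
}

// drain starts q's jobs strictly in enqueue order; jobs may still overlap,
// up to the slot capacity, once they have been started.
func (s *jobQueueSemaphores) drain(q Q) {
	for {
		s.mu.Lock()
		if len(s.pending[q]) == 0 {
			s.active[q] = false
			s.mu.Unlock()
			return
		}
		job := s.pending[q][0]
		s.pending[q] = s.pending[q][1:]
		slot := s.slots[q]
		s.mu.Unlock()

		slot <- struct{}{} // acquire a slot before starting the next job
		go func() {
			defer func() { <-slot }()
			job()
		}()
	}
}

Because the append happens before enqueue() returns, the ordering no longer depends on when the scheduler gets around to starting any goroutine, while the semaphore still limits how many jobs run at once.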

pwaller removed their assignment Oct 15, 2018