You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This was the last mention of the binlog streamer in the logs, I assumed at this point that the streamer had stopped for some reason or another.
Trying to investigate further we used the pprof endpoint for the copydb instance and we were able to get some more information.
First we took a 30 second trace from the copydb instance. The blocking information showed that the main program thread's time was spent almost entirely on the inline verifier's PeriodicallyVerifyBinlogEvents
There wasn't much more useful information from the trace aside from the binlog streamer not appearing in it. We looked to see if anything was deadlocked or waiting to get a mutex but nothing showed up.
We went back to the pprof endpoint for copydb and looked at the goroutine stack dump which just takes a snapshot of the running goroutines. After eliminating the go routines that weren't relevant we were left with the ones related to binlogs.
Of the remaining stack dumps these 2 were of importance:
Here we can see that while the BinlogWriter was reading from the channel, the BinlogStreamer was sending the events into a nil channel. We refreshed a number of times to make sure it wasn't an anomaly.
The minutes also corresponded with the lag and run time for the copy showing that this had been happening since almost the beginning of the run.
Bug
When ferry.go creates a new BinlogWriter, it does not create the event buffer channel. That happens in BinlogWriter.Run().
Here there is a chance that BinlogStreamer.Run() runs before BinlogWriter.Run() has the chance to create the channel. This can result in the event listener for the binlog streamer (BinlogWritter.BufferBinlogEvents) trying to write binlog events into the uninitialized BinlogEventBuffer channel.
Here's an example of a similar scenario we used to reproduce the stack dump:
//Taken from https://golangbyexample.com/send-receive-nil-channel-go/package main
import (
"fmt""time"
)
funcmain() {
varchchanintgosend(ch)
ch=make(chanint)
<-chtime.Sleep(time.Second*1)
}
funcsend(chchanint) {
fmt.Println("Sending value to channnel start")
ch<-1fmt.Println("Sending value to channnel finish")
}
The resulting goroutine dump is very similar to what we saw in ghostferry.
Sending value to channnel start fatal error: all goroutines are asleep - deadlock!
goroutine 1 [chan receive]:
main.main()
/tmp/sandbox3043519188/prog.go:13 +0x90
goroutine 6 [chan send (nil chan)]:
main.send(0x0)
/tmp/sandbox3043519188/prog.go:19 +0x6c
created by main.main
/tmp/sandbox3043519188/prog.go:11 +0x7b Program exited.
Possible Solution
Currently NewBinlogWriter in ferry.go just creates a new BinlogWriter but doesn't initialize the channel
Problem
During a run of copydb we found a problem where the binlog streamer lag kept increasing and the lag was almost identical to the run time of the copy.
Tracing and Investigation
A huge thanks to @shuhaowu for helping me investigate this problem and helping to find the bug.
Initially I looked at the logs for the copydb container:
This was the last mention of the binlog streamer in the logs, I assumed at this point that the streamer had stopped for some reason or another.
Trying to investigate further we used the
pprof
endpoint for the copydb instance and we were able to get some more information.First we took a 30 second trace from the copydb instance. The blocking information showed that the main program thread's time was spent almost entirely on the inline verifier's
PeriodicallyVerifyBinlogEvents
There wasn't much more useful information from the trace aside from the binlog streamer not appearing in it. We looked to see if anything was deadlocked or waiting to get a mutex but nothing showed up.
We went back to the
pprof
endpoint for copydb and looked at the goroutine stack dump which just takes a snapshot of the running goroutines. After eliminating the go routines that weren't relevant we were left with the ones related to binlogs.Of the remaining stack dumps these 2 were of importance:
Goroutine Stack Dump
Here we can see that while the
BinlogWriter
was reading from the channel, theBinlogStreamer
was sending the events into anil
channel. We refreshed a number of times to make sure it wasn't an anomaly.The minutes also corresponded with the lag and run time for the copy showing that this had been happening since almost the beginning of the run.
Bug
When
ferry.go
creates a new BinlogWriter, it does not create the event buffer channel. That happens inBinlogWriter.Run()
.The problem is in
ghostferry/ferry.go
Lines 741 to 752 in abcda3a
Here there is a chance that
BinlogStreamer.Run()
runs beforeBinlogWriter.Run()
has the chance to create the channel. This can result in the event listener for the binlog streamer (BinlogWritter.BufferBinlogEvents
) trying to write binlog events into the uninitializedBinlogEventBuffer
channel.Here's an example of a similar scenario we used to reproduce the stack dump:
The resulting goroutine dump is very similar to what we saw in ghostferry.
Possible Solution
Currently
NewBinlogWriter
inferry.go
just creates a newBinlogWriter
but doesn't initialize the channelghostferry/ferry.go
Lines 163 to 178 in abcda3a
What could be done is that instead of immediately returning the writer, initialize the buffer channel on the writer in
ferry.go
and then return it.This would be similar to what we do in
NewBatchWriter
ghostferry/ferry.go
Lines 186 to 202 in abcda3a
The text was updated successfully, but these errors were encountered: