Bad connection driver error #57
I am also getting this type of error. In my case it happens with very long-running statements (a call to apoc.refactor.mergeNodes), and I believe the connection times out, as the timeout is set to 1 minute by default. Still, after such an error I get these bad driver errors for the next queries too. I am not sure whether the server hanging under heavy calculation is also part of the problem, since I am running Neo4j locally in a Docker container on my dual-core laptop.
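For context, a long-running merge of this kind might look roughly like the sketch below. The Person label, the email grouping key, and the config map are made up for illustration, and the statement is sent through the driver's ExecNeo call; the actual query in my case differs.

```go
package merge

import bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"

// MergeDuplicates issues the kind of long-running statement described above:
// collapsing duplicate nodes with apoc.refactor.mergeNodes. The label,
// grouping key, and config map are placeholders.
func MergeDuplicates(conn bolt.Conn) error {
	query := `
		MATCH (p:Person)
		WITH p.email AS email, collect(p) AS nodes
		WHERE size(nodes) > 1
		CALL apoc.refactor.mergeNodes(nodes, {properties: "combine", mergeRels: true}) YIELD node
		RETURN count(node)`
	// On a large graph this can easily run past a one-minute connection timeout.
	_, err := conn.ExecNeo(query, nil)
	return err
}
```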
Our scenario is quite different. We're ingesting messages off a message broker and using a persistent connection pool to send a merge to our Neo4j instance. Our Neo4j instance is a VM, and our app that uses the driver runs in a Docker container in a Kubernetes pod. Most of the time it's fine and processes messages smoothly, but after long layoffs these messages tend to pop up. I've been digging into the Bolt protocol details here: https://boltprotocol.org/v1/#examples Basically it looks like at some point while issuing the query the connection has gone stale or fails in some way. Line 697 in
If it made it that far, the request has already been sent. I'm gonna turn on more logging and dig into this further.
After digging into this more, it appears that connections do become stale after a sufficient idle period. I came up with a couple of different ideas for how to deal with the issue. The first is to check whether the underlying connection can be read from before returning it from the pool. The existing code was doing a nil check on the connection before returning it, which wouldn't detect a stale connection.
This should take care of the stale-connection case. If this isn't successful I will try to downcast the connection to a lower-level connection type and check it there.
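A minimal sketch of that kind of liveness probe is below. It uses a common Go idiom (a one-byte read with an immediate deadline) rather than the zero-byte read mentioned above; isConnAlive is a hypothetical helper, not part of the driver.

```go
package pool

import (
	"net"
	"time"
)

// isConnAlive is a hypothetical check a pool could run before handing out a
// connection: a healthy idle socket returns a timeout error on the probe read,
// while a connection the server has closed typically returns io.EOF.
func isConnAlive(c net.Conn) bool {
	if c == nil {
		return false
	}
	// Set an immediate deadline so the probe read never blocks.
	if err := c.SetReadDeadline(time.Now()); err != nil {
		return false
	}
	defer c.SetReadDeadline(time.Time{}) // clear the deadline afterwards

	buf := make([]byte, 1)
	_, err := c.Read(buf)
	if nerr, ok := err.(net.Error); ok && nerr.Timeout() {
		return true // nothing to read, but the socket is still open
	}
	// io.EOF, a reset, or unexpected bytes all mean the connection is unusable.
	return false
}
```

One caveat: if the probe actually reads a byte, that byte is consumed, so this check is only safe on a connection that should be quiet between requests.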
I'm still having the same issue after this change.
I assume that reading zero bytes isn't enough to actually detect that the connection is bad, since it seems to fail on a read in consume right after that.
Yeah, this doesn't seem to fix the issue in our environment either. I believe it's a sync/cross-threading issue; I've been able to reproduce it with this simple POC: https://github.com/collisonchris/neo4j-test The docs do call out that connection objects are not thread-safe and would require synchronized access on the user side to safely share a connection across goroutines.
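For what the docs' "synchronized access" requirement could look like in practice, here is a small sketch. SafeConn is a hypothetical wrapper (not part of the driver), and it assumes the driver's Conn/Result types and ExecNeo signature.

```go
package pool

import (
	"sync"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

// SafeConn guards a single shared bolt connection with a mutex so that
// concurrent goroutines can never interleave their traffic on it.
type SafeConn struct {
	mu   sync.Mutex
	conn bolt.Conn
}

// ExecNeo serializes query execution on the underlying connection.
func (s *SafeConn) ExecNeo(query string, params map[string]interface{}) (bolt.Result, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.conn.ExecNeo(query, params)
}
```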
I would think with OpenPool() you should be OK, since you should receive a new connection on each call to OpenPool(), even across goroutines. In your example you are using one bolt connection per goroutine and closing it afterwards. Perhaps OpenPool() can return the same connection multiple times if you call it too quickly?
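To make the suggested pattern concrete, here is a sketch of one pooled connection per goroutine, never shared between goroutines. The connection string, pool size, and query are placeholders.

```go
package main

import (
	"log"
	"sync"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

func main() {
	// Placeholder connection string and pool size.
	pool, err := bolt.NewDriverPool("bolt://neo4j:password@localhost:7687", 5)
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each goroutine takes its own connection from the pool...
			conn, err := pool.OpenPool()
			if err != nil {
				log.Println(err)
				return
			}
			// ...and releases it when its unit of work is done.
			defer conn.Close()
			if _, err := conn.ExecNeo("MERGE (n:Ping {id: 1})", nil); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}
```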
Yeah, this looks like a type of unintended concurrent access of the underlying Conns; multiple goroutines are hitting the same connection and executing queries in an interleaved fashion. The log output of the simple POC I posted seems to reproduce it. Healthy cycle:
Interleaved cycle:
You'll also see successful interleaved queries at times too; I'm not sure what underlying conditions are causing the issue.
Reading over the documentation, it looks like you are meant to close the drivers from the pool. I saw there is another CloseableDriverPool which is not thread-safe. Once I stopped closing the pooled connections it stopped giving me issues, but I am also using a different server.
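A sketch of that workaround is below: only the pool itself is closed, once at shutdown, rather than the individual pooled connections after every query. It assumes a constructor along the lines of NewClosableDriverPool for the closable pool mentioned above; the exact name and semantics may differ in the driver version you're running.

```go
package main

import (
	"log"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

func run() error {
	// Assumed constructor for the closable pool; connection string and pool
	// size are placeholders.
	pool, err := bolt.NewClosableDriverPool("bolt://neo4j:password@localhost:7687", 5)
	if err != nil {
		return err
	}
	// Close the whole pool once, at shutdown, instead of closing each
	// pooled connection after use.
	defer pool.Close()

	conn, err := pool.OpenPool()
	if err != nil {
		return err
	}
	_, err = conn.ExecNeo("MERGE (n:Ping {id: 1})", nil)
	return err
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}
```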
I've been using the driver in an application for about 5 months and have had very few issues. However, recently I've noticed I've been getting bad connection errors. It always seems to pop up from the decoder read() call. The error and stack trace look like this:
Error Message
Couldn't read expected bytes for message length. Read: 0 Expected: 2. Internal Error(*errors.errorString): driver: bad connection
Stack Trace
I am running in pooled connection mode with 5 connections currently. The code below is how I am building the pool and retrieving a connection object from the pool.
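Roughly, the setup looks like the following sketch (the connection string and host are placeholders, and the real code has more around it):

```go
package main

import (
	"log"

	bolt "github.com/johnnadratowski/golang-neo4j-bolt-driver"
)

var pool bolt.DriverPool

func main() {
	var err error
	// Build a pool of 5 connections (placeholder connection string).
	pool, err = bolt.NewDriverPool("bolt://neo4j:password@neo4j-host:7687", 5)
	if err != nil {
		log.Fatal(err)
	}

	// Retrieve a connection from the pool for a unit of work and release it
	// when done.
	conn, err := pool.OpenPool()
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecNeo("MERGE (n:HealthCheck {id: 1})", nil); err != nil {
		log.Println(err)
	}
}
```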
The errors seem to happen in very short bursts and consistently come in groups of 5, always very close together (usually within the same millisecond). I'm guessing this is all 5 of my connections in the pool being closed or reset? I don't see any errors on the Neo4j server logs side about closing connections, connection reset by peer, etc.
Any ideas as to what's going on here? Are there any configuration changes I could make that might help with this?