What version of Scylla or Cassandra are you using?
2022.2.6
What version of Gocql are you using?
HEAD: e38b2bc
What version of Go are you using?
Irrelevant
What did you do?
Tried to create a new connection to the cluster using a shard-aware port.
What did you expect to see?
An error message that would not confuse me.
What did you see instead?
A very confusing message that sent me investigating a completely irrelevant direction and cost multiple people more than 3 working days before we finally figured out what the problem was.
If you are having connectivity related issues please share the following additional information
Describe your Cassandra cluster
"Cassandra cluster"?! You really want to fix your GH templates ;)
please provide the following information
output of nodetool status
Can't do! Production system!
Single DC, 36 nodes, 3 racks.
Each rack has 12 nodes.
output of SELECT peer, rpc_address FROM system.peers
rebuild your application with the gocql_debug tag and post the output
Both of the above are infeasible.
Description
The error message in question is this:
xxxx/xx/xx xx:xx:xx scylla: a.b.c.d:19042 connection to shard-aware address a.b.c.d:19042 resulted in wrong shard being assigned; please check that you are not behind a NAT or AddressTranslater which changes source ports; falling back to non-shard-aware port for 5m0s
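For context on why the message talks about source ports at all: on the shard-aware port (19042), Scylla routes each incoming connection to the shard given by the client's source port modulo the node's shard count, so the driver deliberately binds local ports to reach specific shards. Here is a toy illustration of that contract (not driver code; the shard count and port numbers are made up for the example):

```go
package main

import "fmt"

// expectedShard models the shard-aware port contract: a connection arriving
// on port 19042 is served by the shard equal to the client's source port
// modulo the number of shards on the node.
func expectedShard(srcPort, nrShards int) int {
	return srcPort % nrShards
}

func main() {
	const nrShards = 8 // assumed shard count, just for the example

	// The driver binds a source port that maps to the shard it wants.
	wanted := 3
	srcPort := 50000 + wanted                     // 50003 % 8 == 3
	fmt.Println(expectedShard(srcPort, nrShards)) // 3: shards agree

	// If a NAT rewrites the source port in flight, the server-side shard
	// no longer matches what the driver asked for -- the scenario the
	// warning message describes.
	natPort := 61234
	fmt.Println(expectedShard(natPort, nrShards)) // 2, not 3: mismatch
}
```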
But NAT or an AddressTranslator is not the only possible cause here.
Given apache#1701, it is very easy to hit a ConnectTimeout, which defaults to 600ms (!!!).
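For reference, the timeout in question is ClusterConfig.ConnectTimeout; a minimal sketch of raising it (the contact point and the 5-second value are placeholders, not recommendations):

```go
package main

import (
	"log"
	"time"

	// Upstream import path; the scylladb fork is typically wired in
	// via a go.mod replace directive.
	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point, matching the redacted a.b.c.d above.
	cluster := gocql.NewCluster("a.b.c.d")

	// ConnectTimeout defaults to 600ms; an overloaded shard can easily
	// blow past that, triggering the misleading fallback described here.
	cluster.ConnectTimeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```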
As a result, if one of the shards (call it shard A) is overloaded and a TCP connection to 19042 times out because of that, the driver falls back to a "storm" connection policy and tries to connect to the non-shard-aware port (9042): https://github.com/scylladb/gocql/blob/master/scylla.go#L422
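To make the fallback concrete, here is a simplified, hypothetical sketch of the behavior the warning message describes; the real logic lives at the scylla.go link above, and hostState, its field, and the method names are all made up:

```go
package main

import (
	"fmt"
	"time"
)

// hostState is a hypothetical stand-in for the driver's per-host state.
type hostState struct {
	avoidShardAwareUntil time.Time
}

// fallbackPeriod mirrors the "5m0s" quoted in the warning message.
const fallbackPeriod = 5 * time.Minute

// onShardAwareFailure sketches the reaction described above: after a
// connection on the shard-aware port lands on the wrong shard -- or, per
// this issue, simply times out -- stop using that port for a while.
func (h *hostState) onShardAwareFailure(now time.Time) {
	h.avoidShardAwareUntil = now.Add(fallbackPeriod)
}

// portToDial picks the port for the next connection attempt.
func (h *hostState) portToDial(now time.Time) int {
	if now.Before(h.avoidShardAwareUntil) {
		return 9042 // non-shard-aware fallback
	}
	return 19042 // shard-aware port
}

func main() {
	h := &hostState{}
	now := time.Now()
	fmt.Println(h.portToDial(now)) // 19042
	h.onShardAwareFailure(now)
	fmt.Println(h.portToDial(now)) // 9042, for the next 5 minutes
}
```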
And then it gets interesting (which also took us some time to figure out, after we realized that NAT had nothing to do with it): because the driver creates most of its TCP connections asynchronously (https://github.com/scylladb/gocql/blob/master/connectionpool.go#L484), the following race may happen:
So, either fix the message or fix the race.
I'm going to file a separate GH issue about this "storm" connection-policy fallback in general.