What version of Scylla or Cassandra are you using?
2022.2.6
What version of Gocql are you using?
HEAD: e38b2bc
What version of Go are you using?
Irrelevant
What did you do?
Tried to create a new connection to the cluster using a shard-aware port.
What did you expect to see?
An error message that would not confuse me.
What did you see instead?
A very confusing message that sent me investigating a completely irrelevant direction and cost multiple people more than 3 working days before we finally figured out what the problem was.
If you are having connectivity related issues please share the following additional information
Describe your Cassandra cluster
"Cassandra cluster"?! You really want to fix your GH templates ;)
please provide the following information
output of nodetool status
Can't do! Production system!
Single DC, 36 nodes, 3 racks.
Each rack has 12 nodes.
output of SELECT peer, rpc_address FROM system.peers
rebuild your application with the gocql_debug tag and post the output
Both of the above are infeasible.
Description
The error message in question is this:
xxxx/xx/xx xx:xx:xx scylla: a.b.c.d:19042 connection to shard-aware address a.b.c.d:19042 resulted in wrong shard being assigned; please check that you are not behind a NAT or AddressTranslater which changes source ports; falling back to non-shard-aware port for 5m0s
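For context on why the message talks about source ports at all: on the shard-aware port (19042), Scylla routes each incoming connection to the shard given by the client's source port modulo the node's shard count, so the driver deliberately binds local ports to reach specific shards. Here is a toy illustration of that contract (not driver code; the shard count and port numbers are made up for the example):

```go
package main

import "fmt"

// expectedShard models the shard-aware port contract: a connection arriving
// on port 19042 is served by the shard equal to the client's source port
// modulo the number of shards on the node.
func expectedShard(srcPort, nrShards int) int {
	return srcPort % nrShards
}

func main() {
	const nrShards = 8 // assumed shard count, just for the example

	// The driver binds a source port that maps to the shard it wants.
	wanted := 3
	srcPort := 50000 + wanted                     // 50003 % 8 == 3
	fmt.Println(expectedShard(srcPort, nrShards)) // 3: shards agree

	// If a NAT rewrites the source port in flight, the server-side shard
	// no longer matches what the driver asked for -- the scenario the
	// warning message describes.
	natPort := 61234
	fmt.Println(expectedShard(natPort, nrShards)) // 2, not 3: mismatch
}
```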
But NAT or an AddressTranslator is not the only possible cause here.
Given apache#1701, it is very easy to hit a ConnectTimeout, which defaults to 600ms (!!!).
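For reference, the timeout in question is ClusterConfig.ConnectTimeout; a minimal sketch of raising it (the contact point and the 5-second value are placeholders, not recommendations):

```go
package main

import (
	"log"
	"time"

	// Upstream import path; the scylladb fork is typically wired in
	// via a go.mod replace directive.
	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact point, matching the redacted a.b.c.d above.
	cluster := gocql.NewCluster("a.b.c.d")

	// ConnectTimeout defaults to 600ms; an overloaded shard can easily
	// blow past that, triggering the misleading fallback described here.
	cluster.ConnectTimeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```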
As a result, if one of the shards (call it shard A) is overloaded and a TCP connection to 19042 times out because of that, the driver falls back to a "storm" connection policy and tries to connect to the non-shard-aware port (9042): https://github.com/scylladb/gocql/blob/master/scylla.go#L422
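To make the fallback concrete, here is a simplified, hypothetical sketch of the behavior the warning message describes; the real logic lives at the scylla.go link above, and hostState, its field, and the method names are all made up:

```go
package main

import (
	"fmt"
	"time"
)

// hostState is a hypothetical stand-in for the driver's per-host state.
type hostState struct {
	avoidShardAwareUntil time.Time
}

// fallbackPeriod mirrors the "5m0s" quoted in the warning message.
const fallbackPeriod = 5 * time.Minute

// onShardAwareFailure sketches the reaction described above: after a
// connection on the shard-aware port lands on the wrong shard -- or, per
// this issue, simply times out -- stop using that port for a while.
func (h *hostState) onShardAwareFailure(now time.Time) {
	h.avoidShardAwareUntil = now.Add(fallbackPeriod)
}

// portToDial picks the port for the next connection attempt.
func (h *hostState) portToDial(now time.Time) int {
	if now.Before(h.avoidShardAwareUntil) {
		return 9042 // non-shard-aware fallback
	}
	return 19042 // shard-aware port
}

func main() {
	h := &hostState{}
	now := time.Now()
	fmt.Println(h.portToDial(now)) // 19042
	h.onShardAwareFailure(now)
	fmt.Println(h.portToDial(now)) // 9042, for the next 5 minutes
}
```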
And then it gets interesting (which also took us some time to figure out, after we realized that NAT had nothing to do with it): because the driver creates most of its TCP connections asynchronously (https://github.com/scylladb/gocql/blob/master/connectionpool.go#L484), the following race may happen:
So, either fix the message or fix the race.
I'm going to file a separate GH issue about this "storm" connection-policy fallback in general.