Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*() #202
************ OVERVIEW ************
After a LocationForward, while TAO is searching for a profile it can connect to, all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary() or TAO_FT_Invocation_Endpoint_Selector::select_secondary(), until the request in progress has found a usable profile.
It appears that each request, once it acquires the mutex, tries to connect to each profile of the IOGR as it was when the request arrived, and does not necessarily use the IOGR already updated by the first request. If some profiles are unreachable, the connection attempts can take a long time, and consequently all pending requests are delayed.
If a Relative RoundTrip Timeout is configured, these requests may end with a TIMEOUT even though there would have been enough time left to get a reply from the new primary.
************ ISSUE ************
I have a use case with a FT client sending many requests to a FT replicated server, and the FT primary server is unplugged from the network.
We expect all requests to be forwarded to the new primary once the switch is over, but many requests get TIMEOUT instead.
For a disconnection of 10.100.14.96 at 16:50:01Z and an RTTT=20s (/var/log/messages-20160214:Feb 12 16:50:01 systint85 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down), the failure of the TCP connection is detected after 6s (as expected, thanks to the TCP Keep-Alive we have configured):
#16:50:08.107683
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <13602> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
...
#16:50:23.041159
TAO_FT (20595|140665202775808) - Got a primary component
Then some reconnection attempts fail after 3s, in accordance with the TCP parameter tcp_retries2=3.
#16:50:23.042602
TAO (20595|140665202775808) - IIOP_Connector::begin_connection, to 10.100.14.96:11063 which should block
...
#16:50:26.481488
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <0> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
A very long time (15s here) is spent between the failure and the first reconnection attempt, which seems to be explained only by the time needed to acquire the mutex in the FTSelector.
Every request pays the 3s cost of attempting the unreachable profile, and the last ones end with a TIMEOUT.
************ FIX ************
Making a copy of the profiles and releasing the mutex immediately allows all requests to be processed at the same time: they all try to find the right profile concurrently.
This fix has been validated on the old TAO-V161; however, the relevant code seems to have been very stable since then, so it should work the same way in the latest releases.