Relax mutex earlier in FT_Invocation_Endpoint_Selector::select_*() #202
************ OVERVIEW ************
After a LocationForward, while TAO is searching for a profile it can connect to, all new outgoing requests are blocked by a mutex in TAO_FT_Invocation_Endpoint_Selector::select_primary() or TAO_FT_Invocation_Endpoint_Selector::select_secondary(), until the request in progress has found a usable profile.
It appears that each request, once it acquires the mutex, tries to connect to each profile of the IOGR as it was when the request arrived, and does not necessarily use the IOGR already updated by the first request. If some profiles are unreachable, the connection attempts can take a long time, and consequently all pending requests are delayed.
If a Relative RoundTrip Timeout is configured, these requests may end with a TIMEOUT even though there would have been enough time left to get a reply from the new primary.
************ ISSUE ************
I have a use case with a FT client sending many requests to a FT replicated server, and the FT primary server is unplugged from the network.
We expect all requests to be forwarded to the new primary once the switch is over, but many requests get TIMEOUT instead.
For a disconnection of 10.100.14.96 at 16:50:01Z and an RTTT=20s (/var/log/messages-20160214:Feb 12 16:50:01 systint85 kernel: bnx2 0000:03:00.0: eth0: NIC Copper Link is Down), the failure of the TCP connection is detected after 6s (as expected, thanks to the TCP Keep-Alive we have configured):
#16:50:08.107683
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <13602> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
...
#16:50:23.041159
TAO_FT (20595|140665202775808) - Got a primary component
Then some reconnection attempts fail after 3s, in accordance with the TCP parameter tcp_retries2=3.
#16:50:23.042602
TAO (20595|140665202775808) - IIOP_Connector::begin_connection, to 10.100.14.96:11063 which should block
...
#16:50:26.481488
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, timeout after recv is <0> status <-1>
TAO (20595|140665202775808) - Synch_Twoway_Invocation::wait_for_reply, recovering after an error
A very long time (15s here) is spent between the failure and the first reconnection attempt, which seems to be explained only by the time needed to acquire the mutex in the FTSelector.
Every request pays the 3s cost of attempting the unreachable profile, and the last ones end with a TIMEOUT.
************ FIX ************
Making a copy of the profiles and releasing the mutex immediately allows all requests to be processed at the same time: they all try to find the right profile concurrently.
This fix has been validated on the old TAO-V161; however, the relevant code seems to have been very stable since then, so it should work the same way in the latest releases.