-
Notifications
You must be signed in to change notification settings - Fork 151
2.0.0 fix retries operations #315
2.0.0 fix retries operations #315
Conversation
…on and the node is not a master anymore
… with authorization. after a cluster reconfiguration with auth enabled as example
This patch looks good to me. How could we fix it to work on jruby? |
Now we're talking! :D |
Awesome! This fixes an issue we've been battling with for a long time 👍 |
@durran any chance we can merge this request? |
Somewhat related pull-request which also fixes some other things around retries and failover: #320 (perhaps we could take from both) |
# Block on data to read for op_timeout seconds | ||
begin | ||
op_timeout = @options[:op_timeout] || timeout | ||
ready = IO.select([socket], nil, [socket], op_timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just deployed this to our staging environment and this part always raises Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.")
when you use an SSL connection.
It has to do with IO.select and the way Ruby's SSL implements buffers (https://bugs.ruby-lang.org/issues/8875)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @matsimitsu
can you do a small test for me?
replace the line
ready = IO.select([socket], nil, [socket], op_timeout)
by
ready = Kernel::select([socket], nil, [socket], op_timeout)
and see if works under ssl, please.
Kernel::select is the same call already in use in Connectable module, so it should fix the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @wandenberg
I just tested it on staging, but it raises the same error. I added some debugging, so the method looks like this:
def read_data(socket, length)
puts '-----'
puts socket.inspect
# Block on data to read for op_timeout seconds
begin
op_timeout = @options[:op_timeout] || timeout
ready = Kernel::select([socket], nil, [socket], timeout)
puts ready.inspect
puts (ready == nil).inspect
if ready == nil
raise Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.")
end
rescue IOError => e
raise Errors::ConnectionFailure
end
# Read data from socket
begin
data = socket.read(length)
rescue SystemCallError, IOError => e
raise Errors::ConnectionFailure.new("Attempted to read #{length} bytes from the socket but an error happend #{e.message}.")
end
if data.length < length
data << read_data(socket, length - data.length)
end
data
end
This yields the following output:
-----
#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f3864674500>, issuer=#<OpenSSL::X509::Name:0x007f3864674488>, serial=#<OpenSSL::BN:0x007f3864674410>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="", @sync=true, @callback_state=nil, @wbuffer="">
[[#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f386467b968>, issuer=#<OpenSSL::X509::Name:0x007f386467b8f0>, serial=#<OpenSSL::BN:0x007f386467b878>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="", @sync=true, @callback_state=nil, @wbuffer="">], [], []]
false
-----
#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f386467ac70>, issuer=#<OpenSSL::X509::Name:0x007f386467abf8>, serial=#<OpenSSL::BN:0x007f386467ab80>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="-\x00\x00\x00\x02nonce\x00\x11\x00\x00\x004fdcde36d3f5e32d\x00\x01ok\x00\x00\x00\x00\x00\x00\x00\xF0?\x00", @sync=true, @callback_state=nil, @wbuffer="">
nil
true
The Kernel::select
returns nil, because data is in the buffer.
After checking the Connectable
module i noticed that alive?
returns true if Kernel::select
returns nil and only checks for eof?
if Kernel::select
returns something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @matsimitsu,
I set up a mongod with SSL and was able to reproduce the problem.
Following the instructions on Kernel::select doc changed the code to the version bellow.
Can you test on your server before I commit it, please?
def read_data(socket, length)
# Block on data to read for op_timeout seconds
time_left = op_timeout = @options[:op_timeout] || timeout
begin
raise Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.") if (time_left -= 0.1) <= 0
data = socket.read_nonblock(length)
rescue IO::WaitReadable
Kernel::select([socket], nil, [socket], 0.1)
retry
rescue IO::WaitWritable
Kernel::select(nil, [socket], [socket], 0.1)
retry
rescue SystemCallError, IOError => e
raise Errors::ConnectionFailure.new("Attempted to read #{length} bytes from the socket but an error happend #{e.message}.")
end
if data.length < length
data << read_data(socket, length - data.length)
end
data
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! This works!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Tests executed against real replica-set are very good idea, especially that they are able to run on Travis. 👍 |
I also hit this error in production, i'm looking forward for this fix to be merged ;) |
+1. I'm getting the same issue here. |
closing this , as #328 is more up to date. |
Fix retry operations when the cluster is recofigured or when the connection is reset even when authentication is on