Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

2.0.0 fix retries operations #315

Closed

Conversation

wandenberg
Copy link
Contributor

Fix retry operations when the cluster is recofigured or when the connection is reset even when authentication is on

@coveralls
Copy link

Coverage Status

Coverage increased (+0.58%) when pulling bff8f65 on wandenberg:2.0.0-fix_retries_operations into 2d92a6b on mongoid:master.

@mateusdelbianco
Copy link

This patch looks good to me. How could we fix it to work on jruby?

@coveralls
Copy link

Coverage Status

Coverage increased (+0.58%) when pulling 7942b91 on wandenberg:2.0.0-fix_retries_operations into 2d92a6b on mongoid:master.

@mateusdelbianco
Copy link

Now we're talking! :D

@matsimitsu
Copy link

Awesome! This fixes an issue we've been battling with for a long time 👍

@mateusdelbianco
Copy link

@durran any chance we can merge this request?

@dawid-sklodowski
Copy link

Somewhat related pull-request which also fixes some other things around retries and failover: #320 (perhaps we could take from both)

# Block on data to read for op_timeout seconds
begin
op_timeout = @options[:op_timeout] || timeout
ready = IO.select([socket], nil, [socket], op_timeout)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just deployed this to our staging environment and this part always raises Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.") when you use an SSL connection.

It has to do with IO.select and the way Ruby's SSL implements buffers (https://bugs.ruby-lang.org/issues/8875)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matsimitsu
can you do a small test for me?
replace the line

ready = IO.select([socket], nil, [socket], op_timeout)

by

ready = Kernel::select([socket], nil, [socket], op_timeout)

and see if works under ssl, please.
Kernel::select is the same call already in use in Connectable module, so it should fix the problem.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @wandenberg
I just tested it on staging, but it raises the same error. I added some debugging, so the method looks like this:

    def read_data(socket, length)
      puts '-----'
       puts socket.inspect
      # Block on data to read for op_timeout seconds
      begin
        op_timeout = @options[:op_timeout] || timeout
        ready = Kernel::select([socket], nil, [socket], timeout)
        puts ready.inspect
        puts (ready == nil).inspect
        if ready == nil
          raise Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.")
        end
      rescue IOError => e
        raise Errors::ConnectionFailure
      end

      # Read data from socket
      begin
        data = socket.read(length)
      rescue SystemCallError, IOError => e
        raise Errors::ConnectionFailure.new("Attempted to read #{length} bytes from the socket but an error happend #{e.message}.")
      end

      if data.length < length
        data << read_data(socket, length - data.length)
      end
      data
    end

This yields the following output:

-----
#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f3864674500>, issuer=#<OpenSSL::X509::Name:0x007f3864674488>, serial=#<OpenSSL::BN:0x007f3864674410>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="", @sync=true, @callback_state=nil, @wbuffer="">
[[#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f386467b968>, issuer=#<OpenSSL::X509::Name:0x007f386467b8f0>, serial=#<OpenSSL::BN:0x007f386467b878>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="", @sync=true, @callback_state=nil, @wbuffer="">], [], []]
false
-----
#<Moped::Connection::Socket::SSL:0x007f38646753d8 @port=27017, @host="xxx", @socket=#<TCPSocket:fd 8>, @io=#<TCPSocket:fd 8>, @context=#<OpenSSL::SSL::SSLContext:0x007f3864675388 @cert=#<OpenSSL::X509::Certificate subject=#<OpenSSL::X509::Name:0x007f386467ac70>, issuer=#<OpenSSL::X509::Name:0x007f386467abf8>, serial=#<OpenSSL::BN:0x007f386467ab80>, not_before=2014-08-25 14:47:18 UTC, not_after=2020-05-21 14:47:18 UTC>, @key=#<OpenSSL::PKey::RSA:0x007f3864675180>, @client_ca=nil, @ca_file="/etc/ssl/certs/root_ca.crt", @ca_path=nil, @timeout=nil, @verify_mode=nil, @verify_depth=nil, @renegotiation_cb=nil, @verify_callback=nil, @options=nil, @cert_store=nil, @extra_chain_cert=nil, @client_cert_cb=nil, @tmp_dh_callback=nil, @session_id_context=nil, @session_get_cb=nil, @session_new_cb=nil, @session_remove_cb=nil, @servername_cb=nil, @npn_protocols=nil, @npn_select_cb=nil>, @sync_close=true, @hostname=nil, @eof=false, @rbuffer="-\x00\x00\x00\x02nonce\x00\x11\x00\x00\x004fdcde36d3f5e32d\x00\x01ok\x00\x00\x00\x00\x00\x00\x00\xF0?\x00", @sync=true, @callback_state=nil, @wbuffer="">
nil
true

The Kernel::select returns nil, because data is in the buffer.
After checking the Connectable module i noticed that alive? returns true if Kernel::select returns nil and only checks for eof? if Kernel::select returns something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matsimitsu,

I set up a mongod with SSL and was able to reproduce the problem.
Following the instructions on Kernel::select doc changed the code to the version bellow.
Can you test on your server before I commit it, please?

  def read_data(socket, length)
      # Block on data to read for op_timeout seconds
      time_left = op_timeout = @options[:op_timeout] || timeout
      begin
        raise Errors::OperationTimeout.new("Took more than #{op_timeout} seconds to receive data.") if (time_left -= 0.1) <= 0
        data = socket.read_nonblock(length)
      rescue IO::WaitReadable
        Kernel::select([socket], nil, [socket], 0.1)
        retry
      rescue IO::WaitWritable
        Kernel::select(nil, [socket], [socket], 0.1)
        retry
      rescue SystemCallError, IOError => e
        raise Errors::ConnectionFailure.new("Attempted to read #{length} bytes from the socket but an error happend #{e.message}.")
      end

      if data.length < length
        data << read_data(socket, length - data.length)
      end
      data
    end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This works!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@dawid-sklodowski
Copy link

Tests executed against real replica-set are very good idea, especially that they are able to run on Travis.

👍

@coveralls
Copy link

Coverage Status

Coverage increased (+0.38%) when pulling 1dc8ca5 on wandenberg:2.0.0-fix_retries_operations into 2d92a6b on mongoid:master.

@jarthod
Copy link

jarthod commented Oct 1, 2014

I also hit this error in production, i'm looking forward for this fix to be merged ;)

@albertoleal
Copy link

+1. I'm getting the same issue here.

@arthurnn
Copy link
Contributor

closing this , as #328 is more up to date.

@arthurnn arthurnn closed this Nov 18, 2014
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants