Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lost messages on failed PUB/MPUB on nsqd restart/connection break due to load #51

Open
Soulou opened this issue Jan 16, 2019 · 5 comments

Comments

@Soulou
Copy link
Contributor

Soulou commented Jan 16, 2019

Basically we were loosing messages in our infrastructure, this is the result of my investigation.

I did a PoC producer writing one message every second, and one consumer consuming these messages, and during the process I was restarting the nsqd instance.

The result was pretty straightforward, here are the consumer logs:

Hello 1
[nsqworker] test-test - 0ae480263344d010 - done     job 1 - 9.731e-06s
[nsqworker] test-test - 0ae480263344d011 - starting job 2
Hello 2
[nsqworker] test-test - 0ae480263344d011 - done     job 2 - 2.4468e-05s
[nsqworker] test-test - 0ae480263344d012 - starting job 3
Hello 3
[nsqworker] test-test - 0ae480263344d012 - done     job 3 - 9.773e-06s
W, [2019-01-16T19:52:35.477176 #17425]  WARN -- : [host: 127.0.0.1 port: 4150]  Died from: No data from socket
W, [2019-01-16T19:52:35.480081 #17425]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.5 seconds.
W, [2019-01-16T19:52:35.980996 #17425]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.6 seconds.
W, [2019-01-16T19:52:36.592932 #17425]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection refused - connect(2) for "127.0.0.1" port 4150. Retrying in 1.3 seconds.
[nsqworker] test-test - 0ae48055c884d000 - starting job 5
Hello 5
[nsqworker] test-test - 0ae48055c884d000 - done     job 5 - 3.4755e-05s

Here are the logs of the producer:

===== PUB test-test
�{"type":"job","payload":1}
===== PUB test-test
�{"type":"job","payload":2}
===== PUB test-test
�{"type":"job","payload":3}
===== PUB test-test
�{"type":"job","payload":4}
E, [2019-01-16T19:52:36.134865 #18336] ERROR -- : [host: 127.0.0.1 port: 4150]  Error received: E_PUB_FAILED PUB failed exiting
W, [2019-01-16T19:52:36.164201 #18336]  WARN -- : [host: 127.0.0.1 port: 4150]  Died from: No data from socket
W, [2019-01-16T19:52:36.170969 #18336]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.5 seconds.
W, [2019-01-16T19:52:36.671352 #18336]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection refused - connect(2) for "127.0.0.1" port 4150. Retrying in 0.7 seconds.
===== PUB test-test
�{"type":"job","payload":5}

We can see that the job 4 has never been consumed, and after a short investigation we got it that it has never been emitted. The reason is probably in connection.rb

    def write_loop
      data = nil
      loop do
        data = @write_queue.pop
        break if data == :stop_write_loop
        write_to_socket(data)
      end
    rescue Exception => ex
      # requeue PUB and MPUB commands
      if data =~ /^M?PUB/
        debug "Requeueing to write_queue: #{data.inspect}"
        @write_queue.push(data)
      end
      die(ex)
    end

So it never gets to the requeue exception it fails at the next message with break if data == :stop_write_loop

I've done a small patch with a 'memory' and it works correctly, but I'm sure it's not the best solution (actually according to when it's failing, sometimes the message is handled and sometimes not, how can we check that?):

    def write_loop
      previous = nil
      data = nil
      loop do
        data = @write_queue.pop
        if data == :stop_write_loop
          if previous && previous =~ /^M?PUB/
            debug "Requeueing to write_queue: #{data.inspect}"
            @write_queue.push(previous)
          end
          break
        end
        write_to_socket(data)
        previous = data
      end

Could I have your input on this, it really needs to be fixed, it's unacceptable to loose messages like this.

Thanks a lot (ping @bschwartz )

@Soulou
Copy link
Contributor Author

Soulou commented Jan 16, 2019

What's tricky is that sometimes it fails in the read_loop and sometimes in the write_loop

W, [2019-01-16T20:20:19.190486 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Died from: No data from socket                                                                                        
W, [2019-01-16T20:20:19.191380 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.5 seconds.                                                 
W, [2019-01-16T20:20:19.691925 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection refused - connect(2) for "127.0.0.1" port 4150. Retrying in 0.6 seconds.                
W, [2019-01-16T20:20:20.317710 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 1.9 seconds.                                                 

# Restart

E, [2019-01-16T20:20:26.819594 #2132] ERROR -- : [host: 127.0.0.1 port: 4150]  Error received: E_PUB_FAILED PUB failed exiting                                                                       
W, [2019-01-16T20:20:26.847915 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Died from: No data from socket                                                                                        
W, [2019-01-16T20:20:26.849932 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.5 seconds.                                                 
W, [2019-01-16T20:20:27.350496 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection refused - connect(2) for "127.0.0.1" port 4150. Retrying in 1.0 seconds.  

# Restart
              
W, [2019-01-16T20:20:32.791497 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Died from: No data from socket                                                                                        
W, [2019-01-16T20:20:32.793216 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection reset by peer. Retrying in 0.5 seconds.                                                 
W, [2019-01-16T20:20:33.293557 #2132]  WARN -- : [host: 127.0.0.1 port: 4150]  Failed to connect: Connection refused - connect(2) for "127.0.0.1" port 4150. Retrying in 1.0 seconds.  

@Soulou
Copy link
Contributor Author

Soulou commented Jan 17, 2019

So after further analysis of other clients, in other languages, it's normal to "fail" producing a message, but in most libs, there is a way to get the error and so retry. Here it's completely silent, that's the main point.

If you know how to solve that and give me a few hint, I'll work on it.

@bschwartz
Copy link
Member

You're right that nsq-ruby doesn't wait for the acknowledgement from nsqd like in some other client libraries. From the README:

Note: We don't wait for nsqd to acknowledge our writes. As a result, if the connection to nsqd fails, you can lose messages. This is acceptable for our use cases, mostly because we are sending messages to a local nsqd instance and failure is very rare.

This was done for simplicity and performance reasons but might not be acceptable for all use cases.

And like you noticed, the retry mechanism isn't perfect because the first few messages that are sent after nsqd is down might get sent without an exception being raised (see: https://github.com/wistia/nsq-ruby/blob/master/spec/lib/nsq/producer_spec.rb#L100).

I haven't really thought about how it would look to add a synchronous writes (where we wait for an OK from nsqd), but if you want to take a crack at implementing something there, I'd be happy to take a look.

I'm sorry this has caused you problems!

@Soulou
Copy link
Contributor Author

Soulou commented Jan 17, 2019

Obviously I didn't notice the warning in the README. Sorry for my reaction in this case I should have been careful. By any chance, would you have an idea about how you would implement this if you would have to do it?

@Soulou
Copy link
Contributor Author

Soulou commented Jan 17, 2019

I've started a PR (work in progress) to take into account errors when data are sent to NSQ.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants