Lost messages on failed PUB/MPUB on nsqd restart/connection break due to load #51
Comments
What's tricky is that sometimes it fails in the
So after further analysis of clients in other languages, it's normal for producing a message to "fail", but most libraries give you a way to get the error and retry. Here it's completely silent; that's the main point. If you know how to solve that and can give me a few hints, I'll work on it.
You're right that
This was done for simplicity and performance reasons but might not be acceptable for all use cases. And like you noticed, the retry mechanism isn't perfect because the first few messages that are sent after
I haven't really thought about how it would look to add synchronous writes (where we wait for an
I'm sorry this has caused you problems!
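To make the idea of a synchronous write concrete, here is a minimal sketch in plain Ruby. It is not nsq-ruby code: `FlakyTransport`, `write_sync`, and `PublishError` are hypothetical names, and the transport is a stand-in that fails a fixed number of times. The only NSQ fact it relies on is that nsqd answers a successful `PUB` with `OK`. The point is that the caller either gets the acknowledgement or gets an exception it can handle, instead of a silent drop.

```ruby
# Hypothetical sketch; none of these names exist in nsq-ruby.
class PublishError < StandardError; end

# Stand-in for a connection to nsqd. `failures` is the number of
# publish calls that raise before the connection "recovers".
class FlakyTransport
  attr_reader :delivered

  def initialize(failures)
    @failures = failures
    @delivered = []
  end

  # Sends the message and returns nsqd's reply ("OK" on success).
  def publish(message)
    if @failures > 0
      @failures -= 1
      raise Errno::ECONNRESET
    end
    @delivered << message
    'OK'
  end
end

# Blocks until the message is acknowledged. Retries on connection
# errors and raises PublishError once max_attempts is exhausted.
def write_sync(transport, message, max_attempts: 3, backoff: 0)
  attempts = 0
  begin
    attempts += 1
    reply = transport.publish(message)
    raise PublishError, "unexpected reply: #{reply}" unless reply == 'OK'
    reply
  rescue SystemCallError, IOError => e
    if attempts < max_attempts
      sleep(backoff)
      retry
    end
    raise PublishError, "gave up after #{attempts} attempts: #{e.class}"
  end
end
```

With this shape, something like `write_sync(transport, 'job 4')` either returns `'OK'` or raises, so the producer can decide whether to retry, log, or persist the message elsewhere. The trade-off is the one mentioned above: waiting for each acknowledgement costs throughput compared to fire-and-forget writes.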
Obviously I didn't notice the warning in the README. Sorry for my reaction; I should have been more careful. By any chance, do you have an idea of how you would implement this if you had to?
I've started a PR (work in progress) that takes errors into account when data is sent to NSQ.
Basically we were losing messages in our infrastructure; this is the result of my investigation.
I wrote a PoC with one producer writing a message every second and one consumer consuming these messages, and restarted the nsqd instance while they were running.
The result was pretty straightforward, here are the consumer logs:
Here are the logs of the producer:
We can see that job 4 was never consumed, and a short investigation showed that it was never emitted in the first place. The reason is probably in
connection.rb
So it never reaches the requeue exception; it fails at the next message with
break if data == :stop_write_loop
I've written a small patch with a 'memory' and it works correctly, but I'm sure it's not the best solution (depending on when the failure happens, sometimes the message has been handled and sometimes not; how can we check that?):
Could I have your input on this? It really needs to be fixed; losing messages like this is unacceptable.
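The patch itself is not reproduced here. As a rough illustration of the 'memory' idea only, the sketch below keeps the message being written in an instance variable until the socket write returns, so a failed write leaves it available for the next attempt. `BufferedWriter`, `FakeSocket`, and `reconnect` are hypothetical names, not nsq-ruby's; only the `:stop_write_loop` sentinel is taken from the snippet above.

```ruby
require 'thread'

# Stand-in socket that raises when asked to write `fail_on`.
class FakeSocket
  attr_reader :written

  def initialize(fail_on: nil)
    @fail_on = fail_on
    @written = []
  end

  def write(data)
    raise Errno::EPIPE if data == @fail_on
    @written << data
  end
end

# Hypothetical write loop that remembers the in-flight message.
class BufferedWriter
  def initialize(socket)
    @socket = socket
    @queue = Queue.new
    @in_flight = nil # the 'memory': message popped but not yet written
  end

  def enqueue(data)
    @queue.push(data)
  end

  def reconnect(socket)
    @socket = socket
  end

  # Returns :stopped on the sentinel, :failed if a write raised.
  # After :failed, the message that was being written is retried first.
  def write_loop
    loop do
      data = (@in_flight ||= @queue.pop)
      if data == :stop_write_loop
        @in_flight = nil
        return :stopped
      end
      @socket.write(data)
      @in_flight = nil # forget the message only after the write returned
    end
  rescue SystemCallError, IOError
    :failed
  end
end
```

This only catches failures that the write call itself reports. A write can return successfully while the data never reaches nsqd, which is exactly the "sometimes handled, sometimes not" case; telling those apart would presumably require waiting for nsqd's response to each message, at the risk of occasional duplicates on retry.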
Thanks a lot (ping @bschwartz )