Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport fix libzmq #1584 #135

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

msune
Copy link
Contributor

@msune msune commented Aug 1, 2016

Backport fix to 4.1.X "fixed zmq assertion in signaler.cpp under ubuntu #1584"

Backport fix "fixed zmq assertion in signaler.cpp under ubuntu
\#1584"
@somdoron
Copy link
Member

somdoron commented Aug 1, 2016

Anyway to test it?

@msune
Copy link
Contributor Author

msune commented Aug 1, 2016

I am not sure.

This assertion was happening in a not-so-fast system with a very heavy load of messages. I applied (manually):
zeromq/libzmq#1584 solves the issue.

It is a pretty straight back-port. However, after leaving the system for some time, another assertion is hit (immediately on the return of that function):

(gdb) bt
#0  0x000000000242c07b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x00000000024a28d8 in abort ()
#2  0x00000000023779cb in zmq::zmq_abort (errmsg_=0x2e360a2 "ok") at /home/marc/volta/builder/libzmq/src/err.cpp:83
#3  0x0000000002378304 in zmq::mailbox_t::recv (this=0x7fffdc009da0, cmd_=0x7fffe7ffe7b0, timeout_=0) at /home/marc/volta/builder/libzmq/src/mailbox.cpp:94
#4  0x0000000002388998 in zmq::socket_base_t::process_commands (this=0x7fffdc0099e0, timeout_=0, throttle_=true) at /home/marc/volta/builder/libzmq/src/socket_base.cpp:1044
#5  0x000000000238820e in zmq::socket_base_t::send (this=0x7fffdc0099e0, msg_=0x7fffe7ffe970, flags_=1) at /home/marc/volta/builder/libzmq/src/socket_base.cpp:829
#6  0x000000000236af19 in s_sendmsg (s_=0x7fffdc0099e0, msg_=0x7fffe7ffe970, flags_=1) at /home/marc/volta/builder/libzmq/src/zmq.cpp:346

That is:

68│ int zmq::mailbox_t::recv (command_t *cmd_, int timeout_)
69│ {
70│     //  Try to get the command straight away.
71│     if (active) {
72│         if (cpipe.read (cmd_))
73│             return 0;
74│
75│         //  If there are no more commands available, switch into passive state.
76│         active = false;
77│     }
78│
79│     //  Wait for signal from the command sender.
80│     const int rc = signaler.wait (timeout_);
81│     if (rc == -1) {
82│         errno_assert (errno == EAGAIN || errno == EINTR);
83│         return -1;
84│     }
85│
86│     //  Receive the signal.
87│     signaler.recv ();
88│
89│     //  Switch into active state.
90│     active = true;
91│
92│     //  Get a command.
93│     const bool ok = cpipe.read (cmd_);
94├>    zmq_assert (ok);

So I guess fix is not complete and problems are of a bigger scope... I am not sure if moving to 4.2.X (but I was experiencing similar issues, that's why I downgraded to 4.1.6).

For the original assertion, I think EAGAIN and EINTR, at least should be treated, as read() can fail.

You can reject the PR if you wish. But please let me know if there is some fix available somwhere for this, or if I have to implement a retrial to handle these 2 error codes and contribute back.

@sourcedelica
Copy link

Did you end up successfully porting this to 4.1.x? We've run into #1583 on 4.1.x too. I tried building my app with with 4.2.2 but ran into lots of problems right off the bat. Trying to decide now whether to forge ahead with 4.2.x or try to get this working in 4.1.x.

@bluca
Copy link
Member

bluca commented Jul 10, 2017

What problems are you having with 4.2.x? It's API/ABI compatible so it should be a smooth upgrade from 4.1.x

@sourcedelica
Copy link

Hi Luca - thanks for checking with me on this.

I ran the the tests for our library that is a layer on top of ZMQ using 4.1.6 plus the patch from this PR and using 4.2.2.

I manually applied the change from this PR to 4.1.6 and our tests ran OK though I've only run them a few times.

For 4.2.2, I've found that most of the ZMQ-related tests pass, but there are a few that crash each time they run. Here are the backtraces cut off after the first custom class.

It's quite possible that the bug is in our code but is latent in 4.1.6.

#0  0x00007ffff1925a98 in raise () from /lib64/libc.so.6
#1  0x00007ffff192772a in abort () from /lib64/libc.so.6
#2  0x0000000000e90029 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff1a778a2 "Bad address")
    at src/err.cpp:87
#3  0x0000000000eb1495 in zmq::fq_t::recvpipe (this=this@entry=0x7fffdedb2fb8, 
    msg_=msg_@entry=0x7fffdedb3040, pipe_=pipe_@entry=0x7fffda5d49b8) at src/fq.cpp:92
#4  0x0000000000eb8ed2 in zmq::router_t::xhas_in (this=0x7fffdedb2b00) at src/router.cpp:385
#5  0x0000000000e9bb06 in has_in (this=0x7fffdedb2b00) at src/socket_base.cpp:1281
#6  zmq::socket_base_t::getsockopt (this=0x7fffdedb2b00, option_=option_@entry=15, 
    optval_=optval_@entry=0x7fffda5d4a54, optvallen_=optvallen_@entry=0x7fffda5d4a58)
    at src/socket_base.cpp:417
#7  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffdedaffc8, 
    events_=0x7fffdedabf40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#8  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffda5d4b80) at src/zmq.cpp:816
#9  zmq_poll (items_=0x7fffda5d4b80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#10 0x0000000000a2fb55 in zmq::poll (items_=0x7fffda5d4b80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#11 0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7fffef33f100, 
    items=0x7fffda5d4ca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126
#0  0x0000000000e9b886 in zmq::socket_base_t::process_commands (this=this@entry=0x7fffe0900f00, 
    timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1355
#1  0x0000000000e9ba6f in zmq::socket_base_t::getsockopt (this=0x7fffe0900f00, 
    option_=option_@entry=15, optval_=optval_@entry=0x7fffdac5ea54, 
    optvallen_=optvallen_@entry=0x7fffdac5ea58) at src/socket_base.cpp:409
#2  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffe08dbd28, 
    events_=0x7fffe08e7f40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#3  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffdac5eb80) at src/zmq.cpp:816
#4  zmq_poll (items_=0x7fffdac5eb80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#5  0x0000000000a2fb55 in zmq::poll (items_=0x7fffdac5eb80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#6  0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7ffff7f87380, 
    items=0x7fffdac5eca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126
#0  0x00007ffff1925a98 in raise () from /lib64/libc.so.6
#1  0x00007ffff192772a in abort () from /lib64/libc.so.6
#2  0x0000000000e90029 in zmq::zmq_abort (errmsg_=errmsg_@entry=0xf61969 "pfd.revents & POLLIN")
    at src/err.cpp:87
#3  0x0000000000e99ce9 in zmq::signaler_t::wait (this=this@entry=0x7fffd58ce568, timeout_=0)
    at src/signaler.cpp:249
#4  0x0000000000e90765 in zmq::mailbox_t::recv (this=0x7fffd58ce500, cmd_=0x7fffd8256940, 
    timeout_=<optimized out>) at src/mailbox.cpp:81
#5  0x0000000000e9b88c in zmq::socket_base_t::process_commands (this=this@entry=0x7fffe07ef200, 
    timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1355
#6  0x0000000000e9ba6f in zmq::socket_base_t::getsockopt (this=0x7fffe07ef200, 
    option_=option_@entry=15, optval_=optval_@entry=0x7fffd8256a54, 
    optvallen_=optvallen_@entry=0x7fffd8256a58) at src/socket_base.cpp:409
#7  0x0000000000e83d50 in zmq::socket_poller_t::wait (this=0x7fffd5997fc8, 
    events_=0x7fffd58d3f40, n_events_=6, timeout_=-1) at src/socket_poller.cpp:455
#8  0x0000000000e8326d in zmq_poller_poll (timeout_=<optimized out>, nitems_=<optimized out>, 
    items_=0x7fffd8256b80) at src/zmq.cpp:816
#9  zmq_poll (items_=0x7fffd8256b80, nitems_=<optimized out>, timeout_=-1) at src/zmq.cpp:866
#10 0x0000000000a2fb55 in zmq::poll (items_=0x7fffd8256b80, nitems_=6, timeout_=-1)
    at /media/sf_work/myapp/third-party/zmq/include/zmq.hpp:144
#11 0x0000000000c4a4b4 in myapp::service::common::ZmqLoop::poll (this=0x7fffd5913880, 
    items=0x7fffd8256ca0, nitems=5, timeout=-1) at ZmqLoop.cpp:126

@bluca
Copy link
Member

bluca commented Jul 11, 2017

Are those tests by any chance using/creating/closing the same socket from multiple threads?

The first abort is because the (internal) message is trying to close is invalid (most likely closed from somewhere else already), the second one has an invalid poll

@sourcedelica
Copy link

Thanks! It turned out to be a race condition that closed sockets from another code path. Due to the race it didn't manifest under 4.1.x. I've fixed it on our side and now all of our tests pass using 4.2.2. I am going to recommend to my team that we upgrade to 4.2.2 to get this patch (for #1584) plus all of the other fixes that came with 4.2.x.

@bluca
Copy link
Member

bluca commented Jul 12, 2017

Good to hear.
Incidentally, did you manage to reproduce this issue on 4.1, and can you confirm whether or not this PR fixes it?

@sourcedelica
Copy link

The problem (Resource Temporarily Unavailable in signaler.cpp) has only come up once in QA testing. We have not seen in in developer testing. Based on the assertion message we found the bug and the PR in the main libzmq repo and then this backport PR.

I applied the backport and our test suite ran fine. I was going to go ahead and upgrade to 4.2.2, but I ran into an issue that 4.2.2 does not compile with Visual Studio 8 (2005). We primarily use GCC 5.2 on Linux but have some apps using VS8 on Windows, and 4.1.x compiles with VS8. It sucks having to support VS8 but I may end up needing to go with 4.1.x anyway.

@bluca
Copy link
Member

bluca commented Jul 12, 2017

What's the problem with VS8?

@somdoron given the test done by @sourcedelica I would say it's enough for merging this, what do you think?

@sourcedelica
Copy link

Here's some of the errors building 4.2.2 using VS8 (2005):

Target Build:
    Target ZERO_CHECK:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\ZERO_CHECK.vcproj "Debug|Win
32"
    Target libzmq:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\libzmq.tmp_Debug_Win32.vcpro
j "Debug|Win32"
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : ide
ntifier 'uint32_t'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : mis
sing ';' before identifier 'zmq_msg_routing_id'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot o
pen include file: 'stdint.h': No such file or directory
    Done building target "libzmq" in project "ZeroMQ.sln" -- FAILED.
    Target RUN_TESTS:
        The project "RUN_TESTS" is not selected for building in solution configu
ration "Debug|Win32".
    Target libzmq-static:
        C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\..\..\vc\vc
packages\vcbuild.exe C:\dev\zeromq-4.2.2\build-2005\libzmq-static.tmp_Debug_Win3
2.vcproj "Debug|Win32"
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : ide
ntifier 'uint32_t'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : mis
sing ';' before identifier 'zmq_msg_routing_id'
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type speci
fier - int assumed. Note: C++ does not support default-int
        c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot o
pen include file: 'stdint.h': No such file or directory
    Done building target "libzmq-static" in project "ZeroMQ.sln" -- FAILED.
Done building target "Build" in project "ZeroMQ.sln" -- FAILED.

Done building project "ZeroMQ.sln" -- FAILED.

Build FAILED.
c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : identifier
'uint32_t'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : missing ';'
 before identifier 'zmq_msg_routing_id'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot open incl
ude file: 'stdint.h': No such file or directory
c:\dev\zeromq-4.2.2\src\zmq_draft.h(61): error C2061: syntax error : identifier
'uint32_t'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C2146: syntax error : missing ';'
 before identifier 'zmq_msg_routing_id'
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\zmq_draft.h(62): error C4430: missing type specifier - i
nt assumed. Note: C++ does not support default-int
c:\dev\zeromq-4.2.2\src\precompiled.hpp(64): fatal error C1083: Cannot open incl
ude file: 'stdint.h': No such file or directory
    0 Warning(s)
    10 Error(s)

@bluca
Copy link
Member

bluca commented Jul 12, 2017

I would suggest checking for existing issues on the libzmq repository, and if there's none opening one - sorry but I'm not a windows person so can't help too much with that

@sourcedelica
Copy link

Me neither :) I'll check in the libzmq issues. It looks like most of the errors are localized to a couple places.

Thanks a lot for your help!

@bluca
Copy link
Member

bluca commented Jul 21, 2017

@somdoron ping - you think these tests are enough for a merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants