PcieBackend spurious crash during recovery #197

Open
mhier opened this issue Nov 19, 2020 · 3 comments

@mhier
Member

mhier commented Nov 19, 2020

Due to another bug, I had a server in a "recovery loop": the devices were switched to the error state via setException() and then recovered again via open(), in an endless loop. Occasionally (quite rarely, actually), a "double free or corruption (fasttop)" crash happened with the following backtrace:

#0  0x00007f6af659c438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f6af659e03a in __GI_abort () at abort.c:89
#2  0x00007f6af65de7fa in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f6af66f7f98 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f6af65e738a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7f6af66f8060 "double free or corruption (fasttop)", action=3)
    at malloc.c:5020
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3874
#5  0x00007f6af65eb58c in __GI___libc_free (mem=<optimized out>) at malloc.c:2975
#6  0x00007f6afb00092d in boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /usr/include/boost/function/function_base.hpp:389
#7  boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manager (
    op=<optimized out>, out_buffer=..., in_buffer=...) at /usr/include/boost/function/function_base.hpp:412
#8  boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manage (in_buffer=..., 
    out_buffer=..., op=<optimized out>) at /usr/include/boost/function/function_base.hpp:440
#9  0x00007f6afaffea44 in boost::detail::function::basic_vtable4<void, unsigned char, unsigned int, int*, unsigned long>::clear (this=<optimized out>, functor=...)
    at /usr/include/boost/function/function_template.hpp:510
#10 boost::function4<void, unsigned char, unsigned int, int*, unsigned long>::clear (this=0x7f688a59b330) at /usr/include/boost/function/function_template.hpp:883
#11 boost::function4<void, unsigned char, unsigned int, int*, unsigned long>::~function4 (this=0x7f688a59b330, __in_chrg=<optimized out>)
    at /usr/include/boost/function/function_template.hpp:765
#12 boost::function<void (unsigned char, unsigned int, int*, unsigned long)>::~function() (this=0x7f688a59b330, __in_chrg=<optimized out>)
    at /usr/include/boost/function/function_template.hpp:1056
#13 boost::function<void (unsigned char, unsigned int, int*, unsigned long)>::operator=<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::arg<4> > > >(boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::arg<4> > >) (f=..., this=0x1e554b0) at /usr/include/boost/function/function_template.hpp:1132
#14 ChimeraTK::PcieBackend::determineDriverAndConfigureIoctl (this=this@entry=0x1e55250)
    at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/pcie/src/PcieBackend.cc:80
#15 0x00007f6afafffd0a in ChimeraTK::PcieBackend::open (this=0x1e55250) at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/pcie/src/PcieBackend.cc:40
#16 0x00007f6afae6ba8a in ChimeraTK::LogicalNameMappingBackend::open (this=0x2952210)
    at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/LogicalNameMapping/src/LogicalNameMappingBackend.cc:46
#17 0x00007f6af860a5e2 in ChimeraTK::DeviceModule::handleException() () from /usr/lib/libChimeraTK-ApplicationCore.so.02.00xenial3
#18 0x00007f6afb4b65d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#19 0x00007f6af99a86ba in start_thread (arg=0x7f688a59c700) at pthread_create.c:333
#20 0x00007f6af666e4dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The server was the llrfctrl server at A0M, and it was stuck in the recovery loop because one ADC board was powered down through the MCH.

@mhier mhier added the bug label Nov 19, 2020
@mhier
Member Author

mhier commented Nov 25, 2020

A theory of how this bug happens:

When using the Logical Name Mapper, it may happen that the same PcieBackend instance is used in two LNM devices. In ApplicationCore, this connection is not known. Hence the two DeviceModules might attempt to recover the same PcieBackend concurrently (indirectly through the logical device).

Since open() is not considered to be thread safe, this is not allowed. On the other hand, the application has no way of knowing about this entanglement. I am not sure how best to solve this problem. Either ApplicationCore (and basically any application) has to make sure that no device is opened or recovered concurrently with any other device, or we have to change the requirement and expect open() to be thread safe.

Note: The logical name mapping backend cannot fix this. One of the usages could be direct, without an LNM backend in between, so the LNM backend cannot know whether a concurrent open() is currently in progress.
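
For illustration, here is a minimal sketch of the second option (expecting open() to be thread safe). It does not use the real PcieBackend code: `SketchBackend`, `_readFunction` and `recoverConcurrently` are hypothetical names, and `std::function` stands in for the `boost::function` member seen in the backtrace. The assumption is that the crash comes from two threads reassigning the same function object at once, so the sketch guards that assignment with a per-backend mutex.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a backend; NOT the real ChimeraTK::PcieBackend.
class SketchBackend {
 public:
  // May be called concurrently by several recovery threads.
  void open() {
    std::lock_guard<std::mutex> lock(_openMutex);
    // Reassigning a function object destroys the previously stored functor.
    // Without the lock, two threads doing this at the same time is a data race.
    _readFunction = [this](int value) { _lastValue = value; };
    ++_openCount;
    _opened = true;
  }

  bool isOpen() const {
    std::lock_guard<std::mutex> lock(_openMutex);
    return _opened;
  }

  int openCount() const {
    std::lock_guard<std::mutex> lock(_openMutex);
    return _openCount;
  }

 private:
  mutable std::mutex _openMutex;
  std::function<void(int)> _readFunction;
  int _lastValue{0};
  int _openCount{0};
  bool _opened{false};
};

// Simulates several logical devices recovering the same backend in parallel.
// Returns the number of open() calls the backend has seen.
inline int recoverConcurrently(SketchBackend& backend, int nThreads, int nIterations) {
  assert(nThreads > 0 && nIterations > 0);
  std::vector<std::thread> threads;
  for(int t = 0; t < nThreads; ++t) {
    threads.emplace_back([&backend, nIterations] {
      for(int i = 0; i < nIterations; ++i) backend.open();
    });
  }
  for(auto& th : threads) th.join();
  return backend.openCount();
}
```

With the lock in place, calling `recoverConcurrently(backend, 4, 1000)` on a fresh backend should report 4000 completed open() calls. The cost of this approach is that every backend implementation would have to provide the same guarantee.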

@killenb
Member

killenb commented Nov 25, 2020

As the problem you describe can only happen when using the LNM, we could require that the LNM is used for all devices in an application. Then we could build the protection into the LNM.

Or we leave this task to each application and put the same mechanism into ApplicationCore. The call to open() in the DeviceModule could be surrounded by a global mutex, which makes all recoveries/initialisations sequential.
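
A rough sketch of what the global-mutex variant could look like. This is not ApplicationCore code: `SketchDevice`, `OpenProbe`, `recoveryMutex`, `recoverDevice` and `maxConcurrentOpens` are all hypothetical names, and the probe exists only to show that no two open() calls overlap.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Records how many threads are inside open() at the same time.
struct OpenProbe {
  std::atomic<int> inOpen{0};
  std::atomic<int> maxInOpen{0};
};

// Hypothetical device whose open() is not thread safe by itself.
class SketchDevice {
 public:
  explicit SketchDevice(OpenProbe& probe) : _probe(probe) {}

  void open() {
    int now = ++_probe.inOpen;
    int seen = _probe.maxInOpen.load();
    // Store the highest concurrency level observed so far.
    while(now > seen && !_probe.maxInOpen.compare_exchange_weak(seen, now)) {
    }
    std::this_thread::yield(); // widen the window for overlapping calls
    --_probe.inOpen;
  }

 private:
  OpenProbe& _probe;
};

// One mutex shared by all recoveries in the application (assumed design).
inline std::mutex& recoveryMutex() {
  static std::mutex m;
  return m;
}

// What a DeviceModule-like recovery step might do: serialise every open().
inline void recoverDevice(SketchDevice& device) {
  std::lock_guard<std::mutex> lock(recoveryMutex());
  device.open();
}

// Runs one recovery thread per device and returns the highest number of
// open() calls that were in progress simultaneously.
inline int maxConcurrentOpens(std::size_t nDevices, int nIterations) {
  assert(nDevices > 0 && nIterations > 0);
  OpenProbe probe;
  std::vector<SketchDevice> devices(nDevices, SketchDevice(probe));
  std::vector<std::thread> threads;
  for(auto& device : devices) {
    threads.emplace_back([&device, nIterations] {
      for(int i = 0; i < nIterations; ++i) recoverDevice(device);
    });
  }
  for(auto& th : threads) th.join();
  return probe.maxInOpen.load();
}
```

Because every open() runs under the same lock, `maxConcurrentOpens` should never report more than 1, regardless of how the devices map onto backends. The trade-off is that a slow recovery of one device delays the recovery of all others.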

@mhier
Member Author

mhier commented Nov 25, 2020

I don't like the first option (force using LNM for all devices), since this can easily be forgotten and somehow contradicts our principle of abstraction.
