testRebotConnectionTimeouts is sometimes failing #176

Open
killenb opened this issue Sep 2, 2020 · 7 comments
Labels
bug
postponed: A ticket that might have been postponed from being worked on for various reasons
readyForImplementation: This ticket has been moved to the "ready for implementation" area on the MSK software group board
selected: This ticket has been selected for the MSK software group board

Comments

@killenb
Member

killenb commented Sep 2, 2020

Under high system load, testRebotConnectionTimeouts fails on Jenkins, especially in release builds.

Running 3 test cases...
unknown location(0): fatal error in "testWriteTimeout": std::exception: Bad file descriptor
/scratch/source/tests/executables_src/testRebotConnectionTimeouts.cpp(79): last checkpoint

*** 1 failure detected in test suite "RebotConnectionTimeoutTest"

My previous suspicion that "std::exception: Bad file descriptor" is coming from Jenkins being under too much load and causing random tests to fail seems to be wrong. This exception is coming from the Rebot backend and is triggered by a race condition in the test.

@mhier mhier added the labels selected (This ticket has been selected for the MSK software group board), bug, and readyForImplementation (This ticket has been moved to the "ready for implementation" area on the MSK software group board) Sep 14, 2020
@mhier
Member

mhier commented Sep 16, 2020

Might be connected with #109, so please do both tickets together.

@jhktimm jhktimm self-assigned this Sep 21, 2020
@phako phako self-assigned this Sep 22, 2020
@phako
Member

phako commented Sep 23, 2020

I fixed #109 and ran the tests in parallel on my machine with stress-ng in the background; no failures so far. I'd close this, and we can reopen it or create a new ticket if we see it again.

@phako phako closed this as completed Sep 23, 2020
@mhier
Member

mhier commented Sep 23, 2020

I just saw on Jenkins that the analysis job failed due to an error in one of the Rebot tests:

test 49
      Start 49: testRebotBackend

49: Test command: /scratch/build-ChimeraTK-DeviceAccess/scripts/testRebotBackend.sh
49: Test timeout computed to be: 9.99988e+06
49: Using server port 
49: Using dmap file ./uotjfgttg.dmap
49: Running 2 test cases...
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(119): error in "RebotTestClass::testConnection": exception thrown by rebotBackend.open()
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(121): error in "RebotTestClass::testConnection": check rebotBackend.isOpen() == true failed [false != true]
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(124): error in "RebotTestClass::testConnection": exception thrown by rebotBackend.open()
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(125): error in "RebotTestClass::testConnection": check rebotBackend.isOpen() == true failed [false != true]
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(127): error in "RebotTestClass::testConnection": exception thrown by rebotBackend.close()
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(132): error in "RebotTestClass::testConnection": exception thrown by rebotBackend.close()
49: unknown location(0): fatal error in "RebotTestClass::testReadWriteAPIOfRebotBackend": std::runtime_error: shutdown: Transport endpoint is not connected
49: /scratch/source/tests/unitTestsNotUnderCtest/testRebotBackend.cpp(133): last checkpoint
49: 
49: *** 7 failures detected in test suite "Rebot backend test suite"
49: Testing backed using the IP addess with protocol version 0 failed!

@mhier mhier reopened this Sep 23, 2020
@phako
Member

phako commented Sep 23, 2020

That's an unfortunate race condition in the shell script that I did not foresee and that, interestingly, didn't happen even with maxed-out CPUs.

@phako phako closed this as completed in bbf050c Sep 23, 2020
@phako phako reopened this Sep 28, 2020
@phako
Member

phako commented Sep 28, 2020

It is back, in ChimeraTK-DeviceAccess-analysis build 621:

unknown location(0): fatal error in "testWriteTimeout": std::exception: Bad file descriptor
12: /scratch/source/tests/executables_src/testRebotConnectionTimeouts.cpp(79): last checkpoint
12:

@phako
Member

phako commented Sep 28, 2020

Looking at the Rebot client code: in very bad circumstances, we might send the hello request to the server before the socket is actually connected, because there is no check that async_connect() has actually finished. But that, again, would result in a ctk::runtime_error, not a std::runtime_error.
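For illustration, the missing check can be sketched with plain POSIX sockets instead of the asynchronous API the Rebot client uses. This is only a sketch under assumptions: `waitForConnect` and `connectThenSendHello` are hypothetical names, the "hello" payload is a placeholder rather than the real Rebot hello message, and a throw-away loopback listener stands in for the dummy server. The idea is to start a non-blocking connect, wait until the socket becomes writable, read `SO_ERROR` to confirm the connect succeeded, and only then send.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: wait until a non-blocking connect() on fd has finished.
// Returns 0 if the socket is connected, otherwise an errno value.
int waitForConnect(int fd, int timeoutMs) {
  assert(fd >= 0);
  pollfd pfd{};
  pfd.fd = fd;
  pfd.events = POLLOUT; // the socket becomes writable once connect() completes
  int rc = poll(&pfd, 1, timeoutMs);
  if (rc == 0) return ETIMEDOUT;
  if (rc < 0) return errno;
  int err = 0;
  socklen_t len = sizeof(err);
  // SO_ERROR carries the result of the pending connect (0 means success).
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0) return errno;
  return err;
}

// Hypothetical demo: connect to a local listener and send a placeholder
// "hello" only after the connection is confirmed. Returns 0 on success,
// otherwise the errno value of the step that failed.
int connectThenSendHello() {
  // Throw-away listener on loopback; the kernel picks a free port.
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0) return errno;
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;
  socklen_t addrLen = sizeof(addr);
  if (bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      listen(listener, 1) < 0 ||
      getsockname(listener, reinterpret_cast<sockaddr*>(&addr), &addrLen) < 0) {
    int e = errno;
    close(listener);
    return e;
  }

  int client = socket(AF_INET, SOCK_STREAM, 0);
  if (client < 0) {
    int e = errno;
    close(listener);
    return e;
  }
  fcntl(client, F_SETFL, fcntl(client, F_GETFL, 0) | O_NONBLOCK);

  int result = 0;
  // A non-blocking connect usually returns -1 with EINPROGRESS.
  if (connect(client, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 &&
      errno != EINPROGRESS) {
    result = errno;
  }
  // The check that is missing according to the comment above:
  if (result == 0) result = waitForConnect(client, 1000);
  if (result == 0) {
    const char hello[] = "hello"; // placeholder payload, not the Rebot protocol
    if (send(client, hello, sizeof(hello), 0) < 0) result = errno;
  }
  close(client);
  close(listener);
  return result;
}
```

A return value of 0 from `connectThenSendHello()` means the payload was sent only after the connection was confirmed; any other value is the errno of the step that failed.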

@phako
Member

phako commented Sep 29, 2020

The std::runtime_error points towards the dummy server because problems in the client will result in ctk::runtime_error.

The EBADF probably comes from trying to access a socket that is not yet ready. The exception happens in the call to device->open(), since that is between the two checkpoints at lines 79 and 83.

Maybe it is also related to the fact that this is the point where the server is started the second time in the test and there's some dangling initialization.

However, I could not really reproduce it: not with load, not without load, with or without ASAN, inside or outside Docker.
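One way to narrow down the EBADF hypothesis: on POSIX systems, EBADF ("Bad file descriptor") is reported when the descriptor itself is not a valid open descriptor, for example after close(), whereas an open TCP socket that is simply not connected typically yields ENOTCONN ("Transport endpoint is not connected") from shutdown(). The sketch below shows both cases with plain sockets. The function names are hypothetical and nothing here is taken from the Rebot backend or the dummy server, so whether either path matches the actual failure is an assumption.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

// errno reported by shutdown() on a TCP socket that was created
// but never connected (the descriptor itself is valid and open).
int errnoShutdownUnconnected() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return errno;
  int err = (shutdown(fd, SHUT_RDWR) < 0) ? errno : 0;
  close(fd);
  return err;
}

// errno reported when a descriptor is used after it has been closed.
int errnoUseAfterClose() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return errno;
  close(fd);
  char byte = 0;
  return (write(fd, &byte, 1) < 0) ? errno : 0;
}
```

On Linux the first function returns ENOTCONN and the second returns EBADF. If the same distinction holds in the failing test, the "Bad file descriptor" message would point at a descriptor that was closed or never opened rather than at a socket that is merely not connected yet.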

@phako phako removed their assignment Sep 29, 2020
@phako phako added the postponed label (A ticket that might have been postponed from being worked on for various reasons) Sep 29, 2020
4 participants