Segfault when using mini-ranks-per-rank #220
Comments
Hi,
With 2.2.0, it works for me.
Command:
#!/bin/bash
rm -rf popo8
mpiexec -n 1 ./Ray -mini-ranks-per-rank 8 -o popo8
(This is from https://github.com/sebhtml/Ray-TestSuite/blob/master/robustness-tests/test-mini-ranks.sh )
I can see the 900% CPU utilization here:
Tasks: 885 total, 2 running, 881 sleeping, 0 stopped, 2 zombie
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
With 2.3.0, I confirm that the bug is reproducible. I added it to the milestone 2.3.1.
I get:
[cp1833:14296] [ 0] /lib64/libpthread.so.0() [0x365260f500]
Trace:
(gdb) bt
There is a buffer problem:
Ray: RayPlatform/memory/RingAllocator.cpp:265: int RingAllocator::getBufferHandle(void*): Assertion `bufferValue >= originValue' failed.
(difference: 0xFB048)
Full stack trace:
(gdb) bt
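To illustrate what an assertion like `bufferValue >= originValue` guards against, here is a minimal sketch of a ring allocator that maps a pointer back to a buffer handle. `SketchRingAllocator`, `bufferAt` and `handleOf` are hypothetical names invented for this example and are not RayPlatform's actual code. The idea that the failing pointer came from a different allocator is only one possible explanation; the thread does not confirm it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified ring allocator (NOT RayPlatform's RingAllocator).
class SketchRingAllocator {
public:
    SketchRingAllocator(std::size_t buffers, std::size_t bytesPerBuffer)
        : m_bytesPerBuffer(bytesPerBuffer), m_memory(buffers * bytesPerBuffer) {}

    // Address of the buffer with the given handle.
    void* bufferAt(std::size_t handle) {
        assert(handle * m_bytesPerBuffer < m_memory.size());
        return m_memory.data() + handle * m_bytesPerBuffer;
    }

    // Handle for a pointer, or -1 if the pointer is not inside this allocator.
    int handleOf(const void* buffer) const {
        const auto origin = reinterpret_cast<std::uintptr_t>(m_memory.data());
        const auto value = reinterpret_cast<std::uintptr_t>(buffer);
        if (value < origin) {
            return -1;  // the situation an assertion like the one in the log rejects
        }
        const std::size_t offset = value - origin;
        if (offset >= m_memory.size()) {
            return -1;  // past the end of this allocator's block
        }
        return static_cast<int>(offset / m_bytesPerBuffer);
    }

private:
    std::size_t m_bytesPerBuffer;
    std::vector<unsigned char> m_memory;
};
```

In this sketch, asking one allocator for the handle of a pointer that another allocator owns returns -1 instead of aborting, which makes a mix-up between per-thread allocators easier to diagnose.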
Hi,
I (probably) found the issue. In the message handler code, this was used to register the dirty buffer:
request = this->registerMessageBuffer(buffer, m_rank, destination,
However, buffer is not thread-safe. Working on a patch now.
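As a generic illustration of the thread-safety concern described in this comment, the sketch below guards a shared registry of in-flight buffers with a mutex so that several threads can register buffers concurrently. `SketchDirtyBufferRegistry` and `registerFromThreads` are hypothetical names made up for this example. This is not the patch that was applied to Ray, and the real fix may instead give each mini-rank its own structure.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a registry of in-flight ("dirty") send buffers.
class SketchDirtyBufferRegistry {
public:
    // Safe to call from several threads: the vector is only touched under the lock.
    void registerBuffer(void* buffer) {
        assert(buffer != nullptr);
        std::lock_guard<std::mutex> lock(m_mutex);
        m_buffers.push_back(buffer);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_buffers.size();
    }

private:
    mutable std::mutex m_mutex;
    std::vector<void*> m_buffers;
};

// Registers `perThread` buffers from each of `threads` threads and
// returns how many registrations the registry recorded.
inline std::size_t registerFromThreads(int threads, int perThread) {
    SketchDirtyBufferRegistry registry;
    std::vector<std::vector<char>> storage(threads, std::vector<char>(perThread));
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&registry, &storage, t, perThread] {
            for (int i = 0; i < perThread; ++i) {
                registry.registerBuffer(&storage[t][i]);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return registry.size();
}
```

Without the lock, concurrent `push_back` calls on the same vector are a data race and can corrupt it, which is the general class of failure a non-thread-safe buffer registry would produce.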
This patch does not yet fix the issue though. Link: sebhtml/ray#220 Reported-by: Bastien Chevreux <[email protected]> Signed-off-by: Sébastien Boisvert <[email protected]>
3 Mixed messages with tag 17: Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion 44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Buffer: 0x1213ca0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 Original messages: [Communication] 27 microseconds, SEND Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActo [Communication] 27 microseconds, SEND Source: 0 Destination: 1 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActo [Communication] 27 microseconds, SEND Source: 0 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActo Original message buffer (analysis) from 0 to 2: Message has 44 bytes 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // kmer length is 21 Received message: 44 bytes: |
Link: sebhtml/ray#220 Reported-by: Bastien Chevreux <[email protected]> Signed-off-by: Sébastien Boisvert <[email protected]>
I think this is the only related issue:
Rank 1 registered its seeds
Related?: #222
Tag is 224.
(gdb) bt
When using
mpiexec -n 1 /opt/biosw/ray/Ray -mini-ranks-per-rank 3 -o test -p f1.fastq f2.fastq -k 31
I get segfaults (see below) when running Ray 2.3.0 and Ray 2.2.0. Reproduced on Kubuntu 9.10 and 12.04 (two different machines).
Best,
B.
Rank 0 wrote test/RayCommand.txt
k-mer length: 31
Rank 1: assembler memory usage: 257772 KiB
Rank 2: assembler memory usage: 323312 KiB
Rank 0: assembler memory usage: 405240 KiB
Rank 1: assembler memory usage: 470780 KiB
Rank 1: Rank= 1 Size= 3 ProcessIdentifier= 10908
Rank 2: assembler memory usage: 470780 KiB
Rank 2: Rank= 2 Size= 3 ProcessIdentifier= 10908
Rank 0: assembler memory usage: 470780 KiB
Rank 0: Rank= 0 Size= 3 ProcessIdentifier= 10908
Rank 0: testing the network, please wait...
[arcadia:10908] *** Process received signal ***
[arcadia:10908] Signal: Segmentation fault (11)
[arcadia:10908] Signal code: Address not mapped (1)
[arcadia:10908] Failing at address: 0x20abc24e0
[arcadia:10908] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5e90f23cb0]
[arcadia:10908] [ 1] /opt/biosw/ray-2.3.0/Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5976d0]
[arcadia:10908] [ 2] /opt/biosw/ray-2.3.0/Ray(_ZN13RingAllocator14registerBufferEPv+0x31) [0x596761]
[arcadia:10908] [ 3] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4d) [0x5ae73d]
[arcadia:10908] [ 4] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5aebac]
[arcadia:10908] [ 5] /opt/biosw/ray-2.3.0/Ray(main+0x1ed) [0x47182d]
[arcadia:10908] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f5e90b7576d]
[arcadia:10908] [ 7] /opt/biosw/ray-2.3.0/Ray() [0x4732d1]
[arcadia:10908] *** End of error message ***
mpiexec noticed that process rank 0 with PID 10908 on node arcadia exited on signal 11 (Segmentation fault).
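The frames in the backtrace above are printed as mangled C++ symbols. As a reading aid, here is a small sketch that demangles them with `abi::__cxa_demangle` from `<cxxabi.h>` (available with GCC and Clang). The helper name `demangle` is made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name,
// or the input unchanged if it cannot be demangled.
inline std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    char* result = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || result == nullptr) {
        return mangled;
    }
    std::string text(result);
    std::free(result);  // __cxa_demangle allocates the result with malloc
    return text;
}
```

Applied to frames 1 and 2, `_ZN11DirtyBuffer9getBufferEv` becomes `DirtyBuffer::getBuffer()` and `_ZN13RingAllocator14registerBufferEPv` becomes `RingAllocator::registerBuffer(void*)`, so the crash occurs in `DirtyBuffer::getBuffer()`, called from `RingAllocator::registerBuffer(void*)`. The `c++filt` command-line tool performs the same translation.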