Segfault when using mini-ranks-per-rank #220

Closed
DrMicrobit opened this issue Nov 16, 2013 · 9 comments

@DrMicrobit

When using
mpiexec -n 1 /opt/biosw/ray/Ray -mini-ranks-per-rank 3 -o test -p f1.fastq f2.fastq -k 31

I get segfaults (see below) when running Ray 2.3.0 and Ray 2.2.0. Reproduced on Kubuntu 9.10 and 12.04 (two different machines).

Best,
B.

Rank 0 wrote test/RayCommand.txt

k-mer length: 31
Rank 1: assembler memory usage: 257772 KiB
Rank 2: assembler memory usage: 323312 KiB
Rank 0: assembler memory usage: 405240 KiB
Rank 1: assembler memory usage: 470780 KiB
Rank 1: Rank= 1 Size= 3 ProcessIdentifier= 10908
Rank 2: assembler memory usage: 470780 KiB
Rank 2: Rank= 2 Size= 3 ProcessIdentifier= 10908
Rank 0: assembler memory usage: 470780 KiB
Rank 0: Rank= 0 Size= 3 ProcessIdentifier= 10908
Rank 0: testing the network, please wait...

[arcadia:10908] *** Process received signal ***
[arcadia:10908] Signal: Segmentation fault (11)
[arcadia:10908] Signal code: Address not mapped (1)
[arcadia:10908] Failing at address: 0x20abc24e0
[arcadia:10908] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f5e90f23cb0]
[arcadia:10908] [ 1] /opt/biosw/ray-2.3.0/Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5976d0]
[arcadia:10908] [ 2] /opt/biosw/ray-2.3.0/Ray(_ZN13RingAllocator14registerBufferEPv+0x31) [0x596761]
[arcadia:10908] [ 3] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4d) [0x5ae73d]
[arcadia:10908] [ 4] /opt/biosw/ray-2.3.0/Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5aebac]
[arcadia:10908] [ 5] /opt/biosw/ray-2.3.0/Ray(main+0x1ed) [0x47182d]
[arcadia:10908] [ 6] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f5e90b7576d]
[arcadia:10908] [ 7] /opt/biosw/ray-2.3.0/Ray() [0x4732d1]

[arcadia:10908] *** End of error message ***

mpiexec noticed that process rank 0 with PID 10908 on node arcadia exited on signal 11 (Segmentation fault).

@sebhtml
Owner

sebhtml commented Nov 18, 2013

Hi,

With 2.2.0, it works for me:

Command:

#!/bin/bash

rm -rf popo8

mpiexec -n 1 ./Ray -mini-ranks-per-rank 8 -o popo8 \
-p data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq \

(this is from https://github.com/sebhtml/Ray-TestSuite/blob/master/robustness-tests/test-mini-ranks.sh )

I can see about 900% CPU utilization here:

Tasks: 885 total, 2 running, 881 sleeping, 0 stopped, 2 zombie
Cpu(s): 32.0%us, 4.8%sy, 0.0%ni, 63.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32874744k total, 8439424k used, 24435320k free, 13720k buffers
Swap: 50331640k total, 0k used, 50331640k free, 4763576k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12944 boisver1 20 0 2019m 1.7g 4564 R 897.0 5.5 18:39.62 Ray <=======================

@sebhtml
Owner

sebhtml commented Nov 18, 2013

With 2.3.0, I confirm that the bug is reproducible. I added it to the milestone 2.3.1.

I get:

[cp1833:14296] [ 0] /lib64/libpthread.so.0() [0x365260f500]
[cp1833:14296] [ 1] ./Ray(_ZN11DirtyBuffer9getBufferEv+0) [0x5a0880]
[cp1833:14296] [ 2] ./Ray(_ZN13RingAllocator14registerBufferEPv+0x40) [0x59f930]
[cp1833:14296] [ 3] ./Ray(_ZN15MessagesHandler23sendMessagesForMiniRankEP12MessageQueueP13RingAllocatori+0x4b) [0x5b82ab]
[cp1833:14296] [ 4] ./Ray(_ZN15MessagesHandler36sendAndReceiveMessagesForRankProcessEPP11ComputeCoreiPb+0x9c) [0x5b873c]
[cp1833:14296] [ 5] ./Ray(main+0x1d5) [0x484405]
[cp1833:14296] [ 6] /lib64/libc.so.6(__libc_start_main+0xfd) [0x365221ecdd]
[cp1833:14296] [ 7] ./Ray() [0x481631]
[cp1833:14296] *** End of error message ***
Rank 1: assembler memory usage: 734136 KiB
Rank 1: Rank= 1 Size= 8 ProcessIdentifier= 14296

@sebhtml
Owner

sebhtml commented Nov 18, 2013

Trace:

(gdb) bt
#0 DirtyBuffer::getBuffer (this=0x2b8e28671e80) at RayPlatform/memory/DirtyBuffer.cpp:70
#1 0x00000000005c5242 in RingAllocator::registerBuffer (this=0x2b94a18f5220, buffer=0x2b94d14f5ff8) at RayPlatform/memory/RingAllocator.cpp:491
#2 0x00000000005e34ad in registerMessageBuffer (this=0x7fff4b0ee4f8, outbox=0x2b94a18f5208, outboxBufferAllocator=0x2b94a18f5220, miniRanksPerRank=8)
at RayPlatform/communication/MessagesHandler.cpp:975
#3 MessagesHandler::sendMessagesForMiniRank (this=0x7fff4b0ee4f8, outbox=0x2b94a18f5208, outboxBufferAllocator=0x2b94a18f5220, miniRanksPerRank=8)
at RayPlatform/communication/MessagesHandler.cpp:230
#4 0x00000000005e38f4 in MessagesHandler::sendAndReceiveMessagesForRankProcess (this=0x7fff4b0ee4f8, cores=0x7fff4b0ef900, miniRanksPerRank=8, communicate=0x7fff4b0ee4f0)
at RayPlatform/communication/MessagesHandler.cpp:77
#5 0x0000000000487c14 in startMiniRanks (this=0x7fff4b0ee4e0) at RayPlatform/RayPlatform/core/RankProcess.h:288
#6 RankProcess::run (this=0x7fff4b0ee4e0) at RayPlatform/RayPlatform/core/RankProcess.h:232
#7 0x0000000000487f67 in main (argc=8, argv=0x7fff4b0efc08) at code/application_core/ray_main.cpp:32

@sebhtml
Owner

sebhtml commented Nov 18, 2013

There is a buffer problem:

Ray: RayPlatform/memory/RingAllocator.cpp:265: int RingAllocator::getBufferHandle(void*): Assertion `bufferValue >= originValue' failed.
Error: buffer is too low:
0x2b5538d60ff8 but base is 0x2b5538e5c040

(difference: 0xFB048)
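
To make the failed assertion concrete, here is a minimal sketch of how a slab-style allocator might turn a buffer address into a handle, and why an address below the slab origin has no valid handle. `SlabRing`, its fields, and the slot size are hypothetical and are not taken from RayPlatform; only the two addresses used in the check come from the error message above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for a ring allocator: one contiguous
// slab split into fixed-size slots. This is NOT the RayPlatform code.
struct SlabRing {
    std::uint64_t origin;     // address of the first slot
    std::uint64_t slotBytes;  // size of one slot (assumed value in the example)
    std::uint64_t slots;      // number of slots (assumed value in the example)

    // True only if the address lies inside this slab.
    bool owns(std::uint64_t buffer) const {
        return buffer >= origin && buffer - origin < slotBytes * slots;
    }

    // Slot index of a buffer, or -1 when the buffer belongs to another slab.
    long handleOf(std::uint64_t buffer) const {
        if (!owns(buffer)) {
            return -1;
        }
        return static_cast<long>((buffer - origin) / slotBytes);
    }
};
```

With `origin = 0x2b5538e5c040`, the reported buffer `0x2b5538d60ff8` sits `0xFB048` bytes below the slab, so a lookup like this cannot map it to a slot. That is consistent with the buffer having been handed out by a different allocator than the one asked to register it.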

Full stack trace:

(gdb) bt
#0 0x00000036522328a5 in raise () from /lib64/libc.so.6
#1 0x000000365223400d in abort () from /lib64/libc.so.6
#2 0x000000365222ba1e in __assert_fail_base () from /lib64/libc.so.6
#3 0x000000365222bae0 in __assert_fail () from /lib64/libc.so.6
#4 0x00000000005c595f in getBufferHandle (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:265
#5 getBufferHandle (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:487
#6 markBufferAsDirty (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:234
#7 RingAllocator::registerBuffer (this=Unhandled dwarf expression opcode 0xf3) at RayPlatform/memory/RingAllocator.cpp:526
#8 0x00000000005e3d8f in registerMessageBuffer (this=0x7fff2faa8d38, outbox=0x2b962586d208, outboxBufferAllocator=0x2b962586d220, miniRanksPerRank=3)
at RayPlatform/communication/MessagesHandler.cpp:976
#9 MessagesHandler::sendMessagesForMiniRank (this=0x7fff2faa8d38, outbox=0x2b962586d208, outboxBufferAllocator=0x2b962586d220, miniRanksPerRank=3)
at RayPlatform/communication/MessagesHandler.cpp:228
#10 0x00000000005e41d4 in MessagesHandler::sendAndReceiveMessagesForRankProcess (this=0x7fff2faa8d38, cores=0x7fff2faaa140, miniRanksPerRank=3, communicate=0x7fff2faa8d30)
at RayPlatform/communication/MessagesHandler.cpp:77
#11 0x0000000000487db4 in startMiniRanks (this=0x7fff2faa8d20) at RayPlatform/RayPlatform/core/RankProcess.h:288
#12 RankProcess::run (this=0x7fff2faa8d20) at RayPlatform/RayPlatform/core/RankProcess.h:232
#13 0x0000000000488107 in main (argc=8, argv=0x7fff2faaa448) at code/application_core/ray_main.cpp:32

@sebhtml
Owner

sebhtml commented Nov 18, 2013

Hi,

I (probably) found the issue.

In the message handler code, this was used to register the dirty buffer:

request = this->registerMessageBuffer(buffer, m_rank, destination,
tag, outboxBufferAllocator);

However, buffer is not thread-safe.

Working on a patch now.

sebhtml pushed a commit to sebhtml/RayPlatform that referenced this issue Nov 19, 2013
This patch does not yet fix the issue though.

Link: sebhtml/ray#220
Reported-by: Bastien Chevreux <[email protected]>
Signed-off-by: Sébastien Boisvert <[email protected]>
@sebhtml
Owner

sebhtml commented Nov 19, 2013

3 Mixed messages with tag 17:

Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
[cp2035:15087] *** Process received signal ***
[cp2035:15087] Signal: Aborted (6)
[cp2035:15087] Signal code: (-6)
Ray: RayPlatform/communication/Message.cpp:533: void Message::runAssertions(int, bool, bool): Assertion `m_routingSource == -1' failed.
Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a88d96160 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00Source: 0 Destination: 1 RealTag: 17Source: 0x000 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: 0 RoutingDestination: 0 MiniRankSource: 0x00 0x00 0x00 Count: 0x00 0x005 0x00 Overlay: 0x00 0x00021 0x00 Bytes: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x44 MiniRankDestination: SourceActor: 00 Buffer: DestinationActor: 0x2b5a88f939400 with RoutingSource: 44 bytes : 0 RoutingDestination: 0 MiniRankSource: 0 0x15 0x00 0x00 0x00 0x00 0x00 0x00 MiniRankDestination: 0 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Buffer: 0x1213ca0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Original messages:

[Communication] 27 microseconds, SEND Source: 0 Destination: 0 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 0 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 0 Buffer: 0x2b5a98f67000 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

[Communication] 27 microseconds, SEND Source: 0 Destination: 1 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 1 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 1 Buffer: 0x2b5a98f67fc0 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00

[Communication] 27 microseconds, SEND Source: 0 Destination: 2 RealTag: 17 Count: 5 Overlay: 21 Bytes: 44 SourceActor: 0 DestinationActor: 2 RoutingSource: -1 RoutingDestination: -1 MiniRankSource: 0 MiniRankDestination: 2 Buffer: 0x2b5a98f68f80 with 44 bytes : 0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0xff 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00

Original message buffer (analysis) from 0 to 2:

Message has 44 bytes
header is always 28 bytes
data: 16 bytes (first)
AMD Opteron is Little Endian too!

0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // kmer length is 21
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 // application data (unknown)
0x00 0x00 0x00 0x00 // source actor
0x02 0x00 0x00 0x00 // destination actor
0xff 0xff 0xff 0xff // routing source (-1)
0xff 0xff 0xff 0xff // routing destination (-1)
0x00 0x00 0x00 0x00 // minirank source (0)
0x02 0x00 0x00 0x00 // minirank destination (2)
0x00 0x00 0x00 0x00 // CRC32 (not active)
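
The byte layout above can be checked mechanically. The following is a small sketch that decodes the seven 32-bit little-endian header fields that follow the data bytes. The field order and sizes come from the analysis above; the names `DecodedHeader`, `readLE32` and `decodeHeader` are hypothetical and are not part of RayPlatform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Field order as listed in the analysis above: 28 bytes of header placed
// after the application data. Names are illustrative only.
struct DecodedHeader {
    std::int32_t sourceActor;
    std::int32_t destinationActor;
    std::int32_t routingSource;
    std::int32_t routingDestination;
    std::int32_t miniRankSource;
    std::int32_t miniRankDestination;
    std::uint32_t checksum;
};

// Read a 32-bit little-endian value regardless of host byte order.
inline std::uint32_t readLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode the header that starts right after `dataBytes` bytes of payload.
inline DecodedHeader decodeHeader(const unsigned char* message,
                                  std::size_t dataBytes) {
    const unsigned char* h = message + dataBytes;
    DecodedHeader out;
    out.sourceActor = static_cast<std::int32_t>(readLE32(h + 0));
    out.destinationActor = static_cast<std::int32_t>(readLE32(h + 4));
    out.routingSource = static_cast<std::int32_t>(readLE32(h + 8));
    out.routingDestination = static_cast<std::int32_t>(readLE32(h + 12));
    out.miniRankSource = static_cast<std::int32_t>(readLE32(h + 16));
    out.miniRankDestination = static_cast<std::int32_t>(readLE32(h + 20));
    out.checksum = readLE32(h + 24);
    return out;
}
```

Calling `decodeHeader(buffer, 16)` on the original 0-to-2 message gives a routing source of -1 and a mini-rank destination of 2. Applied to a buffer that is all zeros after the leading `0x15`, the same decoder yields a routing source of 0, which would trip the `m_routingSource == -1` assertion.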

Received message:

44 bytes:
(no header)
0x15 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00

sebhtml pushed a commit to sebhtml/RayPlatform that referenced this issue Nov 21, 2013
Link: sebhtml/ray#220
Reported-by: Bastien Chevreux <[email protected]>
Signed-off-by: Sébastien Boisvert <[email protected]>
@sebhtml
Owner

sebhtml commented Nov 21, 2013

I think this is the only related issue:

Rank 1 registered its seeds
VirtualProcessor: completed jobs: 0
Rank 1 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (VirtualProcessor: completed jobs: 0VirtualProcessor: completed jobs:
Rank 2 : VirtualCommunicator (service provided by VirtualCommunicator): 0 virtual messages generated 0 real messages (0
Rank 0 : VirtualCommunicator (service provided by VirtualCommunicator): 00 virtual messages generated 0 real messages (%)
0%)
0%)
Rank 0 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 1 freed 20971520 bytes from the path memory pool (chunks: 5)
Rank 2 freed 20971520 bytes from the path memory pool (chunks: 5)
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.
Ray: RayPlatform/communication/Message.cpp:499: void Message::runAssertions(int, bool, bool): Assertion `m_destination < size' failed.

Related?: #222

@sebhtml
Owner

sebhtml commented Nov 21, 2013

Tag is 224.

(gdb) bt
#0 0x0000003fa48328a5 in raise () from /lib64/libc.so.6
#1 0x0000003fa4834085 in abort () from /lib64/libc.so.6
#2 0x0000003fa482ba1e in __assert_fail_base () from /lib64/libc.so.6
#3 0x0000003fa482bae0 in __assert_fail () from /lib64/libc.so.6
#4 0x00000000005e3c1f in Message::runAssertions (this=0x2b1ceae91860, size=3, routing=Unhandled dwarf expression opcode 0xf3
) at RayPlatform/communication/Message.cpp:499
#5 0x00000000005e9334 in testMessage (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:2389
#6 ComputeCore::sendMessages (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:807
#7 0x00000000005ed6d9 in ComputeCore::runWithProfiler (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:555
#8 0x00000000005eea48 in ComputeCore::run (this=0x2b1ce7260048) at RayPlatform/core/ComputeCore.cpp:198
#9 0x000000000048a4bc in Machine::start (this=0x2b1ce7260040) at code/application_core/Machine.cpp:560
#10 0x00000000004841e9 in Rank_startMiniRank (object=0x2b1ce7260040) at RayPlatform/RayPlatform/core/RankProcess.h:208
#11 0x0000003fa4c07851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003fa48e890d in clone () from /lib64/libc.so.6
(gdb) f 4
#4 0x00000000005e3c1f in Message::runAssertions (this=0x2b1ceae91860, size=3, routing=Unhandled dwarf expression opcode 0xf3
) at RayPlatform/communication/Message.cpp:499
499 assert(m_destination < size);
(gdb) info locals
__PRETTY_FUNCTION__ = "void Message::runAssertions(int, bool, bool)"
(gdb) p this->m_buffer
$1 = (void *) 0x2b1cfafbbac0
(gdb) p this->m_bytes
$2 = 28
(gdb) p this->m_destination
$3 = 3
(gdb) p this->m_source
$4 = 0
(gdb) p this->m_tag
$5 = 224
(gdb) p this->m_miniRankSource
$6 = 0
(gdb) p this->m_miniRankDestination
$7 = 3
(gdb) quit
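
For reference, the condition that fails in the session above boils down to a range check on the destination. A minimal sketch follows; `destinationInRange` is a hypothetical helper, not a RayPlatform function.

```cpp
#include <cassert>

// Hypothetical helper mirroring the failed check `assert(m_destination < size)`:
// with `size` destinations, the valid values are 0 .. size - 1.
inline bool destinationInRange(int destination, int size) {
    return destination >= 0 && destination < size;
}
```

With the values printed by gdb (`m_destination = 3`, `size = 3`), the check is false, since only destinations 0, 1 and 2 exist.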

@sebhtml
Owner

sebhtml commented Nov 22, 2013

@sebhtml sebhtml closed this as completed Nov 22, 2013