
Tracing example crashes with bad alloc due to small stack size #67

Closed
jellehierck opened this issue Feb 16, 2024 · 13 comments
@jellehierck commented Feb 16, 2024

When running the tracing_example, my program crashes when a trace session is stopped with the following error message:

terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

The crash happens at line 87 (app.StopTraceSession();). When using the debugger, I found that the program crashes when the trace aggregator thread is joined in app.cc:218.

After the crash, a data1.perfetto file is created but only contains ~30 loops for me. This made me suspect that the crash might be due to a small stack size as you mentioned in your blog (part 4).

Note that this crash also occurs when I try to implement tracing in my own code, i.e. not just the tracing_example program.

Fix: increase thread stack size

Edit: this "fix" might make the problem occur less often, but not disappear (see this comment).

The default stack size set in ThreadConfig is 8 MB. Increasing the stack size to 16 MB fixed the crash for me. I did this by adding the following line to the thread_config section in tracing_example/main.cc:64:

int main() {
  cactus_rt::CyclicThreadConfig thread_config;
  thread_config.period_ns = 1'000'000;
  thread_config.cpu_affinity = std::vector<size_t>{2};
  thread_config.SetFifoScheduler(80);
  thread_config.stack_size = 16 * 1024 * 1024;  // <--- set larger stack size

  ...

After this fix, the program does not crash for me anymore and a correct data1.perfetto file is created with the entire trace.
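
For anyone who wants to double-check that the configured stack size is actually applied to the spawned thread, a minimal diagnostic along these lines could be called from inside the thread. This is just a sketch, not part of cactus-rt, and pthread_getattr_np is a glibc-specific extension:

#include <pthread.h>
#include <cstdio>

// Print the stack size of the calling thread (glibc-specific).
void PrintCurrentThreadStackSize() {
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) != 0) {
    return;
  }
  void*  stack_addr = nullptr;
  size_t stack_size = 0;
  pthread_attr_getstack(&attr, &stack_addr, &stack_size);
  std::printf("thread stack size: %zu bytes\n", stack_size);
  pthread_attr_destroy(&attr);
}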

@shuhaowu (Contributor)

Thanks for the detailed report. We'll take a look once we get a bit of time.

@shuhaowu (Contributor)

What platform are you trying this on? Arm64 or x86? What's the procedure you used to reproduce this? I can't seem to reproduce this problem on my setup. IIRC the trace data is not stored on the stack, so I'm not sure why increasing the stack size would help you.

@jellehierck (Author)

I am running Ubuntu 20.04 (kernel version 5.15.129-rt67) on x86.

To reproduce, I clone the project, build it in Release, and then run the tracing example:

make release
./build/release/examples/tracing_example/rt_tracing_example

The program crashes after 5 seconds, which is when the first trace is written to a file.

When I apply the "fix" I mentioned above, I now get the terminate called after throwing an instance of 'St9bad_alloc' error after 15 seconds, i.e. when the second trace is written.

This is interesting, and indeed shows that the increased stack size does not fix the problem. I have updated the original question to reflect this.

I will try to do some more testing on my end. If you want me to run specific tests or benchmarks, let me know.

@jellehierck (Author)

I also repeated this on another machine running Ubuntu 20.04 (kernel version 5.15.137-rt71) on x86.

On this machine, leaving the stack size unchanged resulted in a crash after 5 seconds. Adjusting the stack size let the program finish without problems. Could the kernel version be the culprit?

I will try this on a third machine running another version of PREEMPT_RT next week.

@shuhaowu (Contributor)

Interesting. I don't have a 20.04 / 5.15 kernel setup to test with. If you can get a core dump, that might also be helpful.

@jellehierck (Author) commented Mar 11, 2024

Coredump of the tracing example built in Debug, running on my Ubuntu 20.04, kernel 5.15.137-rt71 machine:

git clone https://github.com/cactusdynamics/cactus-rt.git
cd cactus-rt
make debug
./build/debug/examples/tracing_example/rt_tracing_example
Coredump
jelle@jelle-laptop-ubuntu:~/temp/cactus-rt$ coredumpctl info 
           PID: 10698 (rt_tracing_exam)
           UID: 1000 (jelle)
           GID: 1000 (jelle)
        Signal: 6 (ABRT)
     Timestamp: Mon 2024-03-11 09:48:21 CET (1min 10s ago)
  Command Line: ./build/debug/examples/tracing_example/rt_tracing_example
    Executable: /home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example
 Control Group: /user.slice/user-1000.slice/[email protected]/vte-spawn-0ed22e58-60c8-44a3-8457-09a6d754ca33.scope
          Unit: [email protected]
     User Unit: vte-spawn-0ed22e58-60c8-44a3-8457-09a6d754ca33.scope
         Slice: user-1000.slice
     Owner UID: 1000 (jelle)
       Boot ID: <redacted>
    Machine ID: <redacted>
      Hostname: jelle-laptop-ubuntu
       Storage: /var/lib/systemd/coredump/core.rt_tracing_exam.1000.46b5967af75b46689df7204525afeebe.10698.1710146901000000000000.lz4
       Message: Process 10698 (rt_tracing_exam) of user 1000 dumped core.
                
                Stack trace of thread 10699:
                #0  0x00007f427175700b __GI_raise (libc.so.6 + 0x4300b)
                #1  0x00007f4271736859 __GI_abort (libc.so.6 + 0x22859)
                #2  0x00007f4271b318d1 n/a (libstdc++.so.6 + 0x9e8d1)
                #3  0x00007f4271b3d37c n/a (libstdc++.so.6 + 0xaa37c)
                #4  0x00007f4271b3d3e7 _ZSt9terminatev (libstdc++.so.6 + 0xaa3e7)
                #5  0x00007f4271b3d699 __cxa_throw (libstdc++.so.6 + 0xaa699)
                #6  0x00007f4271b314e2 n/a (libstdc++.so.6 + 0x9e4e2)
                #7  0x000055de2e8193a1 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x413a1)
                #8  0x000055de2e818c12 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x40c12)
                #9  0x000055de2e81c6cc n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x446cc)
                #10 0x000055de2e81c5fa n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x445fa)
                #11 0x000055de2e81c559 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x44559)
                #12 0x000055de2e81c510 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x44510)
                #13 0x000055de2e81c4f0 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x444f0)
                #14 0x00007f4271b69df4 n/a (libstdc++.so.6 + 0xd6df4)
                #15 0x00007f427190e609 start_thread (libpthread.so.0 + 0x8609)
                #16 0x00007f4271833353 __clone (libc.so.6 + 0x11f353)
                
                Stack trace of thread 10700:
                #0  0x00007f427193ba97 n/a (libgcc_s.so.1 + 0x12a97)
                #1  0x00007f42718735f5 __GI___dl_iterate_phdr (libc.so.6 + 0x15f5f5)
                #2  0x00007f427193c6d1 _Unwind_Find_FDE (libgcc_s.so.1 + 0x136d1)
                #3  0x00007f4271938878 n/a (libgcc_s.so.1 + 0xf878)
                #4  0x00007f427193a09b _Unwind_RaiseException (libgcc_s.so.1 + 0x1109b)
                #5  0x00007f4271b3d68c __cxa_throw (libstdc++.so.6 + 0xaa68c)
                #6  0x00007f4271b314e2 n/a (libstdc++.so.6 + 0x9e4e2)
                #7  0x000055de2e831b7a n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x59b7a)
                #8  0x000055de2e83102f n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x5902f)
                #9  0x000055de2e8304e8 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x584e8)
                #10 0x000055de2e82f8ba n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x578ba)
                #11 0x000055de2e82f01a n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x5701a)
                #12 0x000055de2e82dfca n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x55fca)
                #13 0x000055de2e8002b5 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x282b5)
                #14 0x000055de2e7fb182 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x23182)
                #15 0x000055de2e8122eb n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a2eb)
                #16 0x000055de2e812244 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a244)
                #17 0x000055de2e81218e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a18e)
                #18 0x000055de2e8120c2 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a0c2)
                #19 0x000055de2e811d12 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x39d12)
                #20 0x00007f4271b69df4 n/a (libstdc++.so.6 + 0xd6df4)
                #21 0x00007f427190e609 start_thread (libpthread.so.0 + 0x8609)
                #22 0x00007f4271833353 __clone (libc.so.6 + 0x11f353)
                
                Stack trace of thread 10698:
                #0  0x00007f427190fcd7 __pthread_clockjoin_ex (libpthread.so.0 + 0x9cd7)
                #1  0x00007f4271b6a057 _ZNSt6thread4joinEv (libstdc++.so.6 + 0xd7057)
                #2  0x000055de2e818102 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x40102)
                #3  0x000055de2e7f98c2 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x218c2)
                #4  0x000055de2e7f901e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x2101e)
                #5  0x000055de2e7eae40 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x12e40)
                #6  0x00007f4271738083 __libc_start_main (libc.so.6 + 0x24083)
                #7  0x000055de2e7eaa7e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x12a7e)

@jellehierck (Author) commented Mar 11, 2024

I just noticed that you merged #70. The coredump in my comment above was created before I pulled those changes (i.e. at commit cce7512).

The same bad alloc error occurs after pulling the newest changes. Here is a coredump when running the latest changes.

Coredump
jelle@jelle-laptop-ubuntu:~/temp/cactus-rt$ coredumpctl info 
           PID: 12625 (rt_tracing_exam)
           UID: 1000 (jelle)
           GID: 1000 (jelle)
        Signal: 6 (ABRT)
     Timestamp: Mon 2024-03-11 10:16:21 CET (27s ago)
  Command Line: ./build/debug/examples/tracing_example/rt_tracing_example
    Executable: /home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example
 Control Group: /user.slice/user-1000.slice/[email protected]/vte-spawn-0ed22e58-60c8-44a3-8457-09a6d754ca33.scope
          Unit: [email protected]
     User Unit: vte-spawn-0ed22e58-60c8-44a3-8457-09a6d754ca33.scope
         Slice: user-1000.slice
     Owner UID: 1000 (jelle)
       Boot ID: 
    Machine ID: 
      Hostname: jelle-laptop-ubuntu
       Storage: /var/lib/systemd/coredump/core.rt_tracing_exam.1000.46b5967af75b46689df7204525afeebe.12625.1710148581000000000000.lz4
       Message: Process 12625 (rt_tracing_exam) of user 1000 dumped core.
            Stack trace of thread 12626:
            #0  0x00007f8fe40b100b __GI_raise (libc.so.6 + 0x4300b)
            #1  0x00007f8fe4090859 __GI_abort (libc.so.6 + 0x22859)
            #2  0x00007f8fe448b8d1 n/a (libstdc++.so.6 + 0x9e8d1)
            #3  0x00007f8fe449737c n/a (libstdc++.so.6 + 0xaa37c)
            #4  0x00007f8fe4496359 n/a (libstdc++.so.6 + 0xa9359)
            #5  0x00007f8fe4496d11 __gxx_personality_v0 (libstdc++.so.6 + 0xa9d11)
            #6  0x00007f8fe4293bff n/a (libgcc_s.so.1 + 0x10bff)
            #7  0x00007f8fe4294291 _Unwind_RaiseException (libgcc_s.so.1 + 0x11291)
            #8  0x00007f8fe449768c __cxa_throw (libstdc++.so.6 + 0xaa68c)
            #9  0x00007f8fe448b4e2 n/a (libstdc++.so.6 + 0x9e4e2)
            #10 0x00007f8fe46da418 _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi (libprotobuf.so.17 + 0x10b418)
            #11 0x00005639826ed4a9 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x434a9)
            #12 0x00005639826ec85b n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x4285b)
            #13 0x00005639826ebacd n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x41acd)
            #14 0x00005639826eb2dd n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x412dd)
            #15 0x00005639826ea656 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x40656)
            #16 0x00005639826ea1e4 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x401e4)
            #17 0x00005639826ee582 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x44582)
            #18 0x00005639826ee4b0 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x444b0)
            #19 0x00005639826ee40f n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x4440f)
            #20 0x00005639826ee3c6 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x443c6)
            #21 0x00005639826ee3a6 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x443a6)
            #22 0x00007f8fe44c3df4 n/a (libstdc++.so.6 + 0xd6df4)
            #23 0x00007f8fe4268609 start_thread (libpthread.so.0 + 0x8609)
            #24 0x00007f8fe418d353 __clone (libc.so.6 + 0x11f353)
            
            Stack trace of thread 12627:
            #0  0x00007f8fe426f7d1 futex_abstimed_wait_cancelable (libpthread.so.0 + 0xf7d1)
            #1  0x00005639826cbb00 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x21b00)
            #2  0x00005639826df94a n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3594a)
            #3  0x00005639826dd9f2 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x339f2)
            #4  0x00005639826da59a n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3059a)
            #5  0x00005639826d6941 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x2c941)
            #6  0x00005639826d24bf n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x284bf)
            #7  0x00005639826cd182 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x23182)
            #8  0x00005639826e42eb n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a2eb)
            #9  0x00005639826e4244 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a244)
            #10 0x00005639826e418e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a18e)
            #11 0x00005639826e40c2 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3a0c2)
            #12 0x00005639826e3d12 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x39d12)
            #13 0x00007f8fe44c3df4 n/a (libstdc++.so.6 + 0xd6df4)
            #14 0x00007f8fe4268609 start_thread (libpthread.so.0 + 0x8609)
            #15 0x00007f8fe418d353 __clone (libc.so.6 + 0x11f353)
            
            Stack trace of thread 12628:
            #0  0x00007f8fe414b090 __GI___clock_gettime (libc.so.6 + 0xdd090)
            #1  0x00005639826bf1a1 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x151a1)
            #2  0x00005639826bcb8d n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x12b8d)
            #3  0x00005639826bffbe n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x15fbe)
            #4  0x00005639826bfd34 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x15d34)
            #5  0x00005639826e6098 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3c098)
            #6  0x00005639826e4c51 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x3ac51)
            #7  0x00007f8fe4268609 start_thread (libpthread.so.0 + 0x8609)
            #8  0x00007f8fe418d353 __clone (libc.so.6 + 0x11f353)
            
            Stack trace of thread 12625:
            #0  0x00007f8fe4269cd7 __pthread_clockjoin_ex (libpthread.so.0 + 0x9cd7)
            #1  0x00007f8fe44c4057 _ZNSt6thread4joinEv (libstdc++.so.6 + 0xd7057)
            #2  0x00005639826ea102 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x40102)
            #3  0x00005639826cb8c2 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x218c2)
            #4  0x00005639826cb01e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x2101e)
            #5  0x00005639826bce40 n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x12e40)
            #6  0x00007f8fe4092083 __libc_start_main (libc.so.6 + 0x24083)
            #7  0x00005639826bca7e n/a (/home/jelle/temp/cactus-rt/build/debug/examples/tracing_example/rt_tracing_example + 0x12a7e)

@shuhaowu (Contributor)

I see in your latest coredump there is a line: #10 0x00007f8fe46da418 _ZN6google8protobuf8internal20RepeatedPtrFieldBase14InternalExtendEi (libprotobuf.so.17 + 0x10b418).

I'm wondering if this is a protobuf bug somewhere, because 20.04 is quite old and cactus-rt currently links against the system-level protobuf. My version:

$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf-lite23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]
libprotobuf23/jammy-updates,jammy-security,now 3.12.4-1ubuntu7.22.04.1 amd64 [installed,automatic]

@jellehierck (Author)

$ apt list --installed | grep libprotobuf

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libprotobuf-dev/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf-lite17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]
libprotobuf17/focal-security,focal-updates,now 3.6.1.3-2ubuntu5.2 amd64 [installed,automatic]

So yes, it is indeed a few versions behind.

@shuhaowu (Contributor)

Hard to say if that is the problem. I've also created a PR that checks the protobuf header version compiled against the actual installed library, in case they differ (which could cause segfaults): #75. Maybe you can try it to make sure nothing is going wrong there?

@jellehierck (Author)

Adding GOOGLE_PROTOBUF_VERIFY_VERSION; to app.cc and running the rt_tracing_example again did not cause the program to abort, so the header and library versions seem to match at least.

I also checked /usr/include/google/protobuf/stubs/common.h (where GOOGLE_PROTOBUF_VERIFY_VERSION is defined) and saw that the library version is indeed 3.6.1 (#define GOOGLE_PROTOBUF_MIN_LIBRARY_VERSION 3006001), so the headers match the libraries.
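
For reference, the check itself boils down to the macro from the protobuf headers placed early in the program, something like the sketch below. The helper name and exact call site are my own illustration, not part of cactus-rt:

#include <google/protobuf/stubs/common.h>

// Aborts with a descriptive error if the protobuf headers the program was
// compiled against do not match the libprotobuf version loaded at runtime.
void CheckProtobufVersion() {
  GOOGLE_PROTOBUF_VERIFY_VERSION;
}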

@shuhaowu (Contributor) commented Aug 2, 2024

Did you ever get closer to solving this problem?

@jellehierck (Author)

Sorry for letting this issue go stale; the project was discontinued for other reasons, so I did not look at this again. Since then you have made quite a few changes to the tracer, so I am not sure whether this issue still persists.

I will start to use cactus_rt in a new project, and reopen this issue if I come across it again.
