
automatic batch size detection leads to system crash on Jetson AGX Orin Developer Kit (64GB DRAM) #408

Open
litongda007 opened this issue Oct 12, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@litongda007

Dear all,

I have been testing dorado on a Jetson AGX Orin Developer Kit (64GB DRAM). Whenever I use automatic batch size detection, the system crashes. Based on the memory usage reported by jtop, the crash happens when system memory is almost full. Both dorado v0.3.4 and v0.4.0 have this issue. Like Apple Silicon, Jetson platforms have a unified memory architecture: the Ampere GPU and the Arm Cortex-A78AE CPU share the same 64GB of LPDDR5 RAM. I was wondering whether dorado is optimised for this case yet.

Thanks.

@vellamike
Collaborator

Hi @litongda007 ,

Sorry for the delay in replying to you.

As you correctly identified, the issue is that this system uses unified memory. The auto batch size feature for NVIDIA devices selects the batch size that gives the best performance, which may be one that uses a very high proportion of memory. On discrete-GPU systems this is not a problem, but on the Orin, Dorado ends up consuming all of the host memory as well.

We will get a fix in for this in an upcoming release. In the meantime, your only workaround is to manually select a batch size with the -b flag.
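As a sketch, such an invocation might look like the following; the model name and input path here are placeholders, and only the `-b` flag itself is confirmed above:

```shell
# Hypothetical invocation: model name and input path are placeholders.
# Start with a conservative batch size and increase it while watching jtop.
dorado basecaller hac@latest /path/to/pod5s/ -b 256 > calls.bam
```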

@vellamike vellamike added the bug Something isn't working label Oct 17, 2023
@tijyojwad
Collaborator

Hi @litongda007 - we've addressed this in dorado v0.4.3 released yesterday. Dorado should now use at most half of the available unified memory for GPU basecalling. Please let us know if you still run into issues. Thank you!
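The cap described here could look something like the following sketch; the function and parameter names are hypothetical, not Dorado's actual code:

```python
def gpu_memory_budget(available_bytes: int, unified_memory: bool,
                      fraction: float = 0.5) -> int:
    """Return a memory budget for GPU basecalling.

    On unified-memory systems (e.g. Jetson AGX Orin), cap GPU usage at a
    fraction of the shared RAM so the host OS is not starved; on discrete
    GPUs the whole device memory can be used.
    """
    if unified_memory:
        return int(available_bytes * fraction)
    return available_bytes

# 64 GB of unified memory yields a 32 GB budget.
print(gpu_memory_budget(64 * 2**30, unified_memory=True))
```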

@litongda007
Author


Hi @tijyojwad,

Thanks for keeping me updated on this issue. I just tested it, and the issue is partially solved. Specifically, with the SUP model the issue is fixed. However, with the HAC model the issue remains. Dorado automatically identified ~50 GB of RAM for processing and tried to use only half of it, but the automatic batch size selection identified a maximum of ~3900, and the system crashed while it was testing different batch sizes (at ~3600).

Moreover, would it be possible to manually set a memory percentage limit instead of always using 50%?

Thanks again for your time and help.

@tijyojwad tijyojwad reopened this Nov 17, 2023
@tijyojwad
Collaborator

Hi @litongda007 - thanks for testing the new build! Can you post the logs of what you see? When you say it crashed, did it run out of memory (OOM)? Did it throw an exception, or was it a segfault?

Also, is dorado the only application running on the system?

@litongda007
Author

Hi @tijyojwad - thanks for following up on this issue. Sorry for my late reply, I have been travelling for a conference in the past week.

Dorado was the only application running on the system. When the SUP model was used, dorado identified 54.68GB available memory, and limited the usage to 26.34GB (please see the attached image). The final batch size was 1024 using ~21GB memory.

[Image: SUP_auto_batch]

When the HAC model was used, dorado identified 54.76GB available memory and limited the usage to 26.38GB (please see the attached image). However, the actual memory usage exceeded that limit while it was testing different batch sizes, eventually leading to an OOM and a system crash when the batch size reached about 3600. There was no exception or segfault; the entire system became unresponsive and I had to power-cycle it.

[Image: HAC_auto_batch]

Thanks.

@tijyojwad
Collaborator

tijyojwad commented Feb 6, 2024

Hi @litongda007 - I'm so sorry for the super delayed response to this. It fell off my radar.

Auto batch size should only test batch sizes up to what will fit in the allocated memory. In this case it looks like it went over, though. We'll take a look to see what's going on.
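A sketch of how such a search could be constrained to the budget; all names here are hypothetical, and the real selection logic lives inside Dorado:

```python
def safe_batch_sizes(max_batch: int, bytes_per_element: int,
                     budget_bytes: int, step: int = 64):
    """Yield only batch sizes whose estimated footprint fits the budget.

    If the per-element memory estimate is too low, a probe can still exceed
    the budget in practice, which would match the behaviour reported above.
    """
    for b in range(step, max_batch + 1, step):
        if b * bytes_per_element <= budget_bytes:
            yield b

# Assumed figures for illustration: 8 MiB per batch element, ~26 GiB budget.
sizes = list(safe_batch_sizes(4096, 8 * 2**20, 26 * 2**30))
print(sizes[-1])  # largest candidate that fits the budget
```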
