
automatic batch size detection leads to system crash on Jetson AGX Orin Developer Kit (64GB DRAM) #408

Open
litongda007 opened this issue Oct 12, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@litongda007

Dear all,

I have been testing dorado on a Jetson AGX Orin Developer Kit (64GB DRAM). Whenever I use automatic batch size detection, the system crashes. Based on the memory usage reported by jtop, the crash happens when system memory is almost full. Both dorado v0.3.4 and v0.4.0 have this issue. Like Apple Silicon, Jetson platforms have a unified memory architecture: the Ampere GPU and the Arm Cortex-A78AE CPU share the same 64GB of LPDDR5 RAM. I was wondering whether dorado is optimised for this case yet.

Thanks.

@vellamike
Collaborator

Hi @litongda007 ,

Sorry for the delay in replying to you.

As you correctly identified, the issue is that this system uses unified memory. The auto batch size feature for NVIDIA devices selects the batch size that gives the best performance, which may be one that uses a very high proportion of memory. On discrete-GPU systems this is not a problem, but on the Orin, Dorado ends up consuming all of the host memory as well.

We will get a fix in for this in an upcoming release. In the meantime, your only workaround is to manually select a batch size with the -b flag.
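As a sketch, such an invocation might look like the following; the model name and input path here are placeholders, and only the `-b` flag itself is confirmed above:

```shell
# Hypothetical invocation: model name and input path are placeholders.
# Start with a conservative batch size and increase it while watching jtop.
dorado basecaller hac@latest /path/to/pod5s/ -b 256 > calls.bam
```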

@vellamike vellamike added the bug Something isn't working label Oct 17, 2023
@tijyojwad
Collaborator

Hi @litongda007 - we've addressed this in dorado v0.4.3 released yesterday. Dorado should now use at most half of the available unified memory for GPU basecalling. Please let us know if you still run into issues. Thank you!
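The cap described here could look something like the following sketch; the function and parameter names are hypothetical, not Dorado's actual code:

```python
def gpu_memory_budget(available_bytes: int, unified_memory: bool,
                      fraction: float = 0.5) -> int:
    """Return a memory budget for GPU basecalling.

    On unified-memory systems (e.g. Jetson AGX Orin), cap GPU usage at a
    fraction of the shared RAM so the host OS is not starved; on discrete
    GPUs the whole device memory can be used.
    """
    if unified_memory:
        return int(available_bytes * fraction)
    return available_bytes

# 64 GB of unified memory yields a 32 GB budget.
print(gpu_memory_budget(64 * 2**30, unified_memory=True))
```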

@litongda007
Author


Hi @tijyojwad,

Thanks for keeping me updated on this issue. I just tested it, and the issue is partially solved. Specifically, with the SUP model the issue is fixed. However, with the HAC model the issue remains. Dorado automatically identified ~50 GB of RAM for processing and tried to use only half of it, but the automatic batch size selection identified a maximum of ~3900, and the system crashed while it was testing different batch sizes (at ~3600).

Moreover, would it be possible to manually set a memory percentage limit instead of always using 50%?

Thanks again for your time and help.

@tijyojwad tijyojwad reopened this Nov 17, 2023
@tijyojwad
Collaborator

Hi @litongda007 - thanks for testing the new build! Can you post the logs of what you see? When you say it crashed, did it run out of memory (OOM)? Did it throw an exception, or was it a segfault?

Also, is dorado the only application running on the system?

@litongda007
Author

Hi @tijyojwad - thanks for following up on this issue. Sorry for my late reply, I have been travelling for a conference in the past week.

Dorado was the only application running on the system. When the SUP model was used, dorado identified 54.68GB available memory, and limited the usage to 26.34GB (please see the attached image). The final batch size was 1024 using ~21GB memory.

[Image: SUP_auto_batch]

When the HAC model was used, dorado identified 54.76GB available memory and limited the usage to 26.38GB (please see the attached image). However, the actual memory usage exceeded that limit while it was testing different batch sizes, eventually leading to an OOM and a system crash when the batch size reached about 3600. There was no exception or segfault; the entire system became unresponsive and I had to power-cycle it.

[Image: HAC_auto_batch]

Thanks.

@tijyojwad
Collaborator

tijyojwad commented Feb 6, 2024

Hi @litongda007 - I'm so sorry for the super delayed response to this. It fell off my radar.

Auto batch size should only test batch sizes up to what will fit in the allocated memory. In this case it looks like it went over, though. We'll take a look to see what's going on.
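A sketch of how such a search could be constrained to the budget; all names here are hypothetical, and the real selection logic lives inside Dorado:

```python
def safe_batch_sizes(max_batch: int, bytes_per_element: int,
                     budget_bytes: int, step: int = 64):
    """Yield only batch sizes whose estimated footprint fits the budget.

    If the per-element memory estimate is too low, a probe can still exceed
    the budget in practice, which would match the behaviour reported above.
    """
    for b in range(step, max_batch + 1, step):
        if b * bytes_per_element <= budget_bytes:
            yield b

# Assumed figures for illustration: 8 MiB per batch element, ~26 GiB budget.
sizes = list(safe_batch_sizes(4096, 8 * 2**20, 26 * 2**30))
print(sizes[-1])  # largest candidate that fits the budget
```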
