automatic batch size detection leads to system crash on Jetson AGX Orin Developer Kit (64GB DRAM) #408
Comments
Hi @litongda007, sorry for the delay in replying to you. As you correctly identified, the issue is that this system uses unified memory. The auto batch size feature for Nvidia devices selects the batch size that gives the best performance, which may be one that uses a very high proportion of memory. On discrete GPU systems this is not an issue, but on the Orin, Dorado ends up consuming all of the host memory. We will get a fix for this into an upcoming release; in the meantime, your only solution is to manually select a batch size with the
Hi @litongda007 - we've addressed this in dorado v0.4.3 released yesterday. Dorado should now use at most half of the available unified memory for GPU basecalling. Please let us know if you still run into issues. Thank you!
Hi @tijyojwad, thanks for keeping me updated on this issue. I just tested it, and the issue is partially solved. Specifically, it is fixed when the SUP model is used, but it still occurs with the HAC model. Dorado automatically identified ~50 GB of RAM for processing and tried to use only half of it. The automatic batch size selection identified the maximum as ~3900, and the system crashed while it was testing different batch sizes (at ~3600). Moreover, would it be possible to manually set a memory percentage limit instead of always using 50%? Thanks again for your time and help.
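To make the request above concrete, here is a minimal sketch of a memory budget computed from a configurable fraction of the available unified memory. The function name `memory_budget_bytes` and its `fraction` parameter are hypothetical and exist only for illustration; this is not Dorado's implementation, and Dorado may apply further adjustments that this sketch ignores.

```python
def memory_budget_bytes(available_bytes: int, fraction: float = 0.5) -> int:
    """Return how many bytes may be used, given a fraction of available memory.

    `fraction` is a hypothetical user-facing setting; 0.5 mirrors the
    "at most half" behaviour described in this thread.
    """
    if available_bytes < 0:
        raise ValueError("available_bytes must be non-negative")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the range (0, 1]")
    # Round down so the budget never exceeds the requested share.
    return int(available_bytes * fraction)
```

On a unified memory system, a smaller fraction would leave more room for the CPU side of the pipeline and the rest of the operating system.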
Hi @litongda007 - thanks for testing the new build! Can you post the logs of what you see? When you say it Also, is dorado the only application running on the system?
Hi @tijyojwad - thanks for following up on this issue. Sorry for my late reply; I have been travelling for a conference over the past week. Dorado was the only application running on the system. When the SUP model was used, dorado identified 54.68GB of available memory and limited the usage to 26.34GB (please see the attached image). The final batch size was 1024, using ~21GB of memory. When the HAC model was used, dorado identified 54.76GB of available memory and limited the usage to 26.38GB (please see the attached image). However, the actual memory usage exceeded that limit while it was testing different batch sizes, which eventually led to an OOM condition and a system crash when the batch size was around 3600. There was no exception or seg fault. The entire system became unresponsive, and I had to power cycle it. Thanks.
Hi @litongda007 - I'm so sorry for the super delayed response to this. It fell off my radar. Auto batch size should only test batch sizes up to what will fit in the allocated memory. In this case it looks like it went over though. We'll take a look to see what's going on.
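The intended behaviour described here, testing only batch sizes that fit in the allocated memory, can be sketched roughly as follows. This is an illustration under stated assumptions, not Dorado's actual code: `pick_batch_size`, `estimate_bytes`, and `benchmark` are hypothetical names, and the memory estimate is assumed to grow with batch size.

```python
from typing import Callable, Iterable, Optional


def pick_batch_size(
    candidates: Iterable[int],
    estimate_bytes: Callable[[int], int],
    benchmark: Callable[[int], float],
    budget_bytes: int,
) -> Optional[int]:
    """Return the fastest candidate batch size whose estimated memory fits.

    estimate_bytes(size): hypothetical estimate of memory needed for a batch.
    benchmark(size): hypothetical throughput measurement (higher is better).
    Candidates over budget are never benchmarked, so the search itself
    cannot allocate more than the budget (as far as the estimate is accurate).
    """
    best_size: Optional[int] = None
    best_rate = float("-inf")
    for size in sorted(candidates):
        if estimate_bytes(size) > budget_bytes:
            # Assumes the estimate is monotonic, so larger sizes also fail.
            break
        rate = benchmark(size)
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size
```

The guard only helps if the estimate is conservative. If real usage during benchmarking exceeds the estimate, as the HAC report above suggests can happen, the search can still overrun the budget.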
Dear all,
I have been testing dorado on a Jetson AGX Orin Developer Kit (64GB DRAM). Whenever I use automatic batch size detection, the system crashes. Based on the memory usage reported by jtop, the crash happens when system memory is almost full. Both dorado v0.3.4 and v0.4.0 have this issue. Like Apple Silicon, Jetson platforms have a unified memory architecture: the Ampere GPU and the Arm Cortex-A78AE CPU share the same 64GB of LPDDR5 RAM. I was wondering whether dorado is not yet optimised for this case.
Thanks.