Nvidia L4 GPU #459
The reason the H100 and A100 are a lot faster for Dorado is mainly their ability to run INT8 operations during basecalling. We don't use FP64 anywhere on the GPU. The L4 appears to have decent specs, but Dorado isn't currently optimised for it (it will fall back to FP16 instead of INT8). Depending on the price, you may find the L4 to be a more cost-effective option; it's unlikely you will be able to infer the performance from the specs, though, so your best bet is to benchmark. Mike
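Once you have benchmark numbers, a rough way to frame the cost-effectiveness point above is throughput per dollar. A minimal sketch (the figures below are placeholders for illustration, not real Dorado benchmarks or prices):

```python
# Illustrative sketch: compare basecalling throughput per dollar.
# All numbers below are placeholders, not real measurements.

def throughput_per_dollar(samples_per_sec: float, price_usd: float) -> float:
    """Samples basecalled per second, per dollar of hardware cost."""
    return samples_per_sec / price_usd

# Hypothetical benchmark results for two cards:
a100 = throughput_per_dollar(samples_per_sec=4.0e7, price_usd=15000)
l4 = throughput_per_dollar(samples_per_sec=1.0e7, price_usd=2500)

# Despite lower absolute throughput, the cheaper card can win per dollar.
print(f"A100: {a100:.1f} samples/s per $")
print(f"L4:   {l4:.1f} samples/s per $")
```

The point is that raw speed and value for money can rank the cards differently, which is why benchmarking on your own workload matters.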
Dear Mike, L4's spec says it supports Tensor Core INT8: https://www.nvidia.com/en-us/data-center/l4/
What do you mean by "it will fall back to FP16"?
That's correct, the L4 does support INT8; however, due to other features of this GPU, the current implementation of these kernels in Dorado will not leverage INT8 on these devices and will use FP16 instead.
Dear Mike, thanks for your follow-up. My understanding is that the L4's architecture is AD104, whereas the A100/H100 are GA100/GH100. So if Dorado is optimized for the A100/H100, then it should also be optimized for the A800/H800. Presumably, it should also be optimized and run INT8 on the A30, as it is also a GA100 card, but of course with slower VRAM. Is that correct?
Will the L40S also use FP16 instead of INT8 kernels? (Is this something that could be fixed in a future update or, like @ymcki asked, is it due to intrinsic hardware limitations?)
Hi, apologies for the delay in replying to your follow-up questions.
Essentially, the shared cache of the AD104 is smaller than that of the GA100/GH100 (128 KB vs 164 KB), which means that our INT8 kernels currently cannot execute on the L4; this will also apply to the L40S. Adding INT8 support for these lower-SMEM GPUs is something we are investigating, so it is likely Dorado will be faster on these GPUs in the future.
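The behaviour described here can be sketched as a simple capability check. This is a hypothetical illustration, not actual Dorado code; the 164 KB threshold is taken from the comment above:

```python
# Hypothetical sketch of the fallback logic described above; not Dorado source.
# Threshold from the comment: the current INT8 kernels need the 164 KB of
# shared memory per SM found on GA100/GH100.

INT8_SMEM_REQUIRED_KB = 164

def kernel_precision(shared_mem_per_sm_kb: int, supports_int8: bool) -> str:
    """Pick the kernel precision a device would get under this scheme."""
    if supports_int8 and shared_mem_per_sm_kb >= INT8_SMEM_REQUIRED_KB:
        return "int8"
    return "fp16"  # fall back when shared memory is too small

print(kernel_precision(164, True))   # GA100 (A100): int8
print(kernel_precision(100, True))   # AD10x (L4, L40S): fp16
```

This also makes clear why hardware INT8 support on the datasheet alone isn't enough: both branches above can be taken by cards that advertise Tensor Core INT8.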
Thanks for your reply. Based on my research, L1 shared memory seems to be defined by compute capability version. Here is a list of compute capabilities with Tensor Core INT8 support and their L1 shared memory sizes: [table of shared memory per compute capability]. Apparently, it will be great news for limited-budget labs if 100 KB can also be supported.
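The table referenced above can be approximated from NVIDIA's CUDA documentation. These values are as I understand them from the compute-capability tables; double-check against the current CUDA Programming Guide before relying on them:

```python
# Maximum shared memory per SM by compute capability (values as I understand
# them from NVIDIA's CUDA documentation; verify before relying on them).
MAX_SMEM_PER_SM_KB = {
    "7.5": 64,   # Turing (e.g. T4)
    "8.0": 164,  # GA100 (A100, A30)
    "8.6": 100,  # GA10x (A40, RTX 30xx)
    "8.9": 100,  # Ada (L4, L40S, RTX 40xx)
    "9.0": 228,  # GH100 (H100)
}

# Architectures with Tensor Core INT8 but under the 164 KB mark:
small = [cc for cc, kb in MAX_SMEM_PER_SM_KB.items() if kb < 164]
print(small)
```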
On paper, the GA100-based A30's performance is about half that of the A100. I remember someone said that without INT8 support, performance will be halved. Since the A30 also has 164 KB of SMEM, can we expect, for the time being, the A30 to have similar performance to the L40, such that it can be a good choice for limited-budget labs?
We're in conversations with vendors and leaning towards L40S GPUs to serve some of our user base, and Dorado is heavily used on our site. We'd be interested to know how these investigations are going as, depending on the progress, we may pick up the L40S systems in the hope that their Dorado performance will improve. Cheers
We also ended up putting in an order for a bunch of L40S's, and hope that INT8 kernel support is added soon. |
I’m split between using the A6000 Ada generation and the L40S.
Does anyone know of a comprehensive dorado benchmark for the various cards? It’d be really nice to have.
It's not comprehensive, but there are some benchmarks at: https://github.com/quadram-institute-bioscience/dorado-gpu-benchmarking
Thanks; this is really helpful!
Which A100 is it? The older 40 GB one or the newer 80 GB version?
Does anyone have benchmarks for an RTX 6000 Ada? Those come with 48 GB and I’m hoping they’ll work well.
Thanks again.
Hi All,
I have read through a lot of tickets here. I understand the software is optimised for the A100/H100; however, they are very costly. I am looking to build a single server for a research team, and the new L4 hits the right price point and gives good performance, particularly for single precision (float). The reason the H100 and A100 could be a lot faster is their ability to do double-precision calculations.
For example
L4 = 30,300 GFLOPS single (float), or 490 GFLOPS double; that's really low on the doubles.
H100 = 51,200 GFLOPS single (float), or 25,600 GFLOPS double; a much higher ratio on the doubles.
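For what it's worth, the single-to-double ratios in those figures work out as follows (quick arithmetic on the numbers quoted above):

```python
# Quick check of the single-to-double precision ratios quoted above.
l4_ratio = 30_300 / 490       # L4: single vs double GFLOPS
h100_ratio = 51_200 / 25_600  # H100: single vs double GFLOPS

print(f"L4:   {l4_ratio:.0f}x more single than double throughput")
print(f"H100: {h100_ratio:.0f}x more single than double throughput")
```

Note, though, that per the maintainer's reply above, Dorado doesn't use FP64 on the GPU at all, so the double-precision gap may not matter for basecalling.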
Looking at the code there seems to be a lot of usage of the type float, however I do see double being used in some areas.
Can anyone give any advice on this particular area of GPU selection?