
Nvidia L4 GPU #459

Open
alexinthis opened this issue Nov 8, 2023 · 13 comments

Comments

@alexinthis

Hi All,

I have read through a lot of tickets here. I understand the software is optimised for the A100/H100, but they are very costly. I am looking to build a single server for a research team, and the new L4 hits the right price point and gives good performance, particularly for single precision (float). The reason the H100 and A100 could be a lot faster is their ability to do double precision calculations.

For example
L4 = 30,300 GFLOPS single (float), or 490 GFLOPS double (really low on the doubles).
H100 = 51,200 GFLOPS single (float), or 25,600 GFLOPS double (a much higher ratio on the doubles).
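For illustration, the single-to-double throughput ratios implied by those figures can be computed directly (the numbers are the GFLOPS values quoted above, taken from public spec sheets):

```python
# Throughput figures quoted in this comment (GFLOPS).
specs = {
    "L4":   {"fp32": 30_300, "fp64": 490},
    "H100": {"fp32": 51_200, "fp64": 25_600},
}

for gpu, s in specs.items():
    # A high fp32/fp64 ratio means double precision is comparatively weak.
    ratio = s["fp32"] / s["fp64"]
    print(f"{gpu}: fp32/fp64 ratio = {ratio:.1f}:1")
```

This makes the asymmetry concrete: roughly 62:1 on the L4 versus 2:1 on the H100.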

Looking at the code there seems to be a lot of usage of the type float, however I do see double being used in some areas.

Can anyone give any advice on this particular area of GPU selection?

@vellamike
Collaborator

The reason the H100 and A100 are a lot faster for Dorado is mainly their ability to run INT8 operations during basecalling. We don't use fp64 anywhere on the GPU.

The L4 appears to have decent specs, but Dorado isn't currently optimised for it (it will fall back to fp16 instead of INT8).

Depending on the price, you may find the L4 to be a more cost-effective option. It's unlikely you will be able to infer real-world performance from the specs, though; your best bet is to benchmark.

Mike
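Mike's "benchmark it" advice could be sketched as a small timing harness. The dorado invocation in the comment below is illustrative only (the model name and input path are assumptions, not a verified command line); the demonstration times a trivial command instead:

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run a command to completion and return wall-clock seconds elapsed."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Illustrative dorado run on a fixed input set (adjust model/paths to your data):
# elapsed = time_command(["dorado", "basecaller", "hac", "pod5s/"])

# Demonstration with a trivial command:
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.3f}s")
```

Running the same fixed input set on each candidate GPU gives a directly comparable samples-per-second figure, which the spec sheets cannot.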

@ymcki

ymcki commented Nov 9, 2023

Dear Mike, L4's spec says it supports Tensor Core INT8:
https://www.nvidia.com/en-us/data-center/l4/
What do you mean by "it will fall back to FP16"?

@vellamike
Collaborator

vellamike commented Nov 9, 2023 via email

@ymcki

ymcki commented Nov 10, 2023

Dear Mike, thanks for your follow up.
Can you further clarify what you mean by "due to other features of this GPU"? Is this GPU missing some hardware feature that the A100/H100 have? What is the name of this hardware feature?

My understanding is that the L4 is based on AD104 whereas the A100/H100 are GA100/GH100. So if dorado is optimized for the A100/H100, it should also be optimized for the A800/H800. Presumably, it should also be optimized and run INT8 on the A30, as it is also a GA100 card, though with slower VRAM. Is that correct?
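As a sanity check on that architecture reasoning, here is a small lookup sketch of the dies and CUDA compute capabilities of the GPUs discussed in this thread (figures from NVIDIA's public product specs; treat this as illustrative, not authoritative):

```python
# GPU model -> (die, CUDA compute capability)
GPUS = {
    "A100": ("GA100", (8, 0)),
    "A800": ("GA100", (8, 0)),
    "A30":  ("GA100", (8, 0)),
    "H100": ("GH100", (9, 0)),
    "H800": ("GH100", (9, 0)),
    "L4":   ("AD104", (8, 9)),
    "L40S": ("AD102", (8, 9)),
}

# GPUs sharing the A100's die, and hence its per-SM resources:
same_die_as_a100 = [g for g, (die, cc) in GPUS.items() if die == "GA100"]
print(same_die_as_a100)
```

This supports the point in the question: the A800 and A30 share the A100's GA100 die, while the L4 and L40S are Ada (compute capability 8.9) parts.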

@shenker

shenker commented Nov 10, 2023

Will the L40S also use fp16 instead of int8 kernels? (Is this something that could be fixed in a future update or, like @ymcki asked, is it due to intrinsic hardware limitations?)

@vellamike
Collaborator

Hi, apologies for the delay in replying to your follow up questions.

Can you further clarify what you mean by "due to other features of this GPU"? Is this GPU missing some hardware feature that the A100/H100 have? What is the name of this hardware feature?

Essentially, the shared cache of the AD104 is smaller than that of the GA100/GH100 (128KB vs 164KB), which means that our INT8 kernels currently cannot execute on the L4; this will also apply to the L40S.

Adding support for INT8 to these lower-SMEM GPUs is something we are investigating, so it is likely Dorado will be faster on these GPUs in the future.

@ymcki

ymcki commented Dec 5, 2023

Thanks for your reply. Based on my research, L1/shared memory size seems to be defined by the compute capability version. Here is a list of compute capabilities that also have Tensor Core INT8 support, with their shared memory sizes:

| Shared memory | Compute capability |
|---------------|--------------------|
| 228KB         | 9.0                |
| 164KB         | 8.0                |
| 100KB         | 8.9                |
| 100KB         | 8.6                |
| 64KB          | 7.5                |

It would be great news for labs on a limited budget if 100KB could also be supported.
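A minimal sketch of the eligibility check implied by the table above, assuming the current INT8 kernels need at least the GA100's 164KB of shared memory per SM. The 164KB threshold is inferred from this thread, not from Dorado's source, so treat it as an assumption:

```python
# Compute capability -> shared memory per SM (KB), per the table above.
SMEM_PER_SM_KB = {
    (9, 0): 228,
    (8, 0): 164,
    (8, 9): 100,
    (8, 6): 100,
    (7, 5): 64,
}

# Assumption from this thread: INT8 kernels need GA100-class shared memory.
INT8_SMEM_THRESHOLD_KB = 164

def int8_kernels_fit(compute_capability):
    """True if the current INT8 kernels would fit in this GPU's shared memory."""
    return SMEM_PER_SM_KB[compute_capability] >= INT8_SMEM_THRESHOLD_KB

print(int8_kernels_fit((8, 0)))  # A100/A30 -> True
print(int8_kernels_fit((8, 9)))  # L4/L40S -> False
```

Under this assumption, only compute capability 8.0 and 9.0 parts run the INT8 path today, matching Mike's description of the L4/L40S fp16 fallback.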

@ymcki

ymcki commented Dec 5, 2023

On paper, the GA100-based A30's performance is about half that of the A100.

I remember someone saying that without INT8 support, performance is halved. Since the A30 also has 164KB of SMEM, I suppose it is using INT8 all the time without a performance penalty.

So for the time being, can we expect the A30 to have performance similar to the L40, making it a good choice for labs on a limited budget?

@CanWood

CanWood commented Jan 29, 2024

@vellamike :

Adding support for INT8 to these lower-SMEM GPUs is something we are investigating, so it is likely Dorado will be faster on these GPUs in the future.

We're in conversations with vendors and leaning towards L40S GPUs to serve some of our user base, and dorado is heavily used on our site. We'd be interested to know how these investigations are going; depending on the progress, we may pick up the L40S systems in the hope that their dorado performance will improve.

Cheers

@shenker

shenker commented Jan 29, 2024

We also ended up putting in an order for a number of L40S cards, and we hope that INT8 kernel support is added soon.

@tnn111

tnn111 commented Jan 29, 2024 via email

@grendeloz

It's not comprehensive but there are some benchmarks at:
https://github.com/quadram-institute-bioscience/dorado-gpu-benchmarking

@tnn111

tnn111 commented Jan 30, 2024 via email
