
Nvidia L4 GPU #459

Open
alexinthis opened this issue Nov 8, 2023 · 13 comments

Comments

@alexinthis

Hi All,

I have read through a lot of tickets here. I understand the software is optimised for the A100/H100, but they are very costly. I am looking to build a single server for a research team, and the new L4 hits the right price point and gives good performance, particularly for single precision (float). The reason the H100 and A100 could be a lot faster is their ability to do double precision calculations.

For example
L4 = 30,300 GFLOPS single (float), or 490 GFLOPS double (really low on the doubles).
H100 = 51,200 GFLOPS single (float), or 25,600 GFLOPS double (a much higher ratio on the doubles).
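For illustration, the single-to-double throughput ratios implied by those figures can be computed directly (the numbers are the GFLOPS values quoted above, taken from public spec sheets):

```python
# Throughput figures quoted in this comment (GFLOPS).
specs = {
    "L4":   {"fp32": 30_300, "fp64": 490},
    "H100": {"fp32": 51_200, "fp64": 25_600},
}

for gpu, s in specs.items():
    # A high fp32/fp64 ratio means double precision is comparatively weak.
    ratio = s["fp32"] / s["fp64"]
    print(f"{gpu}: fp32/fp64 ratio = {ratio:.1f}:1")
```

This makes the asymmetry concrete: roughly 62:1 on the L4 versus 2:1 on the H100.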

Looking at the code there seems to be a lot of usage of the type float, however I do see double being used in some areas.

Can anyone give any advice on this particular area of GPU selection?

@vellamike
Collaborator

The reason the H100 and A100 are a lot faster for Dorado is mainly their ability to run INT8 operations during basecalling. We don't use fp64 anywhere on the GPU.

The L4 appears to have decent specs, but Dorado isn't currently optimised for it (it will fall back to fp16 instead of INT8).

Depending on the price, you may find the L4 to be a more cost-effective option. It's unlikely you will be able to infer real-world performance from the specs, though; your best bet is to benchmark.

Mike
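Mike's "benchmark it" advice could be sketched as a small timing harness. The dorado invocation in the comment below is illustrative only (the model name and input path are assumptions, not a verified command line); the demonstration times a trivial command instead:

```python
import subprocess
import sys
import time

def time_command(cmd):
    """Run a command to completion and return wall-clock seconds elapsed."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

# Illustrative dorado run on a fixed input set (adjust model/paths to your data):
# elapsed = time_command(["dorado", "basecaller", "hac", "pod5s/"])

# Demonstration with a trivial command:
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"elapsed: {elapsed:.3f}s")
```

Running the same fixed input set on each candidate GPU gives a directly comparable samples-per-second figure, which the spec sheets cannot.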

@ymcki

ymcki commented Nov 9, 2023

Dear Mike, L4's spec says it supports Tensor Core INT8:
https://www.nvidia.com/en-us/data-center/l4/
What do you mean by "it will fall back to FP16"?

@vellamike
Collaborator

vellamike commented Nov 9, 2023 via email

@ymcki

ymcki commented Nov 10, 2023

Dear Mike, thanks for your follow up.
Can you further clarify what you mean by "due to other features of this GPU"? Is this GPU missing some hardware feature that the A100/H100 have? What is the name of this hardware feature?

My understanding is that the L4 is based on AD104 whereas the A100/H100 are GA100/GH100. So if dorado is optimized for the A100/H100, it should also be optimized for the A800/H800. Presumably, it should also be optimized and run INT8 on the A30, as it is also a GA100 card, though with slower VRAM. Is that correct?
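As a sanity check on that architecture reasoning, here is a small lookup sketch of the dies and CUDA compute capabilities of the GPUs discussed in this thread (figures from NVIDIA's public product specs; treat this as illustrative, not authoritative):

```python
# GPU model -> (die, CUDA compute capability)
GPUS = {
    "A100": ("GA100", (8, 0)),
    "A800": ("GA100", (8, 0)),
    "A30":  ("GA100", (8, 0)),
    "H100": ("GH100", (9, 0)),
    "H800": ("GH100", (9, 0)),
    "L4":   ("AD104", (8, 9)),
    "L40S": ("AD102", (8, 9)),
}

# GPUs sharing the A100's die, and hence its per-SM resources:
same_die_as_a100 = [g for g, (die, cc) in GPUS.items() if die == "GA100"]
print(same_die_as_a100)
```

This supports the point in the question: the A800 and A30 share the A100's GA100 die, while the L4 and L40S are Ada (compute capability 8.9) parts.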

@shenker

shenker commented Nov 10, 2023

Will the L40S also use fp16 instead of int8 kernels? (Is this something that could be fixed in a future update or, like @ymcki asked, is it due to intrinsic hardware limitations?)

@vellamike
Collaborator

Hi, apologies for the delay in replying to your follow up questions.

Can you further clarify what you mean by "due to other features of this GPU"? Is this GPU missing some hardware feature that the A100/H100 have? What is the name of this hardware feature?

Essentially, the shared cache of the AD104 is smaller than that of the GA100/GH100 (128KB vs 164KB), which means that our INT8 kernels currently cannot execute on the L4; this will also apply to the L40S.

Adding support for INT8 to these lower-SMEM GPUs is something we are investigating, so it is likely Dorado will be faster on these GPUs in the future.

@ymcki

ymcki commented Dec 5, 2023

Thanks for your reply. Based on my research, L1/shared memory size seems to be defined by the compute capability version. Here is a list of compute capabilities that also have Tensor Core INT8 support, with their shared memory sizes:

| Shared memory | Compute capability |
|---------------|--------------------|
| 228KB         | 9.0                |
| 164KB         | 8.0                |
| 100KB         | 8.9                |
| 100KB         | 8.6                |
| 64KB          | 7.5                |

It would be great news for labs on a limited budget if 100KB could also be supported.
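A minimal sketch of the eligibility check implied by the table above, assuming the current INT8 kernels need at least the GA100's 164KB of shared memory per SM. The 164KB threshold is inferred from this thread, not from Dorado's source, so treat it as an assumption:

```python
# Compute capability -> shared memory per SM (KB), per the table above.
SMEM_PER_SM_KB = {
    (9, 0): 228,
    (8, 0): 164,
    (8, 9): 100,
    (8, 6): 100,
    (7, 5): 64,
}

# Assumption from this thread: INT8 kernels need GA100-class shared memory.
INT8_SMEM_THRESHOLD_KB = 164

def int8_kernels_fit(compute_capability):
    """True if the current INT8 kernels would fit in this GPU's shared memory."""
    return SMEM_PER_SM_KB[compute_capability] >= INT8_SMEM_THRESHOLD_KB

print(int8_kernels_fit((8, 0)))  # A100/A30 -> True
print(int8_kernels_fit((8, 9)))  # L4/L40S -> False
```

Under this assumption, only compute capability 8.0 and 9.0 parts run the INT8 path today, matching Mike's description of the L4/L40S fp16 fallback.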

@ymcki

ymcki commented Dec 5, 2023

On paper, the GA100-based A30's performance is about half that of the A100.

I remember someone saying that without INT8 support, performance is halved. Since the A30 also has 164KB of SMEM, I suppose it is using INT8 all the time without a performance penalty.

So for the time being, can we expect the A30 to have performance similar to the L40, making it a good choice for labs on a limited budget?

@CanWood

CanWood commented Jan 29, 2024

@vellamike :

Adding support for INT8 to these lower-SMEM GPUs is something we are investigating, so it is likely Dorado will be faster on these GPUs in the future.

We're in conversations with vendors and leaning towards L40S GPUs to serve some of our user base, and dorado is heavily used on our site. We'd be interested to know how these investigations are going; depending on the progress, we may pick up the L40S systems in the hope that their dorado performance will improve.

Cheers

@shenker

shenker commented Jan 29, 2024

We also ended up putting in an order for a number of L40S cards, and we hope that INT8 kernel support is added soon.

@tnn111

tnn111 commented Jan 29, 2024 via email

@grendeloz

It's not comprehensive but there are some benchmarks at:
https://github.com/quadram-institute-bioscience/dorado-gpu-benchmarking

@tnn111

tnn111 commented Jan 30, 2024 via email
