Allow cluster sizes across m,n,k to be reported in cutlass profiler #2078

mandroid6 · 2025-02-04T23:46:00Z

Currently the CUTLASS profiler lists all the arguments passed to the benchmark, but it does not report the per-kernel values of cluster_m, cluster_n, and cluster_k.

This change updates the profiler report generation to include these arguments.

Before:

As shown below, the values for cluster_m, cluster_n, and cluster_k are missing from the kernel result.

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,,,,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633

After:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,use_pdl,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,false,1,tensorop,f32,128,128,64,1,2,1,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.235348,414.944,620633
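One way to check a report like the ones above is to compare the cluster_m, cluster_n, cluster_k columns against the cluster shape that appears in the kernel name. The following is a minimal sketch, not part of the PR. It assumes that the kernel name contains the tile shape followed by the cluster shape (here `128x128x64_1x2x1`), and it uses a trimmed-down report with only the relevant columns. The helper names `cluster_from_name`, `check_row`, and `check_report` are hypothetical.

```python
import csv
import io
import re

# Trimmed-down report: only the columns needed for the check.
# First data row mimics the "Before" output, second the "After" output.
REPORT = """Operation,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,128,128,64,,,
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,128,128,64,1,2,1
"""

# Assumption: the name contains "_<tile MxNxK>_<cluster MxNxK>_".
NAME_PATTERN = re.compile(r"_(\d+)x(\d+)x(\d+)_(\d+)x(\d+)x(\d+)_")


def cluster_from_name(name):
    """Return the (m, n, k) cluster shape encoded in a kernel name, or None."""
    match = NAME_PATTERN.search(name)
    if match is None:
        return None
    # Groups 1-3 are the tile shape, groups 4-6 the cluster shape.
    return tuple(int(v) for v in match.groups()[3:])


def check_row(row):
    """Classify one report row as 'missing', 'match', or 'mismatch'."""
    cols = [row["cluster_m"], row["cluster_n"], row["cluster_k"]]
    if any(c == "" for c in cols):
        return "missing"
    reported = tuple(int(c) for c in cols)
    return "match" if reported == cluster_from_name(row["Operation"]) else "mismatch"


def check_report(text):
    """Classify every data row of a CSV report given as a string."""
    return [check_row(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    print(check_report(REPORT))
```

Note that a "mismatch" is not necessarily a bug: if the profiler reports the cluster arguments that were passed on the command line rather than the kernel's compiled shape, the columns can legitimately differ from the name.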

Repro commands:

Build cutlass

git clone https://github.com/NVIDIA/cutlass
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS='cutlass3x_sm90_tensorop_s*16gemm_bf16_bf16_f32_bf16_bf16_*tnn*' -DCUTLASS_ENABLE_TESTS=OFF -GNinja -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=9992 -DCUTLASS_LIBRARY_OPERATIONS=Gemm
cmake --build . --target cutlass_profiler

Run profiler

./tools/profiler/cutlass_profiler --operation=Gemm --output=data --dist=gaussian,mean:0.0,stddev:1.0,scale:-1 --m=4352 --n=4096 --k=4096 --A=bf16:row --B=bf16:column --C=bf16:column --D=bf16:column


mandroid6 · 2025-02-05T00:03:44Z

@hwu36 @kerrmudgeon

hwu36 · 2025-02-05T12:04:35Z

@itramble, could you please review first?

mandroid6 · 2025-02-11T18:46:41Z

@itramble, could you help take a look? (cc @hwu36)

itramble · 2025-02-12T19:29:24Z

Hi @mandroid6, thanks for raising this. I think this was changed recently. As of today, I see:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x2x1_0_tnn_align8,incorrect,success,universal,4352,4096,4096,bf16:row,bf16:column,bf16:column,bf16:column,1,0,serial,1,1,heuristic,invalid,invalid,false,false,1,tensorop,f32,128,128,64,1,1,1,0,0,0,7,4,2,1,64,128,16,90,90,104857600,146064539648,1392,0.359317,271.783,406506

Unfortunately, this is not entirely correct either. We currently report the "cluster*" arguments that were passed to the profiler (or their defaults, see here). We do this because Blackwell adds a feature for runtime cluster shapes (described here), in addition to the static compile-time cluster shapes that were supported for Hopper. A kernel uses runtime cluster shapes when any of operation_desc.tile_description.cluster_shape.m/n/k() is 0. When none of them is 0 (true for Hopper CUTLASS kernels), your change is correct.
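The rule described in this comment can be sketched as follows. This is only an illustration of one reading of the comment, not the profiler's actual code: the `ClusterShape` type and the functions `uses_runtime_cluster` and `reported_cluster` are hypothetical names. The sketch assumes that a kernel with any zero in its compile-time cluster shape should report the shape requested through the profiler arguments, and that any other kernel should report its own compile-time shape.

```python
from typing import NamedTuple


class ClusterShape(NamedTuple):
    """Cluster extent along m, n, and k (hypothetical helper type)."""
    m: int
    n: int
    k: int


def uses_runtime_cluster(static_shape: ClusterShape) -> bool:
    """A zero in any dimension marks a runtime (dynamic) cluster shape."""
    return 0 in static_shape


def reported_cluster(static_shape: ClusterShape, requested: ClusterShape) -> ClusterShape:
    """Pick the cluster shape to write into the report for one kernel.

    static_shape: the shape compiled into the kernel description.
    requested:    the cluster arguments passed to the profiler (or defaults).
    """
    if uses_runtime_cluster(static_shape):
        # Runtime cluster shape: the value used is the one that was requested.
        return requested
    # Static cluster shape (e.g. Hopper kernels): report what was compiled in.
    return static_shape


if __name__ == "__main__":
    defaults = ClusterShape(1, 1, 1)
    # Static kernel from this thread: compiled with a 1x2x1 cluster.
    print(reported_cluster(ClusterShape(1, 2, 1), defaults))
    # Kernel with a runtime cluster shape, profiled with a 2x1x1 request.
    print(reported_cluster(ClusterShape(0, 0, 0), ClusterShape(2, 1, 1)))
```

Under this reading, the Hopper kernel in the example rows would report 1,2,1 regardless of the profiler defaults, while a kernel with a runtime cluster shape would echo the requested values.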
