JIT: Improve block unrolling, enable AVX-512 #85501

EgorBo · 2023-04-28T01:40:01Z

This PR does 3 things:

When we perform a constant Fill(..) with a non-zero value, load the value vector from the data section instead of manually crafting it from a GPR register, e.g.:

void Test(Span<byte> span) => span.Slice(0, 32).Fill(1);

; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 32
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
-      mov      rdx, 0x101010101010101
-      vmovd    xmm0, rdx
-      vpunpckldq xmm0, xmm0
-      vinsertf128 ymm0, ymm0, xmm0, 1
+      vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovdqu  ymmword ptr [rax], ymm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
-; Total bytes of code: 57
+RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
+; Total bytes of code: 40
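
To make the `RWD00` constant above concrete: each 8-byte lane is the fill byte repeated eight times. A minimal C++ sketch of how such a lane can be computed follows; the helper name `BroadcastByte` is hypothetical and is not a JIT API.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the JIT: repeats one fill byte across
// all eight bytes of a 64-bit lane by multiplying with 0x0101010101010101.
constexpr std::uint64_t BroadcastByte(std::uint8_t value)
{
    return static_cast<std::uint64_t>(value) * 0x0101010101010101ULL;
}

// Fill(1) gives the lane value that appears four times in RWD00.
static_assert(BroadcastByte(1) == 0x0101010101010101ULL, "lane pattern for Fill(1)");
```

Since the fill value is a compile-time constant, the whole vector can be precomputed and placed in the data section, which is what the single `vmovups` load relies on.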

The PR enables AVX-512 for constant-sized non-zero fills, e.g.:

void Test(Span<byte> span) => span.Slice(0, 256).Fill(1);

; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 256
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
       vmovups  zmm0, zmmword ptr [reloc @RWD00]
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+40H], zmm0
       vmovdqu32 zmmword ptr [rax+80H], zmm0
       vmovdqu32 zmmword ptr [rax+C0H], zmm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 
                0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
; Total bytes of code: 68

We no longer request a GPR reg to handle remainder if we can use SIMD and just perform an overlapped load for the last
(might still leave previous logic for the zeroing since xor gpr, gpr is cheap)
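
One possible reading of the overlapped remainder handling, sketched in portable C++. This is an illustration under assumptions, not the JIT's code: `PlanStores` and `FillUnrolled` are hypothetical names, each `memcpy` stands in for one SIMD store, and the sketch assumes the last access is simply moved back so that it ends exactly at the end of the block.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: returns the byte offsets at which `width`-byte stores
// are issued to cover `size` bytes. Assumes size >= width.
std::vector<std::size_t> PlanStores(std::size_t size, std::size_t width)
{
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    // Full-width stores while a whole vector still fits.
    for (; offset + width <= size; offset += width)
        offsets.push_back(offset);
    // Remainder: one more full-width store that overlaps the previous one
    // and ends exactly at `size`, so no narrower GPR store is needed.
    if (offset < size)
        offsets.push_back(size - width);
    return offsets;
}

// Fills dst[0..size) with `value`; each memcpy models one vector store.
void FillUnrolled(std::uint8_t* dst, std::size_t size, std::uint8_t value,
                  std::size_t width = 32)
{
    std::vector<std::uint8_t> pattern(width, value); // models the SIMD register
    for (std::size_t offset : PlanStores(size, width))
        std::memcpy(dst + offset, pattern.data(), width);
}
```

With these assumptions, a 256-byte fill at width 64 plans stores at offsets 0, 64, 128 and 192, matching the four `vmovdqu32` stores above, while a 40-byte fill at width 32 plans stores at offsets 0 and 8.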

ghost · 2023-04-28T01:40:13Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

omariom · 2023-04-28T15:49:02Z

> When we perform non-zeroed constant Fill(..) - load value vector from the data section instead of manually crafting it from a GPR register

Interesting.
Is it faster to load 32 bytes from memory, than 8 bytes from the instruction stream and then manipulate the value?

EgorBo · 2023-04-28T15:56:39Z

> Interesting. Is it faster to load 32 bytes from memory, than 8 bytes from the instruction stream and then manipulate the value?

Can you clarify the question? I'm not sure I understand. We already have a SIMD reg populated with the fill value we used previously. We can either use it to handle the remainder, or populate a new GPR reg with the value (mov reg, imm) and then perform the store, so I don't see a difference in the number of memory loads.

A SIMD store might receive a penalty for crossing a cache-line/page boundary where a smaller GPR store wouldn't. But at the same time we don't request precious GPR regs from LSRA, and the codegen is typically smaller.
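
The boundary-crossing concern can be made concrete with a small check. This is a sketch under assumptions: `CrossesBoundary` is a hypothetical helper, and 64 bytes is used as a typical x86 cache-line size (4096 would model a page).

```cpp
#include <cstdint>

// Hypothetical helper: true when an access of `width` bytes starting at
// `address` spans two `boundary`-sized blocks. Assumes `boundary` is a
// power of two and 0 < width <= boundary.
constexpr bool CrossesBoundary(std::uint64_t address, std::uint64_t width,
                               std::uint64_t boundary)
{
    return (address & (boundary - 1)) + width > boundary;
}

// A 32-byte store at offset 40 within a line crosses into the next line...
static_assert(CrossesBoundary(40, 32, 64), "wide store crosses");
// ...while an 8-byte store at the same address does not.
static_assert(!CrossesBoundary(40, 8, 64), "narrow store stays in line");
```

A narrower store can still cross a boundary (for example 8 bytes at offset 60), but it does so for fewer start addresses than a wider one, which is the trade-off weighed here against the cost of reserving a GPR.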

EgorBo · 2023-04-28T16:00:37Z

@omariom
ah, did you mean this diff:

-      mov      rdx, 0x101010101010101
-      vmovd    xmm0, rdx
-      vpunpckldq xmm0, xmm0
-      vinsertf128 ymm0, ymm0, xmm0, 1
+      vmovups  ymm0, ymmword ptr [reloc @RWD00]

? then yes, it is faster. It is also what we always do for any VectorX constants

LLVM-MCA for Skylake:

Iterations:        100
Instructions:      400
Total Cycles:      306
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    1.31
IPC:               1.31
Block RThroughput: 3.0

vs

Iterations:        100
Instructions:      100
Total Cycles:      59
Total uOps:        100

Dispatch Width:    6
uOps Per Cycle:    1.69
IPC:               1.69
Block RThroughput: 0.5

EgorBo · 2023-05-02T22:57:26Z

/azp list

EgorBo · 2023-05-02T22:58:44Z

/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs

azure-pipelines · 2023-05-02T22:59:07Z

Azure Pipelines successfully started running 2 pipeline(s).

EgorBo · 2023-05-03T09:48:42Z

@dotnet/jit-contrib PTAL, small refactoring, diffs - outerloop/jitstressregs passed

tannergooding · 2023-05-03T12:39:36Z

@EgorBo, could you adjust the diffs given above slightly?

I recently fixed the codegen to allow using:

; Previous
mov rdx, 0x101010101010101
vmovd xmm0, rdx
vpunpckldq xmm0, xmm0
vinsertf128 ymm0, ymm0, xmm0, 1

; on AVX2 capable
mov edx, 0x01010101
vmovd xmm0, edx
vpbroadcastd ymm0, xmm0

; on AVX512 capable
mov edx, 0x01010101
vpbroadcastd ymm0, edx

The movups is probably still better due to the tighter instruction stream and similar execution time when coming from L1 cache, but it's likely worth re-confirming given the updated broadcast sequence should be ~5 cycles.

src/coreclr/jit/codegenxarch.cpp

src/coreclr/jit/lowerxarch.cpp

EgorBo · 2023-05-03T12:51:20Z

I don't want to involve a temp GPR reg here; we'd better leave it to the RA for something more useful. That is also why I handle the remainder via SIMD.

EgorBo · 2023-05-03T12:59:40Z

LLVM also seems to prefer movaps even for AVX512 - https://godbolt.org/z/xofK73191

src/coreclr/jit/codegen.h

EgorBo · 2023-05-03T19:19:54Z

Failure is #85637

Clean up BLK unrolling on xarch

a4ee994

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Apr 28, 2023

ghost assigned EgorBo Apr 28, 2023

fix build

0463672

build-analysis bot mentioned this pull request Apr 28, 2023

Tracking issue for CI build timeouts #76454

Closed

EgorBo marked this pull request as ready for review April 28, 2023 09:49

Update codegenxarch.cpp

c5cc7fe

EgorBo added 6 commits April 28, 2023 23:54

fix 32bit

c6d174a

Simplify

e95ad54

fix build

2c7c62f

revert some changes

6add091

Merge branch 'main' into cleanup-blk-fill

6e54d57

clean up

b97e09f

This was referenced May 3, 2023

Failures in System.Net.Mail.Tests.SmtpClientTest tests #85637

Closed

System.IO.Tests.RandomAccess_NoBuffering.ReadUsingSingleBuffer timing out #85659

Closed