Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Improve block unrolling, enable AVX-512 #85501

Merged
merged 11 commits into from
May 3, 2023

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Apr 28, 2023

This PR does 3 things:

  1. When we perform non-zeroed constant Fill(..) - load value vector from the data section instead of manually crafting it from a GPR register, E.g.:
void Test(Span<byte> span) => span.Slice(0, 32).Fill(1)
; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 32
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
-      mov      rdx, 0x101010101010101
-      vmovd    xmm0, rdx
-      vpunpckldq xmm0, xmm0
-      vinsertf128 ymm0, ymm0, xmm0, 1
+      vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovdqu  ymmword ptr [rax], ymm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
-; Total bytes of code: 57
+RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
+; Total bytes of code: 40
  1. PR enables AVX-512 for constant sized non-zero fill, e.g.:
void Test(Span<byte> span) => span.Slice(0, 256).Fill(1);
; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 256
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
       vmovups  zmm0, zmmword ptr [reloc @RWD00]
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+40H], zmm0
       vmovdqu32 zmmword ptr [rax+80H], zmm0
       vmovdqu32 zmmword ptr [rax+C0H], zmm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 
                0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
; Total bytes of code: 68
  1. We no longer request a GPR reg to handle remainder if we can use SIMD and just perform an overlapped load for the last
    (might still leave previous logic for the zeroing since xor gpr, gpr is cheap)

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 28, 2023
@ghost ghost assigned EgorBo Apr 28, 2023
@ghost
Copy link

ghost commented Apr 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR does 3 things:

  1. When we perform non-zeroed constant Fill(..) - load value vector from the data section instead of manually crafting it from a GPR register, E.g.:
void Test(Span<byte> span) => span.Slice(0, 32).Fill(1)
; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 32
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
-      mov      rdx, 0x101010101010101
-      vmovd    xmm0, rdx
-      vpunpckldq xmm0, xmm0
-      vinsertf128 ymm0, ymm0, xmm0, 1
+      vmovups  ymm0, ymmword ptr [reloc @RWD00]
       vmovdqu  ymmword ptr [rax], ymm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
-; Total bytes of code: 57
+RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
+; Total bytes of code: 40
  1. PR enables AVX-512 for constant sized non-zero fill, e.g.:
void Test(Span<byte> span) => span.Slice(0, 256).Fill(1);
; Method Proga:Test(System.Span`1[ubyte]):this
       sub      rsp, 40
       vzeroupper 
       cmp      dword ptr [rdx+08H], 256
       jb       SHORT G_M61135_IG04
       mov      rax, bword ptr [rdx]
       vmovups  zmm0, zmmword ptr [reloc @RWD00]
       vmovdqu32 zmmword ptr [rax], zmm0
       vmovdqu32 zmmword ptr [rax+40H], zmm0
       vmovdqu32 zmmword ptr [rax+80H], zmm0
       vmovdqu32 zmmword ptr [rax+C0H], zmm0
       add      rsp, 40
       ret      
G_M61135_IG04:  
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
RWD00  	dq	0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h, 
                     0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
; Total bytes of code: 68
  1. We no longer request a GPR reg to handle remainder if we can use SIMD and just perform an overlapped load for the last
    (might still leave previous logic for the zeroing since xor gpr, gpr is cheap)
Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo EgorBo marked this pull request as ready for review April 28, 2023 09:49
@omariom
Copy link
Contributor

omariom commented Apr 28, 2023

When we perform non-zeroed constant Fill(..) - load value vector from the data section instead of manually crafting it from a GPR register

Interesting.
Is it faster to load 32 bytes from memory, than 8 bytes from the instruction stream and then manipulate the value?

@EgorBo
Copy link
Member Author

EgorBo commented Apr 28, 2023

Interesting. Is it faster to load 32 bytes from memory, than 8 bytes from the instruction stream and then manipulate the value?

Can you clarify the question as I'm not sure I understand. We already have a simd reg populated with the fill value we used previously. We can either use it to handle the remainder or populate a new GPR reg with a value mov reg, imm and then perform store. so I don't see a difference here between numbers of memory loads.

SIMD store migth recieve a penalty by crossing the cache/page boundary where smaller GPR won't. But at the same time we don't request precious GPR regs from LSRA allocator and the codegen is typically smaller

@EgorBo
Copy link
Member Author

EgorBo commented Apr 28, 2023

@omariom
ah, did you mean this diff:

-      mov      rdx, 0x101010101010101
-      vmovd    xmm0, rdx
-      vpunpckldq xmm0, xmm0
-      vinsertf128 ymm0, ymm0, xmm0, 1
+      vmovups  ymm0, ymmword ptr [reloc @RWD00]

? then yes, it is faster. It is also what we always do for any VectorX constants

LLVM-MCA for Skylake:

Iterations:        100
Instructions:      400
Total Cycles:      306
Total uOps:        400

Dispatch Width:    6
uOps Per Cycle:    1.31
IPC:               1.31
Block RThroughput: 3.0

vs

Iterations:        100
Instructions:      100
Total Cycles:      59
Total uOps:        100

Dispatch Width:    6
uOps Per Cycle:    1.69
IPC:               1.69
Block RThroughput: 0.5

@EgorBo
Copy link
Member Author

EgorBo commented May 2, 2023

/azp list

@azure-pipelines

This comment was marked as duplicate.

@EgorBo
Copy link
Member Author

EgorBo commented May 2, 2023

/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@EgorBo
Copy link
Member Author

EgorBo commented May 3, 2023

@dotnet/jit-contrib PTAL, small refactoring, diffs - outerloop/jitstressregs passed

@tannergooding
Copy link
Member

@EgorBo, could you adjust the diff's given above slightly.

I recently fixed the codegen to allow using:

; Previous
mov rdx, 0x101010101010101
vmovd xmm0, rdx
vpunpckldq xmm0, xmm0
vinsertf128 ymm0, ymm0, xmm0, 1

; on AVX capable
mov edx, 0x10101010
vmovd xmm0, edx
vpbroadcastd ymm0, xmm0

; on AVX512 capable
mov edx, 0x10101010
vpbroadcastd ymm0, edx

The movups is probably still better due to the tighter instruction stream and similar execution time when coming from L1 cache, but it's likely worth re-confirming given the updated broadcast sequence should be ~5 cycles.

@EgorBo
Copy link
Member Author

EgorBo commented May 3, 2023

@EgorBo, could you adjust the diff's given above slightly.

I recently fixed the codegen to allow using:

; Previous
mov rdx, 0x101010101010101
vmovd xmm0, rdx
vpunpckldq xmm0, xmm0
vinsertf128 ymm0, ymm0, xmm0, 1

; on AVX capable
mov edx, 0x10101010
vmovd xmm0, edx
vpbroadcastd ymm0, xmm0

; on AVX512 capable
mov edx, 0x10101010
vpbroadcastd ymm0, edx

The movups is probably still better due to the tighter instruction stream and similar execution time when coming from L1 cache, but it's likely worth re-confirming given the updated broadcast sequence should be ~5 cycles.

I don't want to involve a temp GPR reg here - we'd better leave it for RA for something more useful, it is also the reason why I handle the remainder via SIMD

@EgorBo
Copy link
Member Author

EgorBo commented May 3, 2023

@EgorBo, could you adjust the diff's given above slightly.
I recently fixed the codegen to allow using:

; Previous
mov rdx, 0x101010101010101
vmovd xmm0, rdx
vpunpckldq xmm0, xmm0
vinsertf128 ymm0, ymm0, xmm0, 1

; on AVX capable
mov edx, 0x10101010
vmovd xmm0, edx
vpbroadcastd ymm0, xmm0

; on AVX512 capable
mov edx, 0x10101010
vpbroadcastd ymm0, edx

The movups is probably still better due to the tighter instruction stream and similar execution time when coming from L1 cache, but it's likely worth re-confirming given the updated broadcast sequence should be ~5 cycles.

I don't want to involve a temp GPR reg here - we'd better leave it for RA for something more useful, it is also the reason why I handle the remainder via SIMD

LLVM seems also to prefer movaps even for AVX512 - https://godbolt.org/z/xofK73191

@EgorBo EgorBo merged commit b4e14b4 into dotnet:main May 3, 2023
@EgorBo
Copy link
Member Author

EgorBo commented May 3, 2023

Failure is #85637

@EgorBo EgorBo deleted the cleanup-blk-fill branch May 3, 2023 19:19
@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label May 5, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants