JIT: Improve block unrolling, enable AVX-512 #85501
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

This PR does 3 things:
void Test(Span<byte> span) => span.Slice(0, 32).Fill(1);

; Method Proga:Test(System.Span`1[ubyte]):this
sub rsp, 40
vzeroupper
cmp dword ptr [rdx+08H], 32
jb SHORT G_M61135_IG04
mov rax, bword ptr [rdx]
- mov rdx, 0x101010101010101
- vmovd xmm0, rdx
- vpunpckldq xmm0, xmm0
- vinsertf128 ymm0, ymm0, xmm0, 1
+ vmovups ymm0, ymmword ptr [reloc @RWD00]
vmovdqu ymmword ptr [rax], ymm0
add rsp, 40
ret
G_M61135_IG04:
call [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
int3
-; Total bytes of code: 57
+RWD00 dq 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
+; Total bytes of code: 40
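
For context, a minimal C# sketch (illustrative only, not the JIT's internal code; the class and method names are made up) of what the rewritten 32-byte case corresponds to: a single 32-byte vector of 0x01 bytes (the RWD00 constant) stored once via the Vector256 API.

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Fill32Sketch
{
    // Hypothetical counterpart of Test(span) => span.Slice(0, 32).Fill(1).
    static void Fill32(Span<byte> span)
    {
        Span<byte> dst = span.Slice(0, 32);             // bounds check (the cmp/jb above)
        Vector256<byte> v = Vector256.Create((byte)1);  // the 32-byte 0x01...01 constant (RWD00)
        ref byte p = ref MemoryMarshal.GetReference(dst);
        v.StoreUnsafe(ref p);                           // vmovdqu ymmword ptr [rax], ymm0
    }
}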
void Test(Span<byte> span) => span.Slice(0, 256).Fill(1);

; Method Proga:Test(System.Span`1[ubyte]):this
sub rsp, 40
vzeroupper
cmp dword ptr [rdx+08H], 256
jb SHORT G_M61135_IG04
mov rax, bword ptr [rdx]
vmovups zmm0, zmmword ptr [reloc @RWD00]
vmovdqu32 zmmword ptr [rax], zmm0
vmovdqu32 zmmword ptr [rax+40H], zmm0
vmovdqu32 zmmword ptr [rax+80H], zmm0
vmovdqu32 zmmword ptr [rax+C0H], zmm0
add rsp, 40
ret
G_M61135_IG04:
call [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
int3
RWD00 dq 0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h,
0101010101010101h, 0101010101010101h, 0101010101010101h, 0101010101010101h
; Total bytes of code: 68
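
And a similar sketch for the unrolled 256-byte case (again illustrative, assuming the Vector512 API available in .NET 8; names are hypothetical): one 64-byte constant materialized once and stored four times at increasing offsets.

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class Fill256Sketch
{
    // Hypothetical counterpart of Test(span) => span.Slice(0, 256).Fill(1).
    static void Fill256(Span<byte> span)
    {
        Span<byte> dst = span.Slice(0, 256);            // bounds check (the cmp/jb above)
        Vector512<byte> v = Vector512.Create((byte)1);  // the 64-byte RWD00 constant
        ref byte p = ref MemoryMarshal.GetReference(dst);
        v.StoreUnsafe(ref p, (nuint)0);    // vmovdqu32 zmmword ptr [rax], zmm0
        v.StoreUnsafe(ref p, (nuint)64);   // vmovdqu32 zmmword ptr [rax+40H], zmm0
        v.StoreUnsafe(ref p, (nuint)128);  // vmovdqu32 zmmword ptr [rax+80H], zmm0
        v.StoreUnsafe(ref p, (nuint)192);  // vmovdqu32 zmmword ptr [rax+C0H], zmm0
    }
}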
Interesting.
Can you clarify the question? I'm not sure I understand. We already have a SIMD reg populated with the fill value we used previously. We can either use it to handle the remainder or populate a new GPR reg with the value. A SIMD store might receive a penalty by crossing a cache/page boundary where a smaller GPR store won't, but at the same time we don't request precious GPR regs from the LSRA allocator and the codegen is typically smaller.
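
A small sketch of the remainder idea described here (an illustration of the technique, not the JIT's implementation; names are made up): when the length is not a multiple of the vector width, the already-populated SIMD register can be reused for one final store that overlaps the previously written bytes, instead of finishing with GPR-sized stores.

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class RemainderSketch
{
    // Assumes dst.Length >= 32; illustrative only.
    static void FillWithOverlappingTail(Span<byte> dst, byte value)
    {
        Vector256<byte> v = Vector256.Create(value);
        ref byte p = ref MemoryMarshal.GetReference(dst);
        nuint len = (nuint)dst.Length;
        nuint i = 0;

        // Full 32-byte stores.
        for (; i + 32 <= len; i += 32)
            v.StoreUnsafe(ref p, i);

        // Remainder: one overlapping 32-byte store that ends exactly at len,
        // reusing the SIMD register instead of requesting a temp GPR.
        if (i < len)
            v.StoreUnsafe(ref p, len - 32);
    }
}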
@omariom If you mean whether

- mov rdx, 0x101010101010101
- vmovd xmm0, rdx
- vpunpckldq xmm0, xmm0
- vinsertf128 ymm0, ymm0, xmm0, 1
+ vmovups ymm0, ymmword ptr [reloc @RWD00]

? Then yes, it is faster. It is also what we always do for any … LLVM-MCA for Skylake:

[llvm-mca output for the old sequence] vs [llvm-mca output for the new sequence]
/azp list
/azp run runtime-coreclr outerloop, runtime-coreclr jitstressregs
Azure Pipelines successfully started running 2 pipeline(s).
@dotnet/jit-contrib PTAL, small refactoring, diffs - outerloop/jitstressregs passed
@EgorBo, could you adjust the diffs given above slightly? I recently fixed the codegen to allow using:

; Previous
mov rdx, 0x101010101010101
vmovd xmm0, rdx
vpunpckldq xmm0, xmm0
vinsertf128 ymm0, ymm0, xmm0, 1

; on AVX capable
mov edx, 0x01010101
vmovd xmm0, edx
vpbroadcastd ymm0, xmm0

; on AVX512 capable
mov edx, 0x01010101
vpbroadcastd ymm0, edx

The …
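
For anyone wanting to check which of these sequences their JIT emits, a tiny probe like the following can be disassembled (hypothetical names; whether a constant ends up broadcast or loaded straight from the data section is a codegen choice and may differ from the non-constant case).

using System;
using System.Runtime.Intrinsics;

static class BroadcastProbe
{
    // Constant fill value: the JIT may materialize this as a data-section constant
    // or as one of the broadcast sequences shown above.
    static Vector256<byte> ConstantOnes() => Vector256.Create((byte)1);

    // Non-constant fill value: this has to be broadcast from a register at run time.
    static Vector256<byte> Broadcast(byte value) => Vector256.Create(value);
}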
I don't want to involve a temp GPR reg here - we'd better leave it to the RA for something more useful; it is also the reason why I handle the remainder via SIMD.
LLVM also seems to prefer movaps even for AVX512 - https://godbolt.org/z/xofK73191
Failure is #85637
This PR does 3 things: … (might still leave the previous logic for zeroing, since xor gpr, gpr is cheap)
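
Relating to the zeroing note, a hedged illustration (hypothetical names): unlike the Fill(1) cases above, the zero case needs no data-section constant at all, because a zero is free to materialize in a register (xor gpr, gpr, or vxorps for a SIMD register), which is why the previous logic may be kept for it.

using System;

static class ZeroSketch
{
    // Hypothetical counterpart of the Fill(1) examples above, but clearing to zero.
    static void Zero(Span<byte> span) => span.Slice(0, 32).Clear();
}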