-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add early length checks to TryParse implementations #491
base: master
Are you sure you want to change the base?
Conversation
Most TryParse implementations that we have contain a series of calls to other try_parse() implementations. Most of these are for primitive types, e.g. u32. Thus, some rough sketch of the code is this: let (a, remaining) = u32::try_parse(remaining)?; let (b, remaining) = u32::try_parse(remaining)?; let (c, remaining) = u32::try_parse(remaining)?; let (d, remaining) = u32::try_parse(remaining)?; This is relatively readable code. We do not have to mess round with byte offsets or something like that. However, every single call to u32::try_parse() checks if there are another 32 bit of data available. This leads to lots of repetitions in the assembly code, because these length checks are not merged. This commit changes the code generator to compute the minimum size that some object can have on the wire. This minimum size is the sum of all fixed-size data. Basically, this means that lists are assumed to be empty. Then, the generated code checks if at least the minimum number of bytes is available. If not, an early returns is done. This early length check allows the compiler (in release mode) to optimise away most later length checks, thus resulting in less assembly code. The effect of this commit was measured in two ways. 1. "cargo rustc --release -- --Copt-level=s --emit asm" allows to get an assembly file for the library. Before this commit, this assembly file has 159662 lines. Afterwards, it has 157796 lines. This is a reduction of about 1.2 %. 2. "cargo build --release --example xclock_utc" and calling "strip" on the resulting binary previously produced a file with 503 KiB. After this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is saved. Related-to: #488 Signed-off-by: Uli Schlachter <[email protected]>
Frankly it looks like rustc/llvm are doing a terrible job here. Here's one I pulled at random from a build (without your PR) with nightly rustc. _ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17hb503842b081124b2E:
movq %rdi, %rax
shrq $40, %rsi
cmpb $5, %sil
jne .LBB4888_13
cmpq $2, %rcx
jb .LBB4888_16
je .LBB4888_16
cmpq $3, %rcx
je .LBB4888_16
cmpq $4, %rcx
je .LBB4888_16
cmpb $0, 4(%rdx)
setne %sil
cmpq $5, %rcx
je .LBB4888_16
movb 5(%rdx), %dil
cmpb $4, %dil
jae .LBB4888_13
cmpq $6, %rcx
je .LBB4888_16
cmpq $7, %rcx
je .LBB4888_16
cmpq $8, %rcx
je .LBB4888_16
cmpq $9, %rcx
je .LBB4888_16
andq $-2, %rcx
cmpq $10, %rcx
jne .LBB4888_12
.LBB4888_16:
movb $0, 1(%rax)
movb $1, %cl
movb %cl, (%rax)
retq
.LBB4888_13:
movb $1, 1(%rax)
movb $1, %cl
movb %cl, (%rax)
retq
.LBB4888_12:
movzwl (%rdx), %ecx
movb 2(%rdx), %r8b
movb 3(%rdx), %r9b
movb 6(%rdx), %r10b
cmpb $0, 9(%rdx)
movzwl 10(%rdx), %edx
movw %cx, 2(%rax)
movw %dx, 4(%rax)
movb %r8b, 6(%rax)
movb %r9b, 7(%rax)
movb %sil, 8(%rax)
movb %dil, 9(%rax)
movb %r10b, 10(%rax)
setne 11(%rax)
xorl %ecx, %ecx
movb %cl, (%rax)
retq There's a lot of sadness here. Things that jump out at me:
Less bad:
So it seems like up fronts bounds checks would remove a lot of this badness, though really rustc/llvm should be doing better. |
I filed rust-lang/rust#73512 |
My PR actually misses the all the
I get basically the same with 1.42.0. When adding an explicit .section .text._ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,"ax",@progbits
.globl _ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
.type _ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,@function
_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E:
.cfi_startproc
movq %rdi, %rax
cmpq $16, %rcx
jb .LBB5284_2
shrq $40, %rsi
cmpb $5, %sil
jne .LBB5284_2
movb 5(%rdx), %cl
cmpb $4, %cl
jae .LBB5284_2
xorl %r8d, %r8d
cmpb %r8b, 4(%rdx)
movzwl (%rdx), %edi
movzwl 2(%rdx), %esi
movb 6(%rdx), %r9b
movb 9(%rdx), %r10b
movzwl 10(%rdx), %edx
setne 6(%rax)
cmpb %r8b, %r10b
movw %di, (%rax)
movw %dx, 2(%rax)
movw %si, 4(%rax)
movb %cl, 7(%rax)
movb %r9b, 8(%rax)
setne 9(%rax)
retq
.LBB5284_2:
movb $2, 6(%rax)
retq
.Lfunc_end5284:
.size _ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E, .Lfunc_end5284-_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
.cfi_endproc |
Rather surprisingly LLVM doesn't seem to make any effort to deduplicate basic blocks. This is at the end of https://gist.github.com/khuey/eedf7183f427323d1c373c1ead25f01d bb55, bb63, bb72, and bb80 are all effectively identical but LLVM refuses to condense them and eliminate the extra branches. |
Most TryParse implementations that we have contain a series of calls to
other try_parse() implementations. Most of these are for primitive
types, e.g. u32. Thus, some rough sketch of the code is this:
let (a, remaining) = u32::try_parse(remaining)?;
let (b, remaining) = u32::try_parse(remaining)?;
let (c, remaining) = u32::try_parse(remaining)?;
let (d, remaining) = u32::try_parse(remaining)?;
This is relatively readable code. We do not have to mess round with byte
offsets or something like that. However, every single call to
u32::try_parse() checks if there are another 32 bit of data available.
This leads to lots of repetitions in the assembly code, because these
length checks are not merged.
This commit changes the code generator to compute the minimum size that
some object can have on the wire. This minimum size is the sum of all
fixed-size data. Basically, this means that lists are assumed to be
empty.
Then, the generated code checks if at least the minimum number of bytes
is available. If not, an early returns is done.
This early length check allows the compiler (in release mode) to
optimise away most later length checks, thus resulting in less assembly
code.
The effect of this commit was measured in two ways.
assembly file for the library. Before this commit, this assembly file
has 159662 lines. Afterwards, it has 157796 lines. This is a
reduction of about 1.2 %.
the resulting binary previously produced a file with 503 KiB. After
this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is
saved.
Related-to: #488
Signed-off-by: Uli Schlachter [email protected]
On one hand, the results for this change are a bit underwhelming. On the other hand, this is a relative simple change. Saving 4 KiB just by doing more length checks is a large reduction for little work. Dunno if this PR is worth it, hence keeping it as draft status. Hopefully I can come up with more PRs like this and we can see how much all of this brings.