Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add early length checks to TryParse implementations #491

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

psychon
Copy link
Owner

@psychon psychon commented Jun 19, 2020

Most TryParse implementations that we have contain a series of calls to
other try_parse() implementations. Most of these are for primitive
types, e.g. u32. Thus, some rough sketch of the code is this:

let (a, remaining) = u32::try_parse(remaining)?;
let (b, remaining) = u32::try_parse(remaining)?;
let (c, remaining) = u32::try_parse(remaining)?;
let (d, remaining) = u32::try_parse(remaining)?;

This is relatively readable code. We do not have to mess round with byte
offsets or something like that. However, every single call to
u32::try_parse() checks if there are another 32 bit of data available.
This leads to lots of repetitions in the assembly code, because these
length checks are not merged.

This commit changes the code generator to compute the minimum size that
some object can have on the wire. This minimum size is the sum of all
fixed-size data. Basically, this means that lists are assumed to be
empty.

Then, the generated code checks if at least the minimum number of bytes
is available. If not, an early returns is done.

This early length check allows the compiler (in release mode) to
optimise away most later length checks, thus resulting in less assembly
code.

The effect of this commit was measured in two ways.

  1. "cargo rustc --release -- --Copt-level=s --emit asm" allows to get an
    assembly file for the library. Before this commit, this assembly file
    has 159662 lines. Afterwards, it has 157796 lines. This is a
    reduction of about 1.2 %.
  2. "cargo build --release --example xclock_utc" and calling "strip" on
    the resulting binary previously produced a file with 503 KiB. After
    this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is
    saved.

Related-to: #488
Signed-off-by: Uli Schlachter [email protected]


On one hand, the results for this change are a bit underwhelming. On the other hand, this is a relative simple change. Saving 4 KiB just by doing more length checks is a large reduction for little work. Dunno if this PR is worth it, hence keeping it as draft status. Hopefully I can come up with more PRs like this and we can see how much all of this brings.

Most TryParse implementations that we have contain a series of calls to
other try_parse() implementations. Most of these are for primitive
types, e.g. u32. Thus, some rough sketch of the code is this:

  let (a, remaining) = u32::try_parse(remaining)?;
  let (b, remaining) = u32::try_parse(remaining)?;
  let (c, remaining) = u32::try_parse(remaining)?;
  let (d, remaining) = u32::try_parse(remaining)?;

This is relatively readable code. We do not have to mess round with byte
offsets or something like that. However, every single call to
u32::try_parse() checks if there are another 32 bit of data available.
This leads to lots of repetitions in the assembly code, because these
length checks are not merged.

This commit changes the code generator to compute the minimum size that
some object can have on the wire. This minimum size is the sum of all
fixed-size data. Basically, this means that lists are assumed to be
empty.

Then, the generated code checks if at least the minimum number of bytes
is available. If not, an early returns is done.

This early length check allows the compiler (in release mode) to
optimise away most later length checks, thus resulting in less assembly
code.

The effect of this commit was measured in two ways.

1. "cargo rustc --release -- --Copt-level=s --emit asm" allows to get an
   assembly file for the library. Before this commit, this assembly file
   has 159662 lines. Afterwards, it has 157796 lines. This is a
   reduction of about 1.2 %.
2. "cargo build --release --example xclock_utc" and calling "strip" on
   the resulting binary previously produced a file with 503 KiB. After
   this commit, the file has only 499 KiB. Thus, about 4 KiB or 1 % is
   saved.

Related-to: #488
Signed-off-by: Uli Schlachter <[email protected]>
@khuey
Copy link
Contributor

khuey commented Jun 19, 2020

Frankly it looks like rustc/llvm are doing a terrible job here. Here's one I pulled at random from a build (without your PR) with nightly rustc.

_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17hb503842b081124b2E:
        movq    %rdi, %rax
        shrq    $40, %rsi
        cmpb    $5, %sil
        jne     .LBB4888_13
        cmpq    $2, %rcx
        jb      .LBB4888_16
        je      .LBB4888_16
        cmpq    $3, %rcx
        je      .LBB4888_16
        cmpq    $4, %rcx
        je      .LBB4888_16
        cmpb    $0, 4(%rdx)
        setne   %sil
        cmpq    $5, %rcx
        je      .LBB4888_16
        movb    5(%rdx), %dil
        cmpb    $4, %dil
        jae     .LBB4888_13
        cmpq    $6, %rcx
        je      .LBB4888_16
        cmpq    $7, %rcx
        je      .LBB4888_16
        cmpq    $8, %rcx
        je      .LBB4888_16
        cmpq    $9, %rcx
        je      .LBB4888_16
        andq    $-2, %rcx
        cmpq    $10, %rcx
        jne     .LBB4888_12
.LBB4888_16:
        movb    $0, 1(%rax)
        movb    $1, %cl
        movb    %cl, (%rax)
        retq
.LBB4888_13:
        movb    $1, 1(%rax)
        movb    $1, %cl
        movb    %cl, (%rax)
        retq
.LBB4888_12:
        movzwl  (%rdx), %ecx
        movb    2(%rdx), %r8b
        movb    3(%rdx), %r9b
        movb    6(%rdx), %r10b
        cmpb    $0, 9(%rdx)
        movzwl  10(%rdx), %edx
        movw    %cx, 2(%rax)
        movw    %dx, 4(%rax)
        movb    %r8b, 6(%rax)
        movb    %r9b, 7(%rax)
        movb    %sil, 8(%rax)
        movb    %dil, 9(%rax)
        movb    %r10b, 10(%rax)
        setne   11(%rax)
        xorl    %ecx, %ecx
        movb    %cl, (%rax)
        retq

There's a lot of sadness here. Things that jump out at me:

  • That a sequential jb <label>; je <label>; isn't condensed into a single jbe <label>; is absolutely wild.
  • That adjacent cmp $N, %rcx, je <label>; cmp $N+1, %rcx, je <label> aren't condensed into a single comparison is also very bad.

Less bad:

  • setne %sil could just write directly into 8(%rax) rather than a register.
  • Similarly this code could skip loading 5(%rdx) into %dil, though that may not actually be a perf win.
  • In the andq $-2, %rcx; cmpq $10, %rcx; jne <label> the and is totally unnecessary because we've already compared %rcx against anything below 10. This is admittedly harder for an optimizer to recognize. It's also not clear to me where this code even comes from.

So it seems like up fronts bounds checks would remove a lot of this badness, though really rustc/llvm should be doing better.

@khuey
Copy link
Contributor

khuey commented Jun 19, 2020

I filed rust-lang/rust#73512

@psychon
Copy link
Owner Author

psychon commented Jun 19, 2020

Here's one I pulled at random from a build (without your PR)

My PR actually misses the all the Request structs. So, this PR does not make a difference. Whoops.

with nightly rustc.

I get basically the same with 1.42.0.

When adding an explicit if value.len() < 16 { return Err(ParseError::ParseError); } at the beginning of that function, the code improves a bit and doesn't look too bad:

	.section	.text._ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,"ax",@progbits
	.globl	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
	.type	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E,@function
_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E:
	.cfi_startproc
	movq	%rdi, %rax
	cmpq	$16, %rcx
	jb	.LBB5284_2
	shrq	$40, %rsi
	cmpb	$5, %sil
	jne	.LBB5284_2
	movb	5(%rdx), %cl
	cmpb	$4, %cl
	jae	.LBB5284_2
	xorl	%r8d, %r8d
	cmpb	%r8b, 4(%rdx)
	movzwl	(%rdx), %edi
	movzwl	2(%rdx), %esi
	movb	6(%rdx), %r9b
	movb	9(%rdx), %r10b
	movzwl	10(%rdx), %edx
	setne	6(%rax)
	cmpb	%r8b, %r10b
	movw	%di, (%rax)
	movw	%dx, 2(%rax)
	movw	%si, 4(%rax)
	movb	%cl, 7(%rax)
	movb	%r9b, 8(%rax)
	setne	9(%rax)
	retq
.LBB5284_2:
	movb	$2, 6(%rax)
	retq
.Lfunc_end5284:
	.size	_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E, .Lfunc_end5284-_ZN5x11rb8protocol3xkb21LatchLockStateRequest17try_parse_request17h2dec0bd5f7c26762E
	.cfi_endproc

@khuey
Copy link
Contributor

khuey commented Jun 21, 2020

Rather surprisingly LLVM doesn't seem to make any effort to deduplicate basic blocks. This is at the end of opt -O3 on the IR that rustc produces

https://gist.github.com/khuey/eedf7183f427323d1c373c1ead25f01d

bb55, bb63, bb72, and bb80 are all effectively identical but LLVM refuses to condense them and eliminate the extra branches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants