Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A word-based binary integer encoding approach. #306

Merged
merged 11 commits into from
Dec 5, 2023

Conversation

oscbyspro
Copy link
Contributor

@oscbyspro oscbyspro commented Nov 1, 2023

Salutations! I come in peace ✌️

Note

This pull request builds on an earlier contribution.

Description

These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.

Measurements

I have published a small package with some benchmarks.

  • v0 : the old version
  • v0x: the old version + @inlinable
  • v1 : the 1st version
  • v1x: the 1st version + @inlinable (2x) + @usableFromInline (1x)
  • v2 : the 2nd version
  • v2x: the 2nd version + @inlinable (2x) + @usableFromInline (1x)

The results on MacBook Pro (13-inch, M1, 2020), Xcode 15.0.1, in seconds

v0 v0x v1 v1x v2 v2x Numberick Stdlib
Int 1.501 0.181 0.248 0.169 0.216 0.135 0.167 0.169
UInt 0.226 0.191 0.238 0.158 0.213 0.133 0.163 0.163
Int256 35.541 0.620 0.737 0.420 0.714 0.410 0.423 85.640
UInt256 29.107 0.569 0.684 0.409 0.676 0.399 0.419 84.488
Int(±1) 1.403 0.151 0.234 0.155 0.122 0.043 0.060 0.060
UInt(1) 0.189 0.156 0.222 0.143 0.119 0.039 0.060 0.060
1x fib 106 1.010 1.006 0.916 0.914 0.912 0.914 0.912 13.163

These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.
@itingliu
Copy link
Contributor

itingliu commented Nov 1, 2023

@wadetregaskis Do you want to take a look?

@wadetregaskis
Copy link
Contributor

@wadetregaskis Do you want to take a look?

Yeah, I'm happy to, but it probably won't be until later this week. I already looked over it a bit when it was sent to me originally, and I think overall it looks good. I might have some questions, but don't wait on me, if everyone else is happy with this patch already.

@iCharlesHu
Copy link
Contributor

@swift-ci please test

Copy link
Contributor

@wadetregaskis wadetregaskis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall, to me.

Since we're now focusing on performance, I spent a fair bit of time analysing the actual machine code (for x86-64) that this results in (in 'release' build configuration). It's impressive what the compiler is able to do to reduce all this down to some pretty simple and short loops.

Although if I see it use lea to decrement a value by one again, I shall be very annoyed. 😜

// - Int.init(some BinaryFloatingPoint) rounds to zero.
// - Double.init(exactly:) and UInt.init(_:) for correctness.
// - log10(2.0) is: 1.0011010001000001001101⌈01...⌉ * 2^(-2).
return Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work correctly for bitWidths of 1 or less. Those aren't values that will appear at runtime, but it might be worth putting an assert here to help ensure that just in case some future refactor exposes this to wider use.

And/or document this limitation in the function documentation.

Also, changing the parameter type to UInt will at least preclude negative values.

Copy link
Contributor Author

@oscbyspro oscbyspro Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns 1 for 0 and 1. 1 is an upper bound for the decimal digits required to represent the at-most-1-bit values 0 and 1. UInt(_:) traps negative values. I can add a comment about it. W.r.t. correctness it also traps for 2^53 or somesuch not representable by Double.

import Foundation // in Swift.playground

func maxDecimalDigitCount(bitWidth: Int) -> Int {
    Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1
}

for bitWidth in 0 ... UInt64.bitWidth {
    let maxCount = maxDecimalDigitCount(bitWidth: bitWidth)
    let maxValue = UInt64.max >> (UInt.max.bitWidth - bitWidth)
    precondition(String(maxValue, radix: 10).count <= maxCount)
    precondition(maxValue.bitWidth - maxValue.leadingZeroBitCount == bitWidth)
    precondition(maxValue.nonzeroBitCount == bitWidth)
    print("bitWidth: \(bitWidth), maxCount: \(maxCount), maxValue: \(maxValue)")
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apologise, you're quite right. I misunderstood the algorithm here (I think re. the rounding behaviour). I've now tested it empirically and indeed it works as intended.

The value of 1 for bit width 0 is okay - that case shouldn't ever arise in practice anyway - although a precondition(0 < bitWidth, …) might be wise (or an assert).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The zero bit width case occurs when an integer's words collection is empty. I don't have an integer like that, but you could perhaps justify an unsigned big integer normalizing zero to []. In any case, I have unit tests for empty collections, which result in "0".

var remainder = UInt.zero

for index in (dividend.startIndex ..< dividend.endIndex).reversed() {
(dividend.base[index], remainder) = divisor.dividingFullWidth((high: remainder, low: dividend.base[index]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a shame that this results in the use of udivti3 from compiler-rt instead of a simple divq instruction (on x86-64; presumably there’s an equivalent instruction on AArch64). Evidently the compiler doesn’t believe that the result will never overflow (udivti3 does the extra work of checking for and trapping on overflow, which is a waste here).

Perhaps there's some way to convince the compiler that the checks are unnecessary? Overflow should be impossible in practice, by the inherent constraints of its use here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an astute observation! I'd love to know if there's a better way to do it. I have an open issue in Numberick (#59) about division being slower than expected in some cases. I wrote a note about udivmodti4 taking up most of the time in Instruments, but I did not know how to proceed beyond that. It's a bit funky that it performs oveflow checking when it truncates the result anyways:

print(UInt(123).dividingFullWidth((high: 122, low: UInt.max))) // (quotient: UInt.max, remainder: 122)
print(UInt(123).dividingFullWidth((high: 123, low: 00000000))) // (quotient: 00000000, remainder: 000)

Copy link
Contributor

@wadetregaskis wadetregaskis Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken about __udivti3 re. the overflow case. It's doesn't actually trap, it just truncates, as you observed. That's what one gets for trusting StackOverflow. 😝

For completeness… it's essentially using __udivmodti4 because __udivti3 is basically an alias; it's two instructions: one to zero out r8 to effect a NULL for the out parameter that reports the remainder, and then a jump to __udivmodti4 (which incidentally is a waste of space since __udivmodti4 is the very next procedure, so it'd flow into it automatically without the jmp).

__udivmodti4 is a little complicated, both at the C level and in the resulting assembly (67 instructions, on x86-64, and branchy). It does use divq, ultimately (it's actually hard-coded to on x86-64, via inline assembly). But you still have the overhead in the function call and unnecessary checks to use other paths - it really would benefit from being inlined. I don't know how to make the compiler inline it, though.

Which raises the question: why on earth isn't it just using __udivmodti4 directly, since we do care about the remainder here. It appears that it's using the result of the division to calculate the remainder manually:

cmp  r12, rdi ; Check that the index is > 0.  On first entry to this loop it's endIndex.
jle  loc_19eb ; Basically, crash if index is <= 0.
lea  rbx, qword [r12-1] ; Decrement index of current word in `words`.
mov  r13, qword [r15+r12*8-8] ; Load next word from `words` argument.
mov  rdi, r13 ; low half of 128-bit argument "a" (dividend) for __udivti3.  rsi is high half, and is initially set to zero but is the remainder in subsequent iterations.
mov  rdx, r14 ; low half of 128-bit argument "b" (divisor) for __udivti3.
xor  ecx, ecx ; high half of argument "b".  Top 32 bits are known zero from earlier in the function.
call __udivti3 ; rax = a / b
mov  rdi, qword [rbp+var_38] ; words.count.
mov  rcx, rax
mul  r14 ; Quotient * divisor (the function argument, not the argument to __udivti3.
mov  rsi, r13 ; The aforementioned next word from `words`.
sub  rsi, rax ; Remainder = current word - (Quotient * divisor)
mov  qword [r15+r12*8-8], rcx ; Save the quotient over the current word in `words`.
mov  r12, rbx ; Just shifting the current index around registers.
cmp  rdi, rbx ; Check that index is != words.startIndex.
jne  loc_19a0 ; Repeat this whole block while it isn't.

I don't see how that could be faster than using __udivmodti4, since the latter calculates the remainder no matter what - it merely drops it silently if its 'rem' out argument is NULL, as it is in the use via __udivti3. Weird. Seems like an optimiser bug.

@oscbyspro
Copy link
Contributor Author

oscbyspro commented Nov 9, 2023

Oh, neat. I appreciate your feedback. I'll address each point in time.

Note

The algorithm is fundamentally sign-magnitude, and it would be simpler to call magnitude.words and skip the form-two's-complement method. I have tried it, and it performs similarly when inlinable, but DoubleWidth<T> performs 3x worse when warring against module boundaries, compared to calling self.words.

@itingliu
Copy link
Contributor

@swift-ci please test

oscbyspro and others added 4 commits November 10, 2023 09:13
Some comments w.r.t. expecting Swift.BinaryInteger semantics plus some formatting.

Co-authored-by: Wade Tregaskis <[email protected]>
Co-authored-by: Tina Liu <[email protected]>
The withUnsafeTemporaryAllocation(of:capacity:) makes no guarantees about the size of the allocated buffer pointer, even though the current implementation covers exactly the requested number of elements.
@oscbyspro
Copy link
Contributor Author

oscbyspro commented Nov 10, 2023

Hm. The documentation of Swift.withUnsafeTemporaryAllocation(of:capacity:) states:

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?


It sounds like it refers to the buffer in the closure because that's how Swift uses the term capacity elsewhere:

for capacity in 0 ... 32 {
    String(unsafeUninitializedCapacity: capacity) {
        print("capacity: \(capacity), count: \($0.count)")
        precondition($0.count == Swift.max(16, capacity))
        return 0
    }
}

Copy link
Contributor

@itingliu itingliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glessard Do you want to take a look, specifically about the usage around the unsafemutableBufferPointer? And also if we should consider sinking these into stdlib?


// Add a minus sign to negative values.
if isLessThanZero {
ascii.formIndex(before: &asciiIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my information how do we know that asciiIndex is guaranteed to be >= ascii.startIndex, and this surely doesn't go out of bound?

Copy link
Contributor Author

@oscbyspro oscbyspro Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't see your question earlier. The capacity is given by the following expression so please have a look at number-of-decimal-digits-in-a-binary-integer.

let capacity = maxDecimalDigitCountForUnsignedInteger(bitWidth: words.count * UInt.bitWidth) + (isLessThanZero ? 1 : 0)

Since capacity is only an upper bound, I've used the fullest possible width as an uncomplicated approximaton (it can be trimmed a bit). In the loop, we then divide by 10 repeatedly and write each digit back-to-front. The loop is on repeat, so we write at least one. The minimum capacity is 1 when words.count == 0, and 1 is added when isLessThanZero == true.

let isLessThanZero = isSigned && Int(bitPattern: words.last ?? 0) < 0 // tests the last bit when isSigned == true

Important edge case thing-y

The loop branchlessly decrements asciiIndex by radix.exponent (the digits written plus zeros for padding) each iteration, so it might be negative at the end of the loop. This is one reason why we set it to writeIndex afterwards, the other is trimming the last iteration's padding. This assumes Index == Int (alternatives exist).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, no index goes out of bounds or assumes typing in v2. I've also replaced all assertions with preconditions since the difference in performance was insignificant.

@itingliu
Copy link
Contributor

@swift-ci please test

@wadetregaskis
Copy link
Contributor

wadetregaskis commented Nov 13, 2023

Hm. The documentation of Swift.withUnsafeTemporaryAllocation(of:capacity:) states:

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?

It sounds like it refers to the buffer in the closure because that's how Swift uses the term capacity elsewhere:

for capacity in 0 ... 32 {
    String(unsafeUninitializedCapacity: capacity) {
        print("capacity: \(capacity), count: \($0.count)")
        precondition($0.count == Swift.max(16, capacity))
        return 0
    }
}

I believe it means that $0.count is guaranteed to be >= capacity, not merely ==.

It makes sense as such. In the String case specifically, hypothetically for a capacity <= 15 it can provide a buffer pointing to the actual stack space allocated for the String (using the String's "inline" form to avoid a heap allocation), so it might as well just pass 15 in that case. And of course for its non-inline forms its heap allocations have to be multiples of 16 at least (malloc on Apple platforms has never supported allocation granularities smaller than 16 bytes).

It turns out the current implementation of String does do this optimisation - of directly initialising the stack memory for capacities <= 15.

For withUnsafeTemporaryAllocation(of:capacity:) specifically, it probably has to preserve stack alignment - so the actual capacity will likely always be some multiple of 8 or whatever the alignment requirements are - and of course if it does do a heap allocation it's back to malloc slab sizes.

Talking in 64-bit here. 32-bit the same principles apply but the numbers might be different.

@oscbyspro
Copy link
Contributor Author

I have added a signed and unsigned tiny integer benchmark to see whether fast pathing is worth it. I've also measured an @inlinable version of the original implementation to see what is and isn't a module boundary issue. The original fast path does, indeed, improve the non-@inlinable performance of tiny integers, but Swift's standard library beats it. It seems, to me, that returning a String and stack allocating small integers is more impactful.

Note

I've changed to Xcode 15.0.1, so everything is faster.

@glessard
Copy link
Contributor

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?

I think this is a leak of implementation details, coupled with an API documentation deserving some improvements.

@oscbyspro
Copy link
Contributor Author

oscbyspro commented Nov 15, 2023

Quick update: I'm working on a new version that should be speedy for small integers. I'll try to publish it within the next couple of days. It looks promising. It involves returning a String instead of an ArraySlice, so I hope that's OK. The String's ASCII sequence can be accessed using withUTF8(_:).

- I changed the return type from ArraySlice to String.
- I changed the ICU number formatter's parameter from ArraySlice to String.
- I changed the structure, although it's similar to `v1`.
- I added `nextUp` to `log10(2)` because it's not obvious that it rounds up (it does).
@oscbyspro
Copy link
Contributor Author

I'm back! I've updated the pull request to v2 (in the original post).

Summary

The return type is String, and tiny integers now beat the standard library when @inlinable.

Changes

  • I changed the return type from ArraySlice to String.
  • I changed the ICU number formatter's parameter from ArraySlice to String.
  • I changed the structure, although it's similar to v1.
  • I added nextUp to log10(2) because it's not obvious that it rounds up (it does).

Comments

The ICU format is undocumented, and value: String does not indicate ASCII.

@oscbyspro
Copy link
Contributor Author

oscbyspro commented Nov 16, 2023

As a side note, I saw that BinaryInteger's formatted() method casts to Int, trapping big integers:

extension BinaryInteger {

    /// Format `self` using `IntegerFormatStyle()`
    public func formatted() -> String {
        IntegerFormatStyle().format(Int(self))
    }
}

@oscbyspro
Copy link
Contributor Author

I have added the 1,000,000th Fibonacci sequence element (210k decimal digits) to the table 😃

@glessard
Copy link
Contributor

@swift-ci please test

- Slicing (rebasing) the magnitude per review comments.
- Changed division parameter to buffer from buffer slice.
- Changed assertions to preconditions because the difference in performance is small.
- UnsafeMutableBufferPointer.init(start:count:) takes an optional (no force unwrapping).
@oscbyspro
Copy link
Contributor Author

I sense no enthusiasm—and nobody has made it a priority—should I close it? I know  has a habit of keeping things in limbo, but I don't. I'd much rather cross this PR off my list and go somewhere my contributions are appreciated 🤷‍♂️

Copy link
Contributor

@itingliu itingliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delayed reply. I personally do want to take this one ✌️

@swift-ci please test

@itingliu
Copy link
Contributor

itingliu commented Dec 4, 2023

@swift-ci please test

@itingliu itingliu merged commit 845652a into swiftlang:main Dec 5, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants