A word-based binary integer encoding approach. #306

oscbyspro · 2023-11-01T17:59:35Z

Salutations! I come in peace ✌️

Note

This pull request builds on an earlier contribution.

Description

These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.

Measurements

I have published a small package with some benchmarks.

v0 : the old version
v0x: the old version + @inlinable
v1 : the 1st version
v1x: the 1st version + @inlinable (2x) + @usableFromInline (1x)
v2 : the 2nd version
v2x: the 2nd version + @inlinable (2x) + @usableFromInline (1x)

The results on MacBook Pro (13-inch, M1, 2020), Xcode 15.0.1, in seconds

	v0	v0x	v1	v1x	v2	v2x	Numberick	Stdlib
Int	1.501	0.181	0.248	0.169	0.216	0.135	0.167	0.169
UInt	0.226	0.191	0.238	0.158	0.213	0.133	0.163	0.163
Int256	35.541	0.620	0.737	0.420	0.714	0.410	0.423	85.640
UInt256	29.107	0.569	0.684	0.409	0.676	0.399	0.419	84.488
Int(±1)	1.403	0.151	0.234	0.155	0.122	0.043	0.060	0.060
UInt(1)	0.189	0.156	0.222	0.143	0.119	0.039	0.060	0.060
1x fib 10⁶	1.010	1.006	0.916	0.914	0.912	0.914	0.912	13.163

These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.

itingliu · 2023-11-01T18:27:04Z

@wadetregaskis Do you want to take a look?

wadetregaskis · 2023-11-01T22:48:04Z

@wadetregaskis Do you want to take a look?

Yeah, I'm happy to, but it probably won't be until later this week. I already looked over it a bit when it was sent to me originally, and I think overall it looks good. I might have some questions, but don't wait on me, if everyone else is happy with this patch already.

iCharlesHu · 2023-11-07T22:21:56Z

@swift-ci please test

wadetregaskis

Looks pretty good overall, to me.

Since we're now focusing on performance, I spent a fair bit of time analysing the actual machine code (for x86-64) that this results in (in 'release' build configuration). It's impressive what the compiler is able to do to reduce all this down to some pretty simple and short loops.

Although if I see it use lea to decrement a value by one again, I shall be very annoyed. 😜

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

wadetregaskis · 2023-11-09T06:32:58Z

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

+    // - Int.init(some BinaryFloatingPoint) rounds to zero.
+    // - Double.init(exactly:) and UInt.init(_:) for correctness.
+    // - log10(2.0) is: 1.0011010001000001001101⌈01...⌉ * 2^(-2).
+    return Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1


This doesn't work correctly for bitWidths of 1 or less. Those aren't values that will appear at runtime, but it might be worth putting an assert here to help ensure that just in case some future refactor exposes this to wider use.

And/or document this limitation in the function documentation.

Also, changing the parameter type to UInt will at least preclude negative values.

The function returns 1 for 0 and 1. 1 is an upper bound for the decimal digits required to represent the at-most-1-bit values 0 and 1. UInt(_:) traps negative values. I can add a comment about it. W.r.t. correctness it also traps for 2^53 or somesuch not representable by Double.

import Foundation // in Swift.playground func maxDecimalDigitCount(bitWidth: Int) -> Int { Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1 } for bitWidth in 0 ... UInt64.bitWidth { let maxCount = maxDecimalDigitCount(bitWidth: bitWidth) let maxValue = UInt64.max >> (UInt.max.bitWidth - bitWidth) precondition(String(maxValue, radix: 10).count <= maxCount) precondition(maxValue.bitWidth - maxValue.leadingZeroBitCount == bitWidth) precondition(maxValue.nonzeroBitCount == bitWidth) print("bitWidth: \(bitWidth), maxCount: \(maxCount), maxValue: \(maxValue)") }

I apologise, you're quite right. I misunderstood the algorithm here (I think re. the rounding behaviour). I've now tested it empirically and indeed it works as intended.

The value of 1 for bit width 0 is okay - that case shouldn't ever arise in practice anyway - although a precondition(0 < bitWidth, …) might be wise (or an assert).

The zero bit width case occurs when an integer's words collection is empty. I don't have an integer like that, but you could perhaps justify an unsigned big integer normalizing zero to []. In any case, I have unit tests for empty collections, which result in "0".

wadetregaskis · 2023-11-09T07:21:41Z

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

+    var remainder = UInt.zero
+
+    for index in (dividend.startIndex ..< dividend.endIndex).reversed() {
+        (dividend.base[index], remainder) = divisor.dividingFullWidth((high: remainder, low: dividend.base[index]))


It’s a shame that this results in the use of udivti3 from compiler-rt instead of a simple divq instruction (on x86-64; presumably there’s an equivalent instruction on AArch64). Evidently the compiler doesn’t believe that the result will never overflow (udivti3 does the extra work of checking for and trapping on overflow, which is a waste here).

Perhaps there's some way to convince the compiler that the checks are unnecessary? Overflow should be impossible in practice, by the inherent constraints of its use here.

That's an astute observation! I'd love to know if there's a better way to do it. I have an open issue in Numberick (#59) about division being slower than expected in some cases. I wrote a note about udivmodti4 taking up most of the time in Instruments, but I did not know how to proceed beyond that. It's a bit funky that it performs oveflow checking when it truncates the result anyways:

print(UInt(123).dividingFullWidth((high: 122, low: UInt.max))) // (quotient: UInt.max, remainder: 122) print(UInt(123).dividingFullWidth((high: 123, low: 00000000))) // (quotient: 00000000, remainder: 000)

I was mistaken about __udivti3 re. the overflow case. It's doesn't actually trap, it just truncates, as you observed. That's what one gets for trusting StackOverflow. 😝

For completeness… it's essentially using __udivmodti4 because __udivti3 is basically an alias; it's two instructions: one to zero out r8 to effect a NULL for the out parameter that reports the remainder, and then a jump to __udivmodti4 (which incidentally is a waste of space since __udivmodti4 is the very next procedure, so it'd flow into it automatically without the jmp).

__udivmodti4 is a little complicated, both at the C level and in the resulting assembly (67 instructions, on x86-64, and branchy). It does use divq, ultimately (it's actually hard-coded to on x86-64, via inline assembly). But you still have the overhead in the function call and unnecessary checks to use other paths - it really would benefit from being inlined. I don't know how to make the compiler inline it, though.

Which raises the question: why on earth isn't it just using __udivmodti4 directly, since we do care about the remainder here. It appears that it's using the result of the division to calculate the remainder manually:

cmp r12, rdi ; Check that the index is > 0. On first entry to this loop it's endIndex. jle loc_19eb ; Basically, crash if index is <= 0. lea rbx, qword [r12-1] ; Decrement index of current word in `words`. mov r13, qword [r15+r12*8-8] ; Load next word from `words` argument. mov rdi, r13 ; low half of 128-bit argument "a" (dividend) for __udivti3. rsi is high half, and is initially set to zero but is the remainder in subsequent iterations. mov rdx, r14 ; low half of 128-bit argument "b" (divisor) for __udivti3. xor ecx, ecx ; high half of argument "b". Top 32 bits are known zero from earlier in the function. call __udivti3 ; rax = a / b mov rdi, qword [rbp+var_38] ; words.count. mov rcx, rax mul r14 ; Quotient * divisor (the function argument, not the argument to __udivti3. mov rsi, r13 ; The aforementioned next word from `words`. sub rsi, rax ; Remainder = current word - (Quotient * divisor) mov qword [r15+r12*8-8], rcx ; Save the quotient over the current word in `words`. mov r12, rbx ; Just shifting the current index around registers. cmp rdi, rbx ; Check that index is != words.startIndex. jne loc_19a0 ; Repeat this whole block while it isn't.

I don't see how that could be faster than using __udivmodti4, since the latter calculates the remainder no matter what - it merely drops it silently if its 'rem' out argument is NULL, as it is in the use via __udivti3. Weird. Seems like an optimiser bug.

oscbyspro · 2023-11-09T08:36:17Z

Oh, neat. I appreciate your feedback. I'll address each point in time.

Note

The algorithm is fundamentally sign-magnitude, and it would be simpler to call magnitude.words and skip the form-two's-complement method. I have tried it, and it performs similarly when inlinable, but DoubleWidth<T> performs 3x worse when warring against module boundaries, compared to calling self.words.

itingliu · 2023-11-10T01:37:30Z

@swift-ci please test

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

Some comments w.r.t. expecting Swift.BinaryInteger semantics plus some formatting. Co-authored-by: Wade Tregaskis <[email protected]> Co-authored-by: Tina Liu <[email protected]>

The withUnsafeTemporaryAllocation(of:capacity:) makes no guarantees about the size of the allocated buffer pointer, even though the current implementation covers exactly the requested number of elements.

oscbyspro · 2023-11-10T14:19:18Z

Hm. The documentation of Swift.withUnsafeTemporaryAllocation(of:capacity:) states:

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?

It sounds like it refers to the buffer in the closure because that's how Swift uses the term capacity elsewhere:

for capacity in 0 ... 32 {
    String(unsafeUninitializedCapacity: capacity) {
        print("capacity: \(capacity), count: \($0.count)")
        precondition($0.count == Swift.max(16, capacity))
        return 0
    }
}

itingliu

@glessard Do you want to take a look, specifically about the usage around the unsafemutableBufferPointer? And also if we should consider sinking these into stdlib?

itingliu · 2023-11-11T00:35:14Z

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

+
+        // Add a minus sign to negative values.
+        if isLessThanZero {
+            ascii.formIndex(before: &asciiIndex)


For my information how do we know that asciiIndex is guaranteed to be >= ascii.startIndex, and this surely doesn't go out of bound?

Sorry, I didn't see your question earlier. The capacity is given by the following expression so please have a look at number-of-decimal-digits-in-a-binary-integer.

let capacity = maxDecimalDigitCountForUnsignedInteger(bitWidth: words.count * UInt.bitWidth) + (isLessThanZero ? 1 : 0)

Since capacity is only an upper bound, I've used the fullest possible width as an uncomplicated approximaton (it can be trimmed a bit). In the loop, we then divide by 10 repeatedly and write each digit back-to-front. The loop is on repeat, so we write at least one. The minimum capacity is 1 when words.count == 0, and 1 is added when isLessThanZero == true.

let isLessThanZero = isSigned && Int(bitPattern: words.last ?? 0) < 0 // tests the last bit when isSigned == true

Important edge case thing-y

The loop branchlessly decrements asciiIndex by radix.exponent (the digits written plus zeros for padding) each iteration, so it might be negative at the end of the loop. This is one reason why we set it to writeIndex afterwards, the other is trimming the last iteration's padding. This assumes Index == Int (alternatives exist).

For what it's worth, no index goes out of bounds or assumes typing in v2. I've also replaced all assertions with preconditions since the difference in performance was insignificant.

Sources/FoundationInternationalization/Formatting/Number/BinaryInteger+FormatStyle.swift

itingliu · 2023-11-11T01:08:18Z

@swift-ci please test

wadetregaskis · 2023-11-13T01:30:09Z

Hm. The documentation of Swift.withUnsafeTemporaryAllocation(of:capacity:) states:

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?

It sounds like it refers to the buffer in the closure because that's how Swift uses the term capacity elsewhere:
for capacity in 0 ... 32 {
    String(unsafeUninitializedCapacity: capacity) {
        print("capacity: \(capacity), count: \($0.count)")
        precondition($0.count == Swift.max(16, capacity))
        return 0
    }
}

I believe it means that $0.count is guaranteed to be >= capacity, not merely ==.

It makes sense as such. In the String case specifically, hypothetically for a capacity <= 15 it can provide a buffer pointing to the actual stack space allocated for the String (using the String's "inline" form to avoid a heap allocation), so it might as well just pass 15 in that case. And of course for its non-inline forms its heap allocations have to be multiples of 16 at least (malloc on Apple platforms has never supported allocation granularities smaller than 16 bytes).

It turns out the current implementation of String does do this optimisation - of directly initialising the stack memory for capacities <= 15.

For withUnsafeTemporaryAllocation(of:capacity:) specifically, it probably has to preserve stack alignment - so the actual capacity will likely always be some multiple of 8 or whatever the alignment requirements are - and of course if it does do a heap allocation it's back to malloc slab sizes.

Talking in 64-bit here. 32-bit the same principles apply but the numbers might be different.

oscbyspro · 2023-11-14T12:32:20Z

I have added a signed and unsigned tiny integer benchmark to see whether fast pathing is worth it. I've also measured an @inlinable version of the original implementation to see what is and isn't a module boundary issue. The original fast path does, indeed, improve the non-@inlinable performance of tiny integers, but Swift's standard library beats it. It seems, to me, that returning a String and stack allocating small integers is more impactful.

Note

I've changed to Xcode 15.0.1, so everything is faster.

glessard · 2023-11-14T20:13:40Z

The implementation may allocate a larger buffer pointer than is strictly necessary to contain capacity values of type. The behavior of a program that attempts to access any such additional storage is undefined.

Does that speak of the buffer in the closure, or is it just implementation details leaking through documentation?

I think this is a leak of implementation details, coupled with an API documentation deserving some improvements.

oscbyspro · 2023-11-15T12:26:06Z

Quick update: I'm working on a new version that should be speedy for small integers. I'll try to publish it within the next couple of days. It looks promising. It involves returning a String instead of an ArraySlice, so I hope that's OK. The String's ASCII sequence can be accessed using withUTF8(_:).

- I changed the return type from ArraySlice to String. - I changed the ICU number formatter's parameter from ArraySlice to String. - I changed the structure, although it's similar to `v1`. - I added `nextUp` to `log10(2)` because it's not obvious that it rounds up (it does).

oscbyspro · 2023-11-16T09:15:07Z

I'm back! I've updated the pull request to v2 (in the original post).

Summary

The return type is String, and tiny integers now beat the standard library when @inlinable.

Changes

I changed the return type from ArraySlice to String.
I changed the ICU number formatter's parameter from ArraySlice to String.
I changed the structure, although it's similar to v1.
I added nextUp to log10(2) because it's not obvious that it rounds up (it does).

Comments

The ICU format is undocumented, and value: String does not indicate ASCII.

oscbyspro · 2023-11-16T09:16:14Z

As a side note, I saw that BinaryInteger's formatted() method casts to Int, trapping big integers:

extension BinaryInteger {

    /// Format `self` using `IntegerFormatStyle()`
    public func formatted() -> String {
        IntegerFormatStyle().format(Int(self))
    }
}

oscbyspro · 2023-11-17T10:36:30Z

I have added the 1,000,000th Fibonacci sequence element (210k decimal digits) to the table 😃

glessard · 2023-11-17T23:15:25Z

@swift-ci please test

…`if`.

- Slicing (rebasing) the magnitude per review comments. - Changed division parameter to buffer from buffer slice. - Changed assertions to preconditions because the difference in performance is small. - UnsafeMutableBufferPointer.init(start:count:) takes an optional (no force unwrapping).

oscbyspro · 2023-12-04T08:32:49Z

I sense no enthusiasm—and nobody has made it a priority—should I close it? I know  has a habit of keeping things in limbo, but I don't. I'd much rather cross this PR off my list and go somewhere my contributions are appreciated 🤷‍♂️

itingliu

Apologies for the delayed reply. I personally do want to take this one ✌️

@swift-ci please test

itingliu · 2023-12-04T23:40:29Z

@swift-ci please test

oscbyspro added 2 commits November 1, 2023 18:27

Undid internal to public numeric string representation typo.

6e9bc2e

Some minor cleanup.

b971063

wadetregaskis reviewed Nov 9, 2023

View reviewed changes

itingliu reviewed Nov 10, 2023

View reviewed changes

oscbyspro and others added 4 commits November 10, 2023 09:13

Apply suggestions from code review

c1fa8bf

Some comments w.r.t. expecting Swift.BinaryInteger semantics plus some formatting. Co-authored-by: Wade Tregaskis <[email protected]> Co-authored-by: Tina Liu <[email protected]>

Changed to using subscripts per review.

1a97230

Some formatting along the lines of previous review comments.

f1bab7b

Updated temporary allocation steps while solving missing spacer nit.

907667a

The withUnsafeTemporaryAllocation(of:capacity:) makes no guarantees about the size of the allocated buffer pointer, even though the current implementation covers exactly the requested number of elements.

itingliu reviewed Nov 11, 2023

View reviewed changes

oscbyspro added 2 commits November 16, 2023 09:09

Updated numeric string representation tests.

9c72cb0

oscbyspro added 2 commits November 18, 2023 08:20

Edited some comments. Renamed a variable. Replaced a guard with an …

03179ab

…`if`.

itingliu approved these changes Dec 4, 2023

View reviewed changes

itingliu merged commit 845652a into swiftlang:main Dec 5, 2023
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A word-based binary integer encoding approach. #306

A word-based binary integer encoding approach. #306

oscbyspro commented Nov 1, 2023 •

edited

Loading

itingliu commented Nov 1, 2023

wadetregaskis commented Nov 1, 2023

iCharlesHu commented Nov 7, 2023

wadetregaskis left a comment

wadetregaskis Nov 9, 2023

oscbyspro Nov 9, 2023 •

edited

Loading

wadetregaskis Nov 12, 2023

oscbyspro Nov 16, 2023

wadetregaskis Nov 9, 2023

oscbyspro Nov 11, 2023

wadetregaskis Nov 13, 2023 •

edited

Loading

oscbyspro commented Nov 9, 2023 •

edited

Loading

itingliu commented Nov 10, 2023

oscbyspro commented Nov 10, 2023 •

edited

Loading

itingliu left a comment

itingliu Nov 11, 2023

oscbyspro Nov 15, 2023 •

edited

Loading

oscbyspro Nov 20, 2023

itingliu commented Nov 11, 2023

wadetregaskis commented Nov 13, 2023 •

edited

Loading

oscbyspro commented Nov 14, 2023

glessard commented Nov 14, 2023

oscbyspro commented Nov 15, 2023 •

edited

Loading

oscbyspro commented Nov 16, 2023

oscbyspro commented Nov 16, 2023 •

edited

Loading

oscbyspro commented Nov 17, 2023

glessard commented Nov 17, 2023

oscbyspro commented Dec 4, 2023

itingliu left a comment

itingliu commented Dec 4, 2023

A word-based binary integer encoding approach. #306

A word-based binary integer encoding approach. #306

Conversation

oscbyspro commented Nov 1, 2023 • edited Loading

Description

Measurements

The results on MacBook Pro (13-inch, M1, 2020), Xcode 15.0.1, in seconds

itingliu commented Nov 1, 2023

wadetregaskis commented Nov 1, 2023

iCharlesHu commented Nov 7, 2023

wadetregaskis left a comment

Choose a reason for hiding this comment

wadetregaskis Nov 9, 2023

Choose a reason for hiding this comment

oscbyspro Nov 9, 2023 • edited Loading

Choose a reason for hiding this comment

wadetregaskis Nov 12, 2023

Choose a reason for hiding this comment

oscbyspro Nov 16, 2023

Choose a reason for hiding this comment

wadetregaskis Nov 9, 2023

Choose a reason for hiding this comment

oscbyspro Nov 11, 2023

Choose a reason for hiding this comment

wadetregaskis Nov 13, 2023 • edited Loading

Choose a reason for hiding this comment

oscbyspro commented Nov 9, 2023 • edited Loading

itingliu commented Nov 10, 2023

oscbyspro commented Nov 10, 2023 • edited Loading

itingliu left a comment

Choose a reason for hiding this comment

itingliu Nov 11, 2023

Choose a reason for hiding this comment

oscbyspro Nov 15, 2023 • edited Loading

Choose a reason for hiding this comment

Important edge case thing-y

oscbyspro Nov 20, 2023

Choose a reason for hiding this comment

itingliu commented Nov 11, 2023

wadetregaskis commented Nov 13, 2023 • edited Loading

oscbyspro commented Nov 14, 2023

glessard commented Nov 14, 2023

oscbyspro commented Nov 15, 2023 • edited Loading

oscbyspro commented Nov 16, 2023

Summary

Changes

Comments

oscbyspro commented Nov 16, 2023 • edited Loading

oscbyspro commented Nov 17, 2023

glessard commented Nov 17, 2023

oscbyspro commented Dec 4, 2023

itingliu left a comment

Choose a reason for hiding this comment

itingliu commented Dec 4, 2023

oscbyspro commented Nov 1, 2023 •

edited

Loading

oscbyspro Nov 9, 2023 •

edited

Loading

wadetregaskis Nov 13, 2023 •

edited

Loading

oscbyspro commented Nov 9, 2023 •

edited

Loading

oscbyspro commented Nov 10, 2023 •

edited

Loading

oscbyspro Nov 15, 2023 •

edited

Loading

wadetregaskis commented Nov 13, 2023 •

edited

Loading

oscbyspro commented Nov 15, 2023 •

edited

Loading

oscbyspro commented Nov 16, 2023 •

edited

Loading