A word-based binary integer encoding approach. #306
Conversation
These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.
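For orientation, a minimal editorial sketch of the shape of this approach (not the PR's actual code; decimalDescription and encodeDecimalDigits are hypothetical names): the generic entry point only copies the integer's words into a temporary buffer, and a non-generic, word-based core does the mutating work.

// Hypothetical sketch: copy the words so a non-generic core can divide them in place.
func decimalDescription<T: BinaryInteger>(of value: T) -> String {
    let words = value.words
    return withUnsafeTemporaryAllocation(of: UInt.self, capacity: words.count) { buffer in
        _ = buffer.initialize(from: words) // copy, so the core may mutate in place
        return encodeDecimalDigits(words: buffer, isSigned: T.isSigned)
    }
}

// Stand-in for the non-generic core: the real algorithm repeatedly divides the
// words by a power of ten and writes ASCII digits back to front.
func encodeDecimalDigits(words: UnsafeMutableBufferPointer<UInt>, isSigned: Bool) -> String {
    "0" // stub for illustration only
}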
@wadetregaskis Do you want to take a look?
Yeah, I'm happy to, but it probably won't be until later this week. I already looked over it a bit when it was sent to me originally, and I think overall it looks good. I might have some questions, but don't wait on me if everyone else is happy with this patch already.
@swift-ci please test
Looks pretty good overall, to me.
Since we're now focusing on performance, I spent a fair bit of time analysing the actual machine code (for x86-64) that this results in (in 'release' build configuration). It's impressive what the compiler is able to do to reduce all this down to some pretty simple and short loops.
Although if I see it use `lea` to decrement a value by one again, I shall be very annoyed. 😜
// - Int.init(some BinaryFloatingPoint) rounds to zero.
// - Double.init(exactly:) and UInt.init(_:) for correctness.
// - log10(2.0) is: 1.0011010001000001001101⌈01...⌉ * 2^(-2).
return Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1
This doesn't work correctly for `bitWidth`s of 1 or less. Those aren't values that will appear at runtime, but it might be worth putting an assert here to help ensure that, just in case some future refactor exposes this to wider use. And/or document this limitation in the function documentation.
Also, changing the parameter type to `UInt` will at least preclude negative values.
The function returns 1 for bit widths of 0 and 1, and 1 is an upper bound for the decimal digits required to represent the at-most-one-bit values 0 and 1. UInt(_:) traps on negative values; I can add a comment about it. W.r.t. correctness, it also traps for bit widths around 2^53 or so that aren't exactly representable by Double.
import Foundation // in Swift.playground
func maxDecimalDigitCount(bitWidth: Int) -> Int {
Int(Double(exactly: UInt(bitWidth))! * log10(2.0)) + 1
}
for bitWidth in 0 ... UInt64.bitWidth {
let maxCount = maxDecimalDigitCount(bitWidth: bitWidth)
let maxValue = UInt64.max >> (UInt.max.bitWidth - bitWidth)
precondition(String(maxValue, radix: 10).count <= maxCount)
precondition(maxValue.bitWidth - maxValue.leadingZeroBitCount == bitWidth)
precondition(maxValue.nonzeroBitCount == bitWidth)
print("bitWidth: \(bitWidth), maxCount: \(maxCount), maxValue: \(maxValue)")
}
I apologise, you're quite right. I misunderstood the algorithm here (I think re. the rounding behaviour). I've now tested it empirically and indeed it works as intended.
The value of 1 for bit width 0 is okay - that case shouldn't ever arise in practice anyway - although a `precondition(0 < bitWidth, …)` might be wise (or an assert).
The zero bit width case occurs when an integer's `words` collection is empty. I don't have an integer like that, but you could perhaps justify an unsigned big integer normalizing zero to `[]`. In any case, I have unit tests for empty collections, which result in `"0"`.
var remainder = UInt.zero

for index in (dividend.startIndex ..< dividend.endIndex).reversed() {
    (dividend.base[index], remainder) = divisor.dividingFullWidth((high: remainder, low: dividend.base[index]))
It’s a shame that this results in the use of `__udivti3` from compiler-rt instead of a simple `divq` instruction (on x86-64; presumably there’s an equivalent instruction on AArch64). Evidently the compiler doesn’t believe that the result will never overflow (`__udivti3` does the extra work of checking for and trapping on overflow, which is a waste here).
Perhaps there's some way to convince the compiler that the checks are unnecessary? Overflow should be impossible in practice, by the inherent constraints of its use here.
That's an astute observation! I'd love to know if there's a better way to do it. I have an open issue in Numberick (#59) about division being slower than expected in some cases. I wrote a note about udivmodti4 taking up most of the time in Instruments, but I did not know how to proceed beyond that. It's a bit funky that it performs overflow checking when it truncates the result anyway:
print(UInt(123).dividingFullWidth((high: 122, low: UInt.max))) // (quotient: UInt.max, remainder: 122)
print(UInt(123).dividingFullWidth((high: 123, low: 00000000))) // (quotient: 00000000, remainder: 000)
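For reference, an editorial sketch of this word-by-word division pattern as a self-contained helper (illustrative names, not the PR's code; the example values assume a 64-bit UInt). Each step feeds the running remainder back in as the high half of the next full-width division, so every per-word quotient fits in one word.

// Divide a little-endian word buffer in place by a single nonzero word and
// return the remainder. The running remainder is always less than the divisor,
// so dividingFullWidth never overflows here.
func formQuotient(dividing dividend: UnsafeMutableBufferPointer<UInt>, by divisor: UInt) -> UInt {
    var remainder = UInt.zero
    for index in dividend.indices.reversed() {
        (dividend[index], remainder) = divisor.dividingFullWidth((high: remainder, low: dividend[index]))
    }
    return remainder
}

// Example (64-bit words): (2^64 + 5) / 10 == 1844674407370955162 remainder 1.
var words: [UInt] = [5, 1] // little-endian: low word first
let remainder = words.withUnsafeMutableBufferPointer { formQuotient(dividing: $0, by: 10) }
precondition(words == [1844674407370955162, 0] && remainder == 1)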
I was mistaken about `__udivti3` re. the overflow case. It doesn't actually trap, it just truncates, as you observed. That's what one gets for trusting StackOverflow. 😝
For completeness… it's essentially using `__udivmodti4`, because `__udivti3` is basically an alias; it's two instructions: one to zero out r8 to effect a `NULL` for the out parameter that reports the remainder, and then a jump to `__udivmodti4` (which incidentally is a waste of space, since `__udivmodti4` is the very next procedure, so it'd flow into it automatically without the `jmp`).
`__udivmodti4` is a little complicated, both at the C level and in the resulting assembly (67 instructions on x86-64, and branchy). It does use `divq`, ultimately (it's actually hard-coded to on x86-64, via inline assembly). But you still have the overhead of the function call and the unnecessary checks for other paths - it really would benefit from being inlined. I don't know how to make the compiler inline it, though.
Which raises the question: why on earth isn't it just using `__udivmodti4` directly, since we do care about the remainder here? It appears that it's using the result of the division to calculate the remainder manually:
cmp r12, rdi ; Check that the index is > 0. On first entry to this loop it's endIndex.
jle loc_19eb ; Basically, crash if index is <= 0.
lea rbx, qword [r12-1] ; Decrement index of current word in `words`.
mov r13, qword [r15+r12*8-8] ; Load next word from `words` argument.
mov rdi, r13 ; low half of 128-bit argument "a" (dividend) for __udivti3. rsi is high half, and is initially set to zero but is the remainder in subsequent iterations.
mov rdx, r14 ; low half of 128-bit argument "b" (divisor) for __udivti3.
xor ecx, ecx ; high half of argument "b". Top 32 bits are known zero from earlier in the function.
call __udivti3 ; rax = a / b
mov rdi, qword [rbp+var_38] ; words.count.
mov rcx, rax
mul r14 ; Quotient * divisor (the function argument, not the argument to __udivti3).
mov rsi, r13 ; The aforementioned next word from `words`.
sub rsi, rax ; Remainder = current word - (Quotient * divisor)
mov qword [r15+r12*8-8], rcx ; Save the quotient over the current word in `words`.
mov r12, rbx ; Just shifting the current index around registers.
cmp rdi, rbx ; Check that index is != words.startIndex.
jne loc_19a0 ; Repeat this whole block while it isn't.
I don't see how that could be faster than using `__udivmodti4`, since the latter calculates the remainder no matter what - it merely drops it silently if its 'rem' out argument is `NULL`, as it is in the use via `__udivti3`. Weird. Seems like an optimiser bug.
Oh, neat. I appreciate your feedback. I'll address each point in time.
Note: The algorithm is fundamentally sign-magnitude, and it would be simpler to call…
@swift-ci please test
Some comments w.r.t. expecting Swift.BinaryInteger semantics plus some formatting.
Co-authored-by: Wade Tregaskis <[email protected]>
Co-authored-by: Tina Liu <[email protected]>
`withUnsafeTemporaryAllocation(of:capacity:)` makes no guarantees about the size of the allocated buffer pointer, even though the current implementation provides exactly the requested number of elements.
Hm. The documentation of Swift.withUnsafeTemporaryAllocation(of:capacity:) states:
Does that speak of the buffer in the closure, or is it just implementation details leaking through the documentation? It sounds like it refers to the buffer in the closure, because that's how Swift uses the term capacity elsewhere:
for capacity in 0 ... 32 {
    String(unsafeUninitializedCapacity: capacity) {
        print("capacity: \(capacity), count: \($0.count)")
        precondition($0.count == Swift.max(16, capacity))
        return 0
    }
}
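If the documentation does allow a larger-than-requested buffer, one defensive pattern is to work only with a slice of the requested size. An editorial sketch under that assumption (not the PR's code):

// Assuming the temporary buffer may be larger than requested, touch only the
// first `capacity` elements by rebasing a prefix slice.
let capacity = 4
let sum: UInt = withUnsafeTemporaryAllocation(of: UInt.self, capacity: capacity) { buffer in
    precondition(buffer.count >= capacity)
    let prefix = UnsafeMutableBufferPointer(rebasing: buffer.prefix(capacity))
    prefix.initialize(repeating: 1)
    defer { prefix.deinitialize() }
    return prefix.reduce(0, +)
}
precondition(sum == 4)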
@glessard Do you want to take a look, specifically at the usage around the `UnsafeMutableBufferPointer`? And also whether we should consider sinking these into the stdlib?
// Add a minus sign to negative values.
if isLessThanZero {
    ascii.formIndex(before: &asciiIndex)
For my information, how do we know that `asciiIndex` is guaranteed to be >= `ascii.startIndex`, and that this surely doesn't go out of bounds?
Sorry, I didn't see your question earlier. The `capacity` is given by the following expression, so please have a look at number-of-decimal-digits-in-a-binary-integer.

let capacity = maxDecimalDigitCountForUnsignedInteger(bitWidth: words.count * UInt.bitWidth) + (isLessThanZero ? 1 : 0)

Since `capacity` is only an upper bound, I've used the fullest possible width as an uncomplicated approximation (it can be trimmed a bit). In the loop, we then divide by `10` repeatedly and write each digit back-to-front. The loop is on `repeat`, so we write at least one digit. The minimum capacity is `1` when `words.count == 0`, and `1` is added when `isLessThanZero == true`.
let isLessThanZero = isSigned && Int(bitPattern: words.last ?? 0) < 0 // tests the last bit when isSigned == true
Important edge case thing-y: the loop branchlessly decrements `asciiIndex` by `radix.exponent` (the digits written plus zeros for padding) each iteration, so it might be negative at the end of the loop. This is one reason why we set it to `writeIndex` afterwards; the other is trimming the last iteration's padding. This assumes `Index == Int` (alternatives exist).
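As an editorial aside, here is a self-contained sketch of that back-to-front pattern for a single machine word (illustrative code and names, not the PR's; the real loop runs over the whole word buffer and peels off several digits per division):

import Foundation

// Write decimal digits back to front into a buffer sized by the worst-case
// digit count, then prepend the minus sign; single-word illustration only.
func decimalString(_ value: Int) -> String {
    let isLessThanZero = value < 0
    var magnitude = value.magnitude
    let capacity = Int(Double(UInt.bitWidth) * log10(2.0).nextUp) + 1 + (isLessThanZero ? 1 : 0)
    return withUnsafeTemporaryAllocation(of: UInt8.self, capacity: capacity) { ascii in
        var index = capacity
        repeat {                                         // at least one digit, even for zero
            index -= 1
            ascii[index] = UInt8(ascii: "0") &+ UInt8(magnitude % 10)
            magnitude /= 10
        } while magnitude != 0
        if isLessThanZero {
            index -= 1
            ascii[index] = UInt8(ascii: "-")
        }
        return String(decoding: ascii[index ..< capacity], as: UTF8.self)
    }
}
precondition(decimalString(-1234) == "-1234")
precondition(decimalString(0) == "0")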
For what it's worth, no index goes out of bounds or assumes typing in v2. I've also replaced all assertions with preconditions since the difference in performance was insignificant.
@swift-ci please test
I believe it means that… It makes sense as such. In the… It turns out the current implementation of… For… (Talking in 64-bit here; on 32-bit the same principles apply, but the numbers might be different.)
I have added a signed and unsigned tiny integer benchmark to see whether fast pathing is worth it. I've also measured an…
Note: I've changed to Xcode 15.0.1, so everything is faster.
I think this is a leak of implementation details, coupled with API documentation that deserves some improvement.
Quick update: I'm working on a new version that should be speedy for small integers. I'll try to publish it within the next couple of days. It looks promising. It involves returning a String instead of an ArraySlice, so I hope that's OK. The String's ASCII sequence can be accessed using…
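Editorial aside, since the sentence above is cut off: assuming the intended accessor is the UTF-8 view, an ASCII-only String's UTF-8 code units are exactly its ASCII byte sequence.

// Assumption: the ASCII bytes of the formatted String are read via its UTF-8 view.
let formattedText = "-1234"
let asciiBytes = Array(formattedText.utf8)
precondition(asciiBytes == [45, 49, 50, 51, 52]) // "-", "1", "2", "3", "4"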
- I changed the return type from ArraySlice to String.
- I changed the ICU number formatter's parameter from ArraySlice to String.
- I changed the structure, although it's similar to `v1`.
- I added `nextUp` to `log10(2)` because it's not obvious that it rounds up (it does).
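A small editorial illustration of why the upward-rounded constant matters for the digit-count bound (my reading of the commit note above, not code from the PR):

import Foundation

// Taking .nextUp makes it explicit that the constant over-estimates log10(2),
// so the computed bound can never under-count the digits needed.
let bitWidth = 64
let bound = Int(Double(bitWidth) * log10(2.0).nextUp) + 1
precondition(bound == 20)                     // 20 digits cover every 64-bit value
precondition(String(UInt64.max).count == 20)  // 18446744073709551615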
I'm back! I've updated the pull request to…

Summary
The return type is String, and tiny integers now beat the standard library when…

Changes
…
Comments
The ICU format is undocumented, and…
As a side note, I saw that BinaryInteger's `formatted()` looks like this:
extension BinaryInteger {
    /// Format `self` using `IntegerFormatStyle()`
    public func formatted() -> String {
        IntegerFormatStyle().format(Int(self))
    }
}
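The comment above is cut off; presumably the point is that this default funnels through Int, which cannot represent wider values. An editorial illustration of that reading:

// Assumption about the point being made: Int(self) traps for values outside
// Int's range, so this default cannot serve integers wider than a machine word.
let value = UInt64.max
precondition(Int(exactly: value) == nil) // Int(value) would trap here (on 64-bit platforms)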
I have added the 1,000,000th Fibonacci sequence element (210k decimal digits) to the table 😃
@swift-ci please test
- Slicing (rebasing) the magnitude per review comments.
- Changed division parameter to buffer from buffer slice.
- Changed assertions to preconditions because the difference in performance is small.
- UnsafeMutableBufferPointer.init(start:count:) takes an optional (no force unwrapping).
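On the last point, a minimal editorial illustration (not the PR's code):

// UnsafeMutableBufferPointer.init(start:count:) accepts an Optional base pointer,
// so an empty buffer needs no force-unwrapping of baseAddress.
let empty = UnsafeMutableBufferPointer<UInt>(start: nil, count: 0)
precondition(empty.isEmpty && empty.baseAddress == nil)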
I sense no enthusiasm, and nobody has made it a priority, so should I close it? I know … has a habit of keeping things in limbo, but I don't. I'd much rather cross this PR off my list and go somewhere my contributions are appreciated 🤷♂️
Apologies for the delayed reply. I personally do want to take this one ✌️
@swift-ci please test
@swift-ci please test
Salutations! I come in peace ✌️
Note
This pull request builds on an earlier contribution.
Description
These changes aim to achieve two things. The first is making the binary integer text encoding algorithm non-generic so that models such as DoubleWidth perform well. The second is performing mutating division, making arbitrary precision faster. Copying the binary integer's words to a temporary allocation achieves both. Since the new algorithm takes a collection of words, it can also test big integers without big integer models.
Measurements
I have published a small package with some benchmarks.
- `v0`: the `old` version
- `v0x`: the `old` version + `@inlinable`
- `v1`: the `1st` version
- `v1x`: the `1st` version + `@inlinable` (2x) + `@usableFromInline` (1x)
- `v2`: the `2nd` version
- `v2x`: the `2nd` version + `@inlinable` (2x) + `@usableFromInline` (1x)

The results on MacBook Pro (13-inch, M1, 2020), Xcode 15.0.1, in seconds: