Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed IntegerFormatStyle for values > Int64.max. #262

Merged
merged 52 commits into from
Oct 31, 2023

Conversation

wadetregaskis
Copy link
Contributor

If I understand the NumberFormatter implementation in the ICU library, this should handle arbitrary-precision numbers, as long as they have a description which returns their POSIX-formatted representation, although that's not in the unit tests because it requires a working arbitrary-precision type, and I assume it's not acceptable to pull in a BigInt package for this purpose, and my efforts to make a quick-and-dirty one for testing purposes led only to tears and threats against Xcode's life.

If I understand the NumberFormatter implementation in the ICU library, this _should_ handle arbitrary-precision numbers, as long as they have a `description` which returns their POSIX-formatted representation, although that's not in the unit tests because it requires a working arbitrary-precision  type, and I assume it's not acceptable to pull in a BigInt package for this purpose, and my efforts to make a quick-and-dirty one for testing purposes led only to tears and threats against Xcode's life.
@itingliu
Copy link
Contributor

@swift-ci please test

1 similar comment
@itingliu
Copy link
Contributor

@swift-ci please test

@itingliu
Copy link
Contributor

To be precise, ICU requires the string to be a "numeric string" as defined here: https://speleotrove.com/decimal/daconvs.html#refnumsyn

Ideally we would want to implement our own if we are to follow the spec, but realistically I doubt swift's implementation of description will ever change, so it'd probably work fine. Using description however doesn't feel right... I'm curious to hear what others think. @parkera @glessard ?

@parkera
Copy link
Contributor

parkera commented Sep 20, 2023

We've certainly been known to change descriptions for various reasons over the years, so I would vote for using a dedicated internal function to be clear what the requirements are.

@wadetregaskis
Copy link
Contributor Author

Ideally we would want to implement our own if we are to follow the spec, but realistically I doubt swift's implementation of description will ever change, so it'd probably work fine. Using description however doesn't feel right... I'm curious to hear what others think. @parkera @glessard ?

It's not just what the built-in integer types do for description, it's 3rd party types too. My reason for doing this, for example, is because I want to use arbitrary-precision integers (via BigInt) while still having usable display in GUIs (because big numbers especially need things like thousands separators).

There are many 'BigInt' packages for Swift, and while I didn't exhaustively check them, I did look at most and they all do render the desired output from their description (because none of them remotely try to produce localised or "nicely" formatted results, they all do the same basic thing of just splatting out the latin decimal digits).

I did initially intend to write a BinaryInteger-to-Unicode-number-string function for this, but deferred it given:

  • All my examples already produce the right output.
  • I don't know what BinaryInteger.description intends to do as-is (e.g. maybe it's already defined to output in the desired format? Sounds like no, though, from @parkera's comment).
  • I wasn't sure how much effort it'd be to write an efficient implementation.
  • I couldn't conclusively locate the actual format specification, from ICU. They refer vaguely to some other spec document, which doesn't seem to contain a clear number format specification.

If everyone's generally happy with this change otherwise, then I'm happy to dig into this last aspect and address that too. I just didn't want to do that work pre-emptively without knowing if this fix would be welcome to begin with.

@itingliu
Copy link
Contributor

itingliu commented Sep 20, 2023

Thinking more about this I think it'd be more reasonable if we implement a string following the spec of on our own

The problem with relying on .description is that there's no API contract for what that returns, so the fact that it works is pure coincidence. On the contrary we have all the internal values for BinaryInteger and BinaryFloatingPoint thanks to the protocol requirement, which would allow us to implement such function reliably.

I wasn't sure how much effort it'd be to write an efficient implementation.

I understand this concern, but I think it's totally fine to start with something that works intuitively, and defer non-obvious optimization later if needed. I'm all for avoiding unnecessary premature optimization

I couldn't conclusively locate the actual format specification, from ICU. They refer vaguely to some other spec document, which doesn't seem to contain a clear number format specification.

https://speleotrove.com/decimal/daconvs.html#refnumsyn this is the spec. I agree that it's unclear, but in my opinion it's not much different from how opaque .description is

I just didn't want to do that work pre-emptively without knowing if this fix would be welcome to begin with.

Yes we do welcome changes!

…tStyle`.

This patch introduces an explicit `numericStringRepresentation` property for `BinaryInteger`, that can be relied on to produce exactly the decimal string expected by the ICU library rather than relying on `BinaryInteger`'s default `description` (which happens to produce the same format today but isn't guaranteed to in future, and can be overridden by 3rd party implementations of `BinaryInteger`).

This uses a non-trivial algorithm, nominally for performance although it has not been benchmarked.  The trivial (relatively-speaking) algorithm is to just iteratively divide `self` by 10, which produces one digit of the result per iteration.  This is in fact what is done for all values that fit into a natural machine word (`UInt`), as a fast path optimisation.  But for `BinaryInteger`s which are larger than a machine word - and especially for those that have unbounded size, such as arbitrary-precision integers, dividing that entire number by 10 may be quite expensive.  So a potential optimisation - as used in this patch - is to leverage the fact that every `BinaryInteger` is defined in terms of one or more `UInt`s, by (essentially) processing a whole `UInt` at a time.  There's still a divide-by-ten loop involved per UInt, but integer division on a natural machine word is likely much faster than those same divisions on the full arbitrary-precision number.
@wadetregaskis
Copy link
Contributor Author

Sorry for the delay - I installed Sonoma on my Macs last week and it took pretty much the entire week to repair the damage it did. That's still a work in progress, but I can at least get Xcode to launch again, so I've picked this up.

I do have a working implementation of the additional changes, to not rely on description.

Unfortunately I rebased and I'm now blocked by the fact that top of tree doesn't build anymore (even before I reapply my patches). 1,513 errors, things like "'Predicate' is only available in macOS 9999 or newer" and "CircularReference". I'll wait until those are all fixed before resuming my work.

@itingliu
Copy link
Contributor

Unfortunately I rebased and I'm now blocked by the fact that top of tree doesn't build anymore (even before I reapply my patches). 1,513 errors, things like "'Predicate' is only available in macOS 9999 or newer" and "CircularReference". I'll wait until those are all fixed before resuming my work.

Thanks for bringing this up to our attention! We're going to resolve this soon

@parkera
Copy link
Contributor

parkera commented Sep 26, 2023

Unfortunately I rebased and I'm now blocked by the fact that top of tree doesn't build anymore (even before I reapply my patches). 1,513 errors, things like "'Predicate' is only available in macOS 9999 or newer" and "CircularReference". I'll wait until those are all fixed before resuming my work.

Thanks for bringing this up to our attention! We're going to resolve this soon

#271

…may be faster.

It depends on how the compiler treats the literal - e.g. it might already replace that with a simple reference to the `.zero` constant - but this way removes the possibility of an inefficiency (e.g. actually creating a duplicate instance of the zero constant).
@wadetregaskis
Copy link
Contributor Author

Does this project support 32-bit platforms (specifically, where UInt is 32-bit)?

…h standalone test cases.

Technically not an issue when they're nested inside test cases, but still potentially confusing to a human.  Plus, this way they can be moved up a level without having to be renamed (if later tests would like to use the same helpers).
…l (from `test`), moved it up a level, and changed its `magnitude` parameter to be `UInt`.

This way it can be used from additional test cases in future.

The parameter type change is helpful for any future tests that want to use this with types that might not be initialisable from the necessary integer literal (e.g. an arbitrary-precision integer which doesn't support ExpressibleByIntegerLiteral from StaticBigInt and is thus limited to Int, yet its word magnitude will still be UInt as required by the BinaryInteger protocol and its word magnitude value will not fit into an Int).
… rather than simple assignment via subscripting, as the latter mysteriously crashes once the buffer gets large enough.

I wasn't able to precisely root-cause the crash, but it does seem like it's something to do with ContiguousArray's assignment operator implementation assuming `self` is valid, which of course it isn't (yet) in this case.  I didn't explore how `initializeElement(at:to)` does the (figuratively) same assignment without running afoul of this, just observed that it evidently does.
…ded.

The previous algorithm worked pretty amazingly well, considering how baseless it was in mathematical correctness (it was inspired by examples found in _many_ arbitrary-precision numerics libraries, suggesting many people before me also cargo-culted the calculation - but at least I *checked* it).

It took me a while to find an example case that would actually break it (using a BigInt library) and I wasn't able to find an example that was less than hundreds of digits.  I'm sure there's a much smaller value that would expose the problem with the prior code, I just didn't feel it a valuable exercise to go full maths nerd in order to find it.
@wadetregaskis
Copy link
Contributor Author

Getting there. I believe I've address all the correctness issues for positive numbers, but there's at least one remaining bug for negative.

Also, this surely won't work if UInt is 32-bit, but only because I've hard-coded the assumption of 64-bitness into the unit tests (the non-test code should work fine in any bitness - I was careful not to hard-code any assumptions there). I don't have any way (that I know of) to actually test the project in 32-bit mode.

I also still want to optimise the output generation - it's currently using a little over twice as much memory as it has to, due to the two-phase process of first assembling the 'word' blocks in reverse order, then glueing them together. I think I can combine those two (possibly making the code a little simpler overall).

…MagnitudePerWord on a real BigInt implementation.
This doesn't make the code shorter per-se but it does make it a bit simpler in most respects.  It does make the buffer indexing slightly more complicated.

But the benefit, of course, is that it no longer allocates twice as much memory as actually needed, and does less work at runtime to assemble the result.

Because the algorithm emits the resulting digits in reverse order (least significant to most), the buffers get filled from end to front.  Since the estimate of required buffer size is only an upper bound, not a precise value, that means sometimes there's some space at the front of the buffer that ends up unused.  Rather than doing a memmove to shunt the contents of the buffer up to its front - which would be pretty expensive for non-trivially-sized numbers - the `numericStringRepresentation` property now simply returns an `ArraySlice` instead of a `ContiguousArray`.  The callers don't care - same interface as far as they're concerned - and they quickly dispose of the `ArraySlice` which in turn disposes off the underlying `ContiguousArray` anyway.
@wadetregaskis
Copy link
Contributor Author

@swift-ci please test

@wadetregaskis
Copy link
Contributor Author

Close to finished now. I still need to do a final review and clean up any rough edges or outdated comments.

I added a dependency on attaswift/BigInt in order to unit test against a real arbitrary-precision integer. I kept that to a single patch so it's easy to exclude or revert - though that dependency is only used for testing purposes, I don't know if it's considered kosher to include it.

The unit tests still only work in 64-bit, too. Let me know if that needs to change.

…ned (it was one too high).

There _was_ actually a reason for this, originally, I vaguely recall… I just don't know now why.  What it was (before this patch) returning was actually the _maximum_ number of decimal digits for values of the given type, which is _not_ what it's actually needed for.
…tension type - now just "UInt" outright.

Of course this is equivalent to the prior extension point of "BinaryInteger where Self == UInt".  I think I just wrote it that way because of the particular frame of reference I had at the time.  Thankfully testing shows that the compiler doesn't require that more obtuse way of writing "UInt" in order to correctly detect and use the specialisation.
… the result directly using pretty trivial maths.

This does require doing a power operation, though, so it requires Darwin or equivalent (e.g. Glibc).  And since thats now required I removed the previously hard-coded constant for log10(2) in favour of just calling `log10`.  Although I'm not sure if the compiler is clever enough to simplify that to the actual value at compile time, to improve runtime performance. 🤔

It is possible to hard-code the guess values for a finite number of bit widths (of which we really only care about 32 & 64), which would avoid the maths and the depending on Darwin etc, but I couldn't find a way to do that which didn't result in the compiler complaining about unreachable code.  It only issues warnings in that case, but still, having pointless warnings forever attached to this code is unpalatable.
…nto the methods themselves.

The comments in question pertain to implementation details, that probably aren't of interest to mere users of these methods & properties.

At @iCharlesHu's request (swiftlang#262 (comment)).
…ndition, at @iCharlesHu's request (swiftlang#262 (comment)).

It should never trip as long as `UInt(exactly:)` and the implementation of the concrete type's words property aren't buggy.  If either is, then it's likely that the result of this method will be wrong anyway.  So the benefit of the check is probably just a faster and clearer failure.

But the downside of the precondition check seems quite small - slight increase in code size and a [hopefully very small] runtime cost.
…m the default of internal).

It's not used outside the file and is a fairly specific implementation detail of the `numericStringRepresentation` property, so it would be wise to discourage casual use.
…#262.

Arguably they should already have been there, to avoid exactly this kind of spurious noise when adding or removing dependencies, but, 🤷‍♂️.
Copy link
Contributor

@oscbyspro oscbyspro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I saw this pull request mentioned on the Swift forums and wanted to try it out. While doing so, I noticed that a debug assertion fails for fixed-width integers wider than UInt (see comment). Without this assertion, the rest works with my double-width integer in Numberick.

This is attempting to work around `bitWidth` not working properly (IMO) for FixedWidthIntegers (they return the bit width of the _type_, not the _actual value at hand_).

This is non-trivial to implement because signed numbers. 🙁

Specifically, the use of two's complement (for `BinaryInteger`'s underlying storage, in `words`) - the edge cases of -(every power of two) which fits into a signed type one bit smaller than +(every power of two).  So we need to detect powers of two - easy, right?  No.  We only have a `Numeric` value (for the `magnitude`), which is next to useless because it's _such_ a high-level protocol with very little API.

It _might_ be possible to make this work using just the `words`, by basically reinventing all the necessary arithmetic operations on `words`, but it'll be a bit hairy and duplicative of functionality already in the _actual_ `BinaryInteger` we're operating on (but blocked from us by the type system).

I'm checking this in for posterity - in case for some reason this turns out to be the only viable path - but I think all this is unnecessary… what we really want is the number of bits required to represent the _magnitude_, because that's what we're actually going to convert to a numeric string…
…on(intoEndOfBuffer:)` with an assert, at @glessard's request (swiftlang#262 (comment))."

We'd both overlooked the fact that it _can_ be called on zero because it's used for the individual words of the BinaryInteger, any of which can actually be zero.

72cd8f0
…iven `BinaryInteger`.

I resisted just deriving it from `words` at the outset because I thought surely that's naive; surely there's a more elegant if not more efficient way.  But there isn't, as far as I can tell.  And this actually isn't complicated nor all that expensive anyway, it turns out (at least for `BinaryInteger`s which natively store their values in two's complement form, and thus don't have to specially generate `words` on demand).

An important aspect of this is that what we care about is _not_ the two's complement form, which is what all the built-ins (e.g. `bitWidth`) nominally return, but that size of the _unsigned_ value that is the _magnitude_.  That's what we're actually working on; converting to its constituent decimal digits.  So the whole off-by-one problems of two's complement representation, with negative powers of two, etc, are all irrelevant really.
…, to cover fixed-width 128-bit and 256-bit integers.

These are conditionalised on the inclusion of Numberick in Package.swift, which is not the default.  Like the existing BigInt tests.

This proved useful as it did reveal a bug - one I'd actually discovered previously but apparently the fix got lost during Git patch shuffling or something - in `magnitudeBitWidth`, whereby it was forgetting to check if the current word is actually non-zero before treating it as relevant (meaning not just zero padding).  This wasn't caught by the BigInt tests because those BigInt / BigUInt are flexible-width and only use as many words as strictly necessary to contain the value.
@wadetregaskis
Copy link
Contributor Author

I added unit tests using Numberick (https://github.com/oscbyspro/Numberick) in the same vein as the ones using BigInt, which found one additional bug. Mostly the code doesn't care about fixed-width vs flexible-width but there is an underlying difference as the former may zero-pad words to the expected length, whereas the latter doesn't have any reason to (but also may have redundant, zero-padding words, I suppose - but attaswift/BigInt doesn't).

I knew what I meant when I wrote it - I got the actual description correct regarding the rounding mode - but I inverted the two letters.  T (for truncating, a la `.rounded(.towardZero)`) vs F (for flooring, a la `.rounded(.down)`).
…ap-around is intentional.

The comment might seem excessive, but in reviewing this I had already forgotten that I was using the ~0 + 1 == 0 trick, and I tried to replace '&+' with plain '+', which then crashes (thankfully I'd also had the foresight to make a unit test case which checks this exact situation).  So, for future readers who aren't even me, a comment might save them some time (or anguish). 🙂
@wadetregaskis
Copy link
Contributor Author

@iCharlesHu & @glessard, I think this is just waiting on your clarifications regarding your earlier comments. Can you take a quick look, please?

@wadetregaskis
Copy link
Contributor Author

So… should I drop this PR? It seems like it's not going anywhere. 😞

I'm not sure how best to put this; I don't want to offend… this was my first attempt to contribute to an Apple open source project, and I had quite some enthusiasm towards some further contributions, but now I think that would be a bad idea.

This is fundamentally a trivially bug fix, for something that was really obviously broken for a long time. It ended up being many days of work for me - which would have been fine, it was educational at least, iff it actually went through. Doing all that work to no actual avail is disheartening.

Copy link
Contributor

@itingliu itingliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @wadetregaskis I was absorbed in my own work and this just slipped my mind. This looks good to merge to me. I'm going to kick off another CI run and then I'll merge it

If @iCharlesHu or @glessard has further comments we can always address them in a follow up PR

@itingliu
Copy link
Contributor

@swift-ci please test

@parkera
Copy link
Contributor

parkera commented Oct 31, 2023

This is weird, but I doubt it's related to this PR:

[889/956] Compiling FoundationEssentials Predicate.swift
Swift/UnsafeRawPointer.swift:421: Fatal error: load from misaligned raw pointer
Current stack trace:

@parkera parkera merged commit 534a60d into swiftlang:main Oct 31, 2023
1 of 2 checks passed
@parkera
Copy link
Contributor

parkera commented Oct 31, 2023

Thank you so much @wadetregaskis! I really appreciate this contribution and the detailed work you put into it.

@stephentyrone
Copy link
Contributor

Yeah, I don't see anyway for this change to manifest that error.

oscbyspro added a commit to oscbyspro/swift-foundation that referenced this pull request Dec 11, 2023
- [IntegerFormatStyle] I removed the trapping conversion to `Int`. IntegerFormatStyle can format big integers since (swiftlang#262).
- [FloatingPointFormatStyle] I removed the rounding conversion to `Double`. `formatted()` now does whatever `format(_:)` does.
- [Decimal.FormatStyle] N/A (there were no conversions here).
itingliu pushed a commit that referenced this pull request Dec 21, 2023
* Improved integer and floating point `formatted()` methods.

- [IntegerFormatStyle] I removed the trapping conversion to `Int`. IntegerFormatStyle can format big integers since (#262).
- [FloatingPointFormatStyle] I removed the rounding conversion to `Double`. `formatted()` now does whatever `format(_:)` does.
- [Decimal.FormatStyle] N/A (there were no conversions here).

* Reenabled "Decimal Tests" in NumberFormatStyleTests.swift.

- IntegerFormatStyle big integer tests succeed.
- IntegerFormatStyle.Attributed big integer tests fail (output is clamped to `Int64`).

* Fixes IntegerFormatStyle.Attributed.

- Added a numeric string representation case to ICUNumberFormatter.Value.
- IntegerFormatStyle.Attributed now uses the above instead of `Int64(clamping:)`.

* Removed conversions to Decimal in each integer format style (#186).

BinaryInteger's numeric string representation supersedes Decimal in the following cases:

1. IntegerFormatStyle.
2. integerFormatStyle.Attributed.
3. IntegerFormatStyle.Currency.
4. IntegerFormatStyle.Percent.

* Check whether numeric string is zero using Double.

The numeric string format permits redundant zeros (like `+00.00`).

* Removed `isZero` and `doubleValue` from `ICUNumberFormatter.Value`.

Both `isZero` and `doubleValue` were used in `ByteCountFormatStyle`. These values are now taken from `FormatInput` (`Int64`) instead. Removing them from `ICUNumberFormatter.Value` makes it easier to accommodate non-numeric payloads such as strings, which can be used to format arbitrary precision numbers.

* Added `_format(_:doubleValue:)` to `ByteCountFormatStyle`.

Here's an internal method simliar to the `_format(_:)` method removed earlier because  wants its method back! I removed the first method to accommodate `ICUNumberFormatter.Value` cases that cannot implement `doubleValue`. The new method parameterizes the conversion instead, so you can call it whenever conversions to `Double` are possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants