Implement @depositBits and @extractBits #23474
base: master
Conversation
This commit encompasses the progress made towards the implementation of these builtins in ziglang#18680.
Currently limited to integers of at most 64 bits due to implementation problems (see ziglang#19991). Since these builtins were implemented to enable bootstrapping the compiler, this is sufficient for current purposes.
Necessary as the new @depositBits and @extractBits builtins are used by the compiler.
Re-implements std.math.big.int.{deposit,extract}Bits to use the relevant builtins. This should improve the speed of @depositBits and @extractBits at comptime by orders of magnitude on supported architectures. It also cleans up the code nicely :)
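For readers unfamiliar with these builtins: `@depositBits` and `@extractBits` follow the semantics of the x86 `pdep`/`pext` instructions. A minimal reference sketch in C (function names here are illustrative, not from the PR):

```c
#include <stdint.h>

/* Sketch of @depositBits (pdep): scatter the low bits of `source`
 * into the positions of the set bits of `mask`. */
static uint64_t deposit_bits(uint64_t source, uint64_t mask) {
    uint64_t result = 0;
    uint64_t bb = 1; /* walks the low bits of `source` */
    while (mask != 0) {
        uint64_t bit = mask & ~(mask - 1); /* lowest set bit of mask */
        mask &= mask - 1;                  /* clear it */
        if (source & bb) result |= bit;
        bb <<= 1;
    }
    return result;
}

/* Sketch of @extractBits (pext): gather the bits of `source` selected
 * by the set bits of `mask` into the low bits of the result. */
static uint64_t extract_bits(uint64_t source, uint64_t mask) {
    uint64_t result = 0;
    uint64_t bb = 1;
    while (mask != 0) {
        uint64_t bit = mask & ~(mask - 1);
        mask &= mask - 1;
        if (source & bit) result |= bb;
        bb <<= 1;
    }
    return result;
}
```

For example, extracting with mask `0xF0` pulls bits 4–7 of the source down into bits 0–3.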
This code would cause an overflow in Sema, with a minimal reproduction: `@depositBits(1, 3 << 63)`. These implementations under big.int should be scrutinised for any other buffer overflows, although I believe this is the only one.
Adds randomly-generated test cases for the builtins. These cases have a number of different integer sizes. This has helped to uncover a few issues with the current implementation, which will be fixed in subsequent commits. Additionally skips tests for Windows with the x86 backend, as we are not providing support until progress is made on the self-hosted COFF linker.
This replaces the old (and broken) use of the old API with the new `select` API. Windows is not supported at all by this API, however the original old API implementation never worked properly anyway, and the correct course of action is to wait for the COFF linker to be fixed. TODO: This currently has a hacky copy-paste in compiler-rt. Wait for alexrp's PR. Co-authored-by: David Rubin <[email protected]>
There's currently a copy-paste in compiler-rt waiting for the required functions to be moved to
There is scope for a follow-up PR to optimise the codegen for
There's probably more issues, but they will be easier to see with these more obvious things fixed first.
```zig
.@"x86.bmi.pext.32" = .{
    .ret_len = 1,
    .params = &.{
        .{ .kind = .{ .type = .i32 } },
        .{ .kind = .{ .type = .i32 } },
        .{ .kind = .{ .type = .i32 } },
    },
    .attrs = &.{ .nocallback, .nofree, .nosync, .nounwind, .{ .memory = Attribute.Memory.all(.none) } },
},
```
Suggested change:

```zig
.@"x86.bmi.pext" = .{
    .ret_len = 1,
    .params = &.{
        .{ .kind = .overloaded },
        .{ .kind = .{ .matches = 0 } },
        .{ .kind = .{ .matches = 0 } },
    },
    .attrs = &.{ .nocallback, .nofree, .nosync, .nounwind, .willreturn, .{ .memory = .all(.none) } },
},
```
etc.
Oh right, these don't follow the correct pattern, I completely missed the missing i
...
Besides the other changes here, aren't you right to add `.willreturn` though?
Addresses requested changes in PR.
I think I've addressed those all now :) I'll take a look at any feedback you've given tomorrow. Thank you for the help so far! Also it'd probably be a good idea to give the workflows approval in case any tests aren't passing on platforms I haven't been able to test for :) (Although I imagine you don't want to be running a third-party zig1.wasm on the CI?)
Fixed the faulty assumption that the source and mask would have the same number of limbs in the
Overflow arithmetic and {deposit,extract}Bits can share the same function to generate their builtin calls.
```zig
    return pext_uX(u128, source, mask);
}

// BEGIN HACKY CODE COPY WAIT FOR ALEXRP PR
```
The x86_64 backend impl is looking much more solid, so I looked around a bit more at the other files.
```zig
},
.dst_temps = .{ .{ .mut_rc = .{ .rc = .general_purpose, .ref = .src1 } }, .unused },
.each = .{
    .once = if (mir_tag == .pext) &.{
```
These are begging to be `switch`es with explicit mir tags in each case for clarity.
```zig
const needs_extend = bits != compiler_rt_bits;
const extended_ty = if (needs_extend) try o.builder.intType(compiler_rt_bits) else llvm_ty;

const params = .{
    if (needs_extend) try self.wip.cast(.zext, source, extended_ty, "") else source,
    if (needs_extend) try self.wip.cast(.zext, mask, extended_ty, "") else mask,
};

const result = try self.wip.callIntrinsic(
    .normal,
    .none,
    intrinsic,
    &.{},
    &params,
    "",
);

return if (needs_extend) try self.wip.cast(.trunc, result, llvm_ty, "") else result;
```
`cast` already handles this logic.
Suggested change:

```zig
const extended_ty = try o.builder.intType(compiler_rt_bits);
const result = try self.wip.callIntrinsic(
    .normal,
    .none,
    intrinsic,
    &.{},
    &.{
        try self.wip.cast(.zext, source, extended_ty, ""),
        try self.wip.cast(.zext, mask, extended_ty, ""),
    },
    "",
);
return try self.wip.cast(.trunc, result, llvm_ty, "");
```
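The pattern under review (zero-extend operands to the intrinsic's width, call, truncate the result back, where the casts are no-ops when widths already match) can be sketched in C. The software `pext_u32` here is an illustrative stand-in for the fixed-width intrinsic, not code from the PR:

```c
#include <stdint.h>

/* Software stand-in for a fixed-width pext intrinsic (illustrative). */
static uint32_t pext_u32(uint32_t source, uint32_t mask) {
    uint32_t result = 0, bb = 1;
    while (mask != 0) {
        uint32_t bit = mask & ~(mask - 1);
        mask &= mask - 1;
        if (source & bit) result |= bb;
        bb <<= 1;
    }
    return result;
}

/* The widening pattern: operate on a u8 by zero-extending to the
 * intrinsic's width, calling it, then truncating back. When the widths
 * already match, the zext/trunc are no-ops, which is why a width-aware
 * `cast` helper can absorb the `needs_extend` branching. */
static uint8_t pext_u8(uint8_t source, uint8_t mask) {
    uint32_t wide = pext_u32((uint32_t)source, (uint32_t)mask); /* zext, call */
    return (uint8_t)wide;                                       /* trunc */
}
```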
```zig
    else => {},
}

return try self.genDepositExtractBitsEmulated(tag, bits, source, mask, llvm_ty);
```
Why does this function exist? It only has one call site and seems to just want most of the locals by the same name.
```c
        mask &= ~bit;\
        uint##w##_t source_bit = source & bit;\
        if (source_bit != 0) result |= bb;\
        bb += bb;\
```
No need to obscure that you are doing bit manipulation.
Suggested change:

```diff
-        bb += bb;\
+        bb <<= 1; \
```
```c
    \
    while (mask != 0) {\
        uint##w##_t bit = mask & ~(mask - 1);\
        mask &= ~bit;\
```
No need to extend this dependency chain when you can reuse an earlier expression.
Suggested change:

```diff
-        mask &= ~bit;\
+        mask &= mask - 1; \
```
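A quick check of the reviewer's point: with `bit = mask & ~(mask - 1)` (the lowest set bit), `mask &= ~bit` and `mask &= mask - 1` clear the same bit; the latter simply reuses an already-computed expression instead of extending the dependency chain. A small sketch:

```c
#include <stdint.h>

/* Clear the lowest set bit via an explicitly isolated bit. */
static uint64_t clear_lowest_via_bit(uint64_t mask) {
    uint64_t bit = mask & ~(mask - 1); /* isolate lowest set bit */
    return mask & ~bit;
}

/* Clear the lowest set bit directly, reusing `mask - 1`. */
static uint64_t clear_lowest_direct(uint64_t mask) {
    return mask & (mask - 1);
}
```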
```zig
assert(mask.positive);

result.positive = true;
@memset(result.limbs, 0);
```
The running time of these functions should not depend on how oversized `result` might be, or on the size of one of `source` and `mask` relative to the other.
Could you clarify this, please?
You could pass in a `result` with a billion limbs, and the `@memset` would require each one to be set to 0, and then the `normalize` at the end would have to iterate each and every one to make sure they are still 0. Instead, you should only access the prefix of the limbs that might possibly be nonzero, which certainly does not exceed the number of limbs in the mask, among other constraints.
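The bound the reviewer describes can be made concrete: `extractBits` produces at most `popcount(mask)` significant bits, so only a small prefix of the result limbs can ever be nonzero. A hedged sketch of computing that bound (limb width and function name are illustrative assumptions):

```c
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 64 /* assumed limb width for illustration */

/* Upper bound on nonzero result limbs for extractBits: the result has at
 * most popcount(mask) significant bits, so at most
 * ceil(popcount(mask) / LIMB_BITS) limbs need to be written or normalized. */
static size_t max_result_limbs(const uint64_t *mask_limbs, size_t mask_len) {
    size_t total_bits = 0;
    for (size_t i = 0; i < mask_len; i++)
        total_bits += (size_t)__builtin_popcountll(mask_limbs[i]);
    return (total_bits + LIMB_BITS - 1) / LIMB_BITS;
}
```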
I see, I guess we could memset as we iterate the limbs instead, so we only memset the limbs we actually write to? Or we could memset based on the length of the source or mask? Not sure which approach would be preferable here, but I like the first?
In fact, for depositBits a memset isn't necessary at all; we can just use `=` instead of `|=` on the limb.
Both impls only ever access the limbs monotonically, so you just store the current limb in a local variable, and then when you need to move on to the next limb, you store the local into the next element and track how many initialized elements there are, and then the `normalize` only ever has to check the actually initialized limbs for being zero.
```zig
var source_limb = source.limbs[shift_limbs] >> shift_bits;
if (shift_bits != 0 and shift_limbs + 1 < source.limbs.len) {
    source_limb += source.limbs[shift_limbs + 1] << @intCast(limb_bits - shift_bits);
```
Again, no need to hide bit manipulation.
Suggested change:

```diff
-    source_limb += source.limbs[shift_limbs + 1] << @intCast(limb_bits - shift_bits);
+    source_limb |= source.limbs[shift_limbs + 1] << @intCast(limb_bits - shift_bits);
```
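The cross-limb read in question combines the high bits of one limb with the low bits of the next; since the two parts occupy disjoint bit ranges, `|` expresses the merge without the carry semantics `+` would imply. A standalone sketch (names assumed for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit window starting `shift_bits` into limb `i`: the top of
 * limbs[i] and, if the window straddles a limb boundary, the bottom of
 * limbs[i + 1]. The two halves never overlap, so OR-ing is carry-free. */
static uint64_t read_window(const uint64_t *limbs, size_t len,
                            size_t i, unsigned shift_bits) {
    uint64_t word = limbs[i] >> shift_bits;
    if (shift_bits != 0 && i + 1 < len)
        word |= limbs[i + 1] << (64 - shift_bits);
    return word;
}
```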
```zig
    shift += @intCast(@popCount(mask_limb));
}

result.normalize(result.limbs.len);
```
Again, you have a much tighter bound on the number of possible non-zero limbs than this unbounded input.
```zig
result_limb.* |= pdep_limb;

shift += @intCast(@popCount(mask_limb));
```
Surely we can assume that `@popCount` of a `usize` fits in a `usize`.
Suggested change:

```diff
-shift += @intCast(@popCount(mask_limb));
+shift += @popCount(mask_limb);
```
```diff
@@ -2960,6 +2960,59 @@ fn popCountTest(val: *const Managed, bit_count: usize, expected: usize) !void {
     try testing.expectEqual(expected, val.toConst().popCount(bit_count));
 }

+test "big int extractBits" {
```
Based on how buggy the other tested big int functions have been, this needs more testing for being passed non-normalized inputs, producing normalized outputs, and using a result with fewer limbs than the source and/or mask.
Something like the gcd and bitwise or tests?
There's definitely room for improvement here relative to the existing tests, for example an `isNormalized` function and an expect that the return of any function is `true`, but anything that catches possible bugs is good enough for now.
Currently away from my computer, will get these changes addressed in a week or two :)
This PR takes #18680, updates the changes to master, and fixes the outstanding issues which prevented it from being merged.
There currently are outstanding issues related to the COFF linker which are blocking progress in the x86 backend:
.dll
I have been told that moving on and just skipping the failing tests is the appropriate course of action here.
The comptime implementation calls into `big.int.depositBits`/`extractBits`, which themselves call the corresponding builtins. This allows for significantly better performance at comptime, but obviously requires `zig1.wasm` to be updated. I understand that a core member will be required to redo this commit for me before this can be merged?

The C backend implementation for integer sizes above `u64` is blocked by #19991. I am subscribed to this issue, and once fixed, I have a work-in-progress implementation ready to be completed and included in a follow-up PR. Up to `u64`, the C backend implementation works fine, and is sufficient for bootstrapping the compiler.

I've included a number of behaviour tests to ensure that the implementations are correct.
Many thanks to @Rexicon226, @jacobly0 and others for your extensive help :)
Closes #14995