Add 128-bit SIMD implementation for LoongArch #592

Open: heiher wants to merge 1 commit into master from the loong-lsx branch

Contributor

heiher commented Nov 29, 2024

No description provided.

Contributor Author

heiher commented Nov 29, 2024

Benchmarks

Generic

test clone_from_large               ... bench:      16,246.92 ns/iter (+/- 5.70)
test clone_from_small               ... bench:         164.88 ns/iter (+/- 0.12)
test clone_large                    ... bench:      16,299.96 ns/iter (+/- 6.84)
test clone_small                    ... bench:         177.26 ns/iter (+/- 0.38)
test grow_insert_foldhash_highbits  ... bench:      35,765.81 ns/iter (+/- 216.20)
test grow_insert_foldhash_random    ... bench:      40,165.09 ns/iter (+/- 169.35)
test grow_insert_foldhash_serial    ... bench:      37,015.20 ns/iter (+/- 133.06)
test grow_insert_std_highbits       ... bench:      66,583.71 ns/iter (+/- 225.80)
test grow_insert_std_random         ... bench:      67,088.86 ns/iter (+/- 276.24)
test grow_insert_std_serial         ... bench:      66,307.29 ns/iter (+/- 223.76)
test insert_erase_foldhash_highbits ... bench:      48,908.50 ns/iter (+/- 105.22)
test insert_erase_foldhash_random   ... bench:      50,601.87 ns/iter (+/- 49.88)
test insert_erase_foldhash_serial   ... bench:      48,916.41 ns/iter (+/- 290.09)
test insert_erase_std_highbits      ... bench:      74,941.33 ns/iter (+/- 72.91)
test insert_erase_std_random        ... bench:      76,867.05 ns/iter (+/- 74.93)
test insert_erase_std_serial        ... bench:      74,572.08 ns/iter (+/- 126.56)
test insert_foldhash_highbits       ... bench:      34,951.05 ns/iter (+/- 60.47)
test insert_foldhash_random         ... bench:      34,288.96 ns/iter (+/- 30.60)
test insert_foldhash_serial         ... bench:      34,718.45 ns/iter (+/- 32.14)
test insert_std_highbits            ... bench:      41,719.58 ns/iter (+/- 24.72)
test insert_std_random              ... bench:      42,008.54 ns/iter (+/- 25.80)
test insert_std_serial              ... bench:      41,436.19 ns/iter (+/- 117.69)
test iter_foldhash_highbits         ... bench:       1,518.42 ns/iter (+/- 3.76)
test iter_foldhash_random           ... bench:       1,529.57 ns/iter (+/- 1.72)
test iter_foldhash_serial           ... bench:       1,529.15 ns/iter (+/- 2.03)
test iter_std_highbits              ... bench:       1,525.97 ns/iter (+/- 4.45)
test iter_std_random                ... bench:       1,530.63 ns/iter (+/- 3.31)
test iter_std_serial                ... bench:       1,533.28 ns/iter (+/- 0.99)
test lookup_fail_foldhash_highbits  ... bench:       4,073.06 ns/iter (+/- 13.61)
test lookup_fail_foldhash_random    ... bench:       5,019.63 ns/iter (+/- 14.34)
test lookup_fail_foldhash_serial    ... bench:       4,226.26 ns/iter (+/- 3.75)
test lookup_fail_std_highbits       ... bench:      17,037.26 ns/iter (+/- 25.19)
test lookup_fail_std_random         ... bench:      17,169.37 ns/iter (+/- 24.74)
test lookup_fail_std_serial         ... bench:      17,318.23 ns/iter (+/- 31.63)
test lookup_foldhash_highbits       ... bench:       5,150.79 ns/iter (+/- 45.27)
test lookup_foldhash_random         ... bench:       5,873.21 ns/iter (+/- 14.38)
test lookup_foldhash_serial         ... bench:       5,163.76 ns/iter (+/- 15.43)
test lookup_std_highbits            ... bench:      16,598.58 ns/iter (+/- 19.18)
test lookup_std_random              ... bench:      17,156.51 ns/iter (+/- 68.13)
test lookup_std_serial              ... bench:      16,593.22 ns/iter (+/- 31.29)
test rehash_in_place                ... bench:     256,138.54 ns/iter (+/- 3,908.63)
test insert                         ... bench:      12,309.82 ns/iter (+/- 42.69)
test insert_unique_unchecked        ... bench:       8,472.15 ns/iter (+/- 34.36)
test set_ops_bit_and                ... bench:       9,474.08 ns/iter (+/- 13.72)
test set_ops_bit_and_assign         ... bench:       5,843.68 ns/iter (+/- 3.27)
test set_ops_bit_or                 ... bench:      66,709.64 ns/iter (+/- 141.58)
test set_ops_bit_or_assign          ... bench:      48,139.63 ns/iter (+/- 58.48)
test set_ops_bit_xor                ... bench:      80,215.23 ns/iter (+/- 135.77)
test set_ops_bit_xor_assign         ... bench:      50,767.75 ns/iter (+/- 71.79)
test set_ops_sub_assign_large_small ... bench:      50,776.08 ns/iter (+/- 24.68)
test set_ops_sub_assign_small_large ... bench:       6,480.28 ns/iter (+/- 10.76)
test set_ops_sub_large_small        ... bench:      79,737.58 ns/iter (+/- 108.30)
test set_ops_sub_small_large        ... bench:       1,626.65 ns/iter (+/- 1.83)

LSX

test clone_from_large               ... bench:      17,020.11 ns/iter (+/- 12.55)
test clone_from_small               ... bench:         164.59 ns/iter (+/- 1.28)
test clone_large                    ... bench:      16,157.96 ns/iter (+/- 19.42)
test clone_small                    ... bench:         175.90 ns/iter (+/- 0.69)
test grow_insert_foldhash_highbits  ... bench:      38,915.42 ns/iter (+/- 211.70)
test grow_insert_foldhash_random    ... bench:      42,270.14 ns/iter (+/- 202.65)
test grow_insert_foldhash_serial    ... bench:      42,175.09 ns/iter (+/- 331.12)
test grow_insert_std_highbits       ... bench:      62,644.02 ns/iter (+/- 82.69)
test grow_insert_std_random         ... bench:      63,413.93 ns/iter (+/- 164.99)
test grow_insert_std_serial         ... bench:      62,606.00 ns/iter (+/- 242.10)
test insert_erase_foldhash_highbits ... bench:      52,709.44 ns/iter (+/- 85.18)
test insert_erase_foldhash_random   ... bench:      55,108.41 ns/iter (+/- 42.90)
test insert_erase_foldhash_serial   ... bench:      53,624.53 ns/iter (+/- 50.03)
test insert_erase_std_highbits      ... bench:      76,772.58 ns/iter (+/- 70.18)
test insert_erase_std_random        ... bench:      78,144.83 ns/iter (+/- 54.28)
test insert_erase_std_serial        ... bench:      75,835.83 ns/iter (+/- 60.52)
test insert_foldhash_highbits       ... bench:      34,887.89 ns/iter (+/- 68.06)
test insert_foldhash_random         ... bench:      34,145.26 ns/iter (+/- 29.63)
test insert_foldhash_serial         ... bench:      34,312.58 ns/iter (+/- 24.92)
test insert_std_highbits            ... bench:      43,286.27 ns/iter (+/- 124.62)
test insert_std_random              ... bench:      43,534.09 ns/iter (+/- 127.91)
test insert_std_serial              ... bench:      43,096.26 ns/iter (+/- 74.62)
test iter_foldhash_highbits         ... bench:       1,534.54 ns/iter (+/- 7.84)
test iter_foldhash_random           ... bench:       1,528.22 ns/iter (+/- 14.07)
test iter_foldhash_serial           ... bench:       1,527.97 ns/iter (+/- 15.71)
test iter_std_highbits              ... bench:       1,530.75 ns/iter (+/- 16.90)
test iter_std_random                ... bench:       1,533.75 ns/iter (+/- 7.11)
test iter_std_serial                ... bench:       1,530.39 ns/iter (+/- 7.44)
test lookup_fail_foldhash_highbits  ... bench:       3,791.03 ns/iter (+/- 2.41)
test lookup_fail_foldhash_random    ... bench:       4,083.82 ns/iter (+/- 4.12)
test lookup_fail_foldhash_serial    ... bench:       3,805.35 ns/iter (+/- 5.77)
test lookup_fail_std_highbits       ... bench:      15,570.95 ns/iter (+/- 18.51)
test lookup_fail_std_random         ... bench:      16,111.26 ns/iter (+/- 22.75)
test lookup_fail_std_serial         ... bench:      15,648.50 ns/iter (+/- 25.71)
test lookup_foldhash_highbits       ... bench:       4,791.65 ns/iter (+/- 13.93)
test lookup_foldhash_random         ... bench:       5,268.68 ns/iter (+/- 1.47)
test lookup_foldhash_serial         ... bench:       4,844.37 ns/iter (+/- 14.97)
test lookup_std_highbits            ... bench:      15,743.58 ns/iter (+/- 16.02)
test lookup_std_random              ... bench:      16,195.89 ns/iter (+/- 28.05)
test lookup_std_serial              ... bench:      15,824.06 ns/iter (+/- 10.40)
test rehash_in_place                ... bench:     292,026.75 ns/iter (+/- 3,391.74)
test insert                         ... bench:      15,071.76 ns/iter (+/- 61.64)
test insert_unique_unchecked        ... bench:       8,962.53 ns/iter (+/- 43.17)
test set_ops_bit_and                ... bench:       9,457.59 ns/iter (+/- 18.13)
test set_ops_bit_and_assign         ... bench:       5,680.76 ns/iter (+/- 2.40)
test set_ops_bit_or                 ... bench:      67,655.00 ns/iter (+/- 119.62)
test set_ops_bit_or_assign          ... bench:      46,750.12 ns/iter (+/- 55.45)
test set_ops_bit_xor                ... bench:      83,518.35 ns/iter (+/- 139.27)
test set_ops_bit_xor_assign         ... bench:      49,245.48 ns/iter (+/- 26.45)
test set_ops_sub_assign_large_small ... bench:      49,263.67 ns/iter (+/- 43.38)
test set_ops_sub_assign_small_large ... bench:       6,511.88 ns/iter (+/- 9.04)
test set_ops_sub_large_small        ... bench:      79,666.67 ns/iter (+/- 167.73)
test set_ops_sub_small_large        ... bench:       1,524.12 ns/iter (+/- 0.57)

/// Returns a `BitMask` indicating all tags in the group which are full.
#[inline]
pub(crate) fn match_full(&self) -> BitMask {
    self.match_empty_or_deleted().invert()

Member

Would it be faster to use lsx_vmskgez_b here?

Contributor Author

Yes, it would definitely be faster to use lsx_vmskgez_b here. Good catch!
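
For readers following the exchange above, here is a minimal scalar sketch of why a single "mask of bytes >= 0" operation can stand in for `match_empty_or_deleted().invert()`. It does not call the real LSX intrinsic. The helpers `mask_ltz` and `mask_gez` are hypothetical stand-ins, and the tag encoding (`EMPTY = 0xFF`, `DELETED = 0x80`, full tags with the top bit clear) is an assumption stated here rather than taken from the diff.

```rust
// Assumed tag encoding: EMPTY and DELETED have the top bit set,
// full slots have it clear.
const EMPTY: u8 = 0b1111_1111;
const DELETED: u8 = 0b1000_0000;

/// Hypothetical scalar stand-in for a "bytes < 0" mask operation:
/// bit i is set when byte i, read as i8, is negative.
fn mask_ltz(group: &[u8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in group.iter().enumerate() {
        if (b as i8) < 0 {
            mask |= 1 << i;
        }
    }
    mask
}

/// Hypothetical scalar stand-in for a "bytes >= 0" mask operation:
/// bit i is set when byte i, read as i8, is non-negative.
fn mask_gez(group: &[u8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &b) in group.iter().enumerate() {
        if (b as i8) >= 0 {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let mut group = [EMPTY; 16];
    group[0] = 0x12; // full
    group[3] = DELETED;
    group[5] = 0x7f; // full

    // "Empty or deleted, then invert" and the direct "full" mask agree.
    let via_invert = !mask_ltz(&group);
    let direct = mask_gez(&group);
    assert_eq!(via_invert, direct);
    assert_eq!(direct, 0b0000_0000_0010_0001);
}
```

Under these assumptions the direct mask saves the separate invert step, which is the saving the review comment is asking about.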

Contributor Author

heiher commented Nov 29, 2024

Blocked by rust-lang/rust#133249

} else if #[cfg(all(
    target_arch = "loongarch64",
    target_feature = "lsx",
    not(miri),

Member

This should be under the "nightly" feature until loongarch intrinsics are stabilized.

Contributor

clarfonthey commented Nov 29, 2024

I was also going to mention this; that way LoongArch still works on stable. It would also be nice to have a new release, though, so this can be used for the libstd implementation.

Contributor Author

Thanks. It's done.
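
As a rough illustration of the gating requested above, the sketch below adds a `feature = "nightly"` condition to the same predicates shown in the diff. The function `selected_backend` and the strings it returns are hypothetical and exist only for this example. The actual change presumably lives inside the crate's `cfg_if!` chain, which is not reproduced here.

```rust
/// Reports which group implementation this build would select.
/// The names "lsx" and "generic" are illustrative labels only.
#[allow(unexpected_cfgs)] // `feature = "nightly"` is only declared in a real Cargo manifest
fn selected_backend() -> &'static str {
    if cfg!(all(
        feature = "nightly",
        target_arch = "loongarch64",
        target_feature = "lsx",
        not(miri),
    )) {
        // Only reachable on a LoongArch build with LSX and the nightly feature on.
        "lsx"
    } else {
        // Stable toolchains and all other targets fall back to the portable path.
        "generic"
    }
}

fn main() {
    println!("backend: {}", selected_backend());
}
```

With this shape, a stable build on loongarch64 never references the unstable intrinsics and still compiles using the generic path.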

Comment on lines +33 to +44
pub(crate) const fn static_empty() -> &'static [Tag; Group::WIDTH] {
    #[repr(C)]
    struct AlignedTags {
        _align: [Group; 0],
        tags: [Tag; Group::WIDTH],
    }
    const ALIGNED_TAGS: AlignedTags = AlignedTags {
        _align: [],
        tags: [Tag::EMPTY; Group::WIDTH],
    };
    &ALIGNED_TAGS.tags
}

Contributor

Depending on the outcome of #596 you may want to add that change here as well.
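
For context on the snippet under review, here is a self-contained sketch of the zero-length-array alignment trick it relies on. The `Group` type below is a hypothetical 16-byte-aligned stand-in for the real SIMD vector type, and plain `u8` with `0xFF` replaces `Tag` and `Tag::EMPTY`. Both substitutions are assumptions made so the example compiles on any target.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for a 128-bit SIMD group type.
#[allow(dead_code)]
#[repr(C, align(16))]
struct Group([u8; 16]);

const WIDTH: usize = size_of::<Group>();

#[repr(C)]
struct AlignedTags {
    /// Occupies zero bytes but gives the struct `Group`'s alignment.
    _align: [Group; 0],
    tags: [u8; WIDTH],
}

/// Returns a statically allocated, group-aligned array of "empty" tags.
const fn static_empty() -> &'static [u8; WIDTH] {
    const ALIGNED_TAGS: AlignedTags = AlignedTags {
        _align: [],
        tags: [0xFF; WIDTH],
    };
    &ALIGNED_TAGS.tags
}

fn main() {
    // The zero-length array adds no size, only alignment.
    assert_eq!(size_of::<AlignedTags>(), 16);
    assert_eq!(align_of::<AlignedTags>(), 16);

    let tags = static_empty();
    assert_eq!(tags.as_ptr() as usize % 16, 0);
    assert!(tags.iter().all(|&t| t == 0xFF));
}
```

The point of the layout is that a full-width aligned load from the returned array is valid without storing an actual `Group` value in the constant.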

Contributor

bors commented Dec 9, 2024

☔ The latest upstream changes (presumably #597) made this pull request unmergeable. Please resolve the merge conflicts.

heiher force-pushed the loong-lsx branch 2 times, most recently from fdeb358 to 4e19405, on December 13, 2024 04:00

Contributor Author

heiher commented Dec 13, 2024

> Blocked by rust-lang/rust#133249

CI is green now.
