Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose algebraic floating point intrinsics #136457

Merged
merged 1 commit into from
Apr 5, 2025
Merged

Conversation

calder
Copy link
Contributor

@calder calder commented Feb 3, 2025

Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.

C++: 10us ✅

With Clang 18.1.3 and -O2 -march=haswell:

C++ Assembly
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust Assembly
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}

Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and -C opt-level=3 -C target-feature=+avx2,+fma:

Rust Assembly
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

Proposed Change

Add core::intrinsics::f*_algebraic wrappers to f16, f32, f64, and f128 gated on a new float_algebraic feature.

Alternatives Considered

#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

References

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux

@rustbot
Copy link
Collaborator

rustbot commented Feb 3, 2025

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @thomcc (or someone else) some time within the next two weeks.

Please see the contribution instructions for more information. Namely, in order to ensure the minimum review times lag, PR authors and assigned reviewers should ensure that the review label (S-waiting-on-review and S-waiting-on-author) stays updated, invoking these commands when appropriate:

  • @rustbot author: the review is finished, PR author should check the comments and take action accordingly
  • @rustbot review: the author is ready for a review, this PR will be queued again in the reviewer's queue

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 3, 2025
@rustbot
Copy link
Collaborator

rustbot commented Feb 3, 2025

Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter
gets adapted for the changes, if necessary.

cc @rust-lang/miri, @rust-lang/wg-const-eval

@calder
Copy link
Contributor Author

calder commented Feb 3, 2025

Thanks for the speedy review @saethlin! I had a couple questions:

  • Should we add these methods to f16 and f128 as well?
  • Are intrinsic wrappers tested explicitly? I see tests for the intrinsics themselves but not for other wrappers.

@RalfJung
Copy link
Member

RalfJung commented Feb 3, 2025 via email

@RalfJung
Copy link
Member

RalfJung commented Feb 3, 2025 via email

@calder
Copy link
Contributor Author

calder commented Feb 3, 2025

@RalfJung: Thanks for the quick response!

@RalfJung
Copy link
Member

RalfJung commented Feb 3, 2025

https://doc.rust-lang.org/nightly/std/primitive.f32.html seems like a good place, that's where we already document everything special around NaNs. So a new section on algebraic operations there would probably be a good fit.

@calder
Copy link
Contributor Author

calder commented Feb 4, 2025

Changes

  • Added f16 and f128 methods as requested by @tgross35 here.
  • Renamed to algebraic_*() to match checked_*() as requested by @tgross35 here.
  • Added central documentation with verbage suggested by @RalfJung here.

calder added a commit to calder/rust that referenced this pull request Feb 4, 2025
@tgross35
Copy link
Contributor

tgross35 commented Feb 4, 2025

I don't see an existing codegen test for the intrinsics so these should probably get one. https://github.com/rust-lang/rust/tree/3f33b30e19b7597a3acbca19e46d9e308865a0fe/tests/codegen/float would be a reasonable home.

For reference, this would just be a file containing functions like this for each of the new methods, in order to verify the flags that we expect are getting set:

// CHECK-LABEL: float @f32_algebraic_add(
#[no_mangle]
pub fn f32_algebraic_add(a: f32, b: f32) -> f32 {
    // CHECK: fadd reassoc nsz arcp contract float %a, %b
    a.algebraic_add(b)
}

@rustbot
Copy link
Collaborator

rustbot commented Feb 5, 2025

The Miri subtree was changed

cc @rust-lang/miri

// CHECK-LABEL: fp128 @f128_algebraic_add(
#[no_mangle]
pub fn f128_algebraic_add(a: f128, b: f128) -> f128 {
// CHECK: fadd reassoc nsz arcp contract fp128 {{(%a, %b)|(%b, %a)}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition and multiplication cases both end up as %b, %a rather than %a, %b which surprised me but isn't incorrect. I opted to allow either in case behavior changes in the future.

@tgross35
Copy link
Contributor

tgross35 commented Feb 5, 2025

This looks pretty reasonable to me but all the public functions should get some examples. I think it may also be good to give a small demo of how this may work at the end of the new "Algebraic operators" section, e.g.:

For example, the below:

```
x = x.algebraic_add(x, a.algebraic_mul(b));
```

May be rewritten as either of the following:

```
x = x + (a * b); // As written
x = (a * b) + x; // Reordered to allow using a single `fma`
```

@calder
Copy link
Contributor Author

calder commented Feb 5, 2025

Per function examples

Did you have specific examples in mind? I'm struggling to think of ones that aren't repetitive / low signal-to-noise (i.e. assert this algebraic add is approximately equal to a normal add). Should we instead link to the central documentation and test approximate equality of each of the ops in normal tests like this so they don't clutter the documentation?

Central example

Made a couple small edits for brevity. How's this look?

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

@calder
Copy link
Contributor Author

calder commented Apr 3, 2025

@RalfJung: Leaving Miri behavior as is since we're so close, but thanks!

@tgross35: One more try? (Let me know if there's a way to run all these tests locally to speed up the iteration process)

@RalfJung
Copy link
Member

RalfJung commented Apr 3, 2025

@bors try

You can run this locally with ./x miri std -- algebraic.

@bors
Copy link
Collaborator

bors commented Apr 3, 2025

⌛ Trying commit 9602bbb with merge 7a97f5e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 3, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
@calder
Copy link
Contributor Author

calder commented Apr 3, 2025

Yeah, that's what I've been running but the last failure was on x86_64-gnu-aux not Miri. Is there a good way to run the full suite that Bors is running?

@RalfJung
Copy link
Member

RalfJung commented Apr 3, 2025

the last failure was on x86_64-gnu-aux not Miri.

x86_64-gnu-aux is the name of a test runner. That runner runs many tests. Among them, it runs ./x miri.

@calder
Copy link
Contributor Author

calder commented Apr 3, 2025

Gotcha, thanks! Confirmed I could reproduce with:

MIRIFLAGS="-Zmiri-many-seeds" ./x miri std -- algebraic

which is now passing at 9602bbb.

@bors
Copy link
Collaborator

bors commented Apr 3, 2025

☀️ Try build successful - checks-actions
Build commit: 7a97f5e (7a97f5ea73db9a162b151c4dda7cde3c8fc1afc0)

@tgross35
Copy link
Contributor

tgross35 commented Apr 4, 2025

Could the tests get something like let approx = if cfg!(miri) { 1e-3 } else { 0.0 }; and use that for the assert_approx_eq! tolerance? Similar to #138062 the API doesn't guarantee much so it makes sense to have the imprecision in Miri. This current implementation should be exact though so it would be nice if the internal tests flag any problems there.

@RalfJung
Copy link
Member

RalfJung commented Apr 4, 2025

Such logic should be accompanied by comments explaining that this test does not imply any form of guarantee, it is merely a sentinel for us to notice when the real precision of these operations goes down.

@calder
Copy link
Contributor Author

calder commented Apr 4, 2025

Updated all tests to assert exact equality outside of MIRI.

To make this work I had to change assert_approx_eq() from diff < epsilon to diff <= epsilon:

--- a/library/std/tests/floats/lib.rs
+++ b/library/std/tests/floats/lib.rs
@@ -10,7 +10,7 @@ macro_rules! assert_approx_eq {
         let (a, b) = (&$a, &$b);
         let diff = (*a - *b).abs();
         assert!(
-            diff < $lim,
+            diff <= $lim,
             "{a:?} is not approximately equal to {b:?} (threshold {lim:?}, difference {diff:?})",
             lim = $lim
         );

266ca8b#diff-a4ab8b2578f382b20890fc918f6cdbffcaa057078f8793f0d974c011003a2b01R13

If we don't want to loosen that we'll need to do something like this instead:

--- a/library/std/tests/floats/f32.rs
+++ b/library/std/tests/floats/f32.rs
@@ -926,13 +926,17 @@ fn test_algebraic() {
     // This is a check of current implementations and does NOT imply any form of
     // guarantee about future behavior. The compiler reserves the right to make
     // these operations inexact matches in the future.
-    let eps_add = if cfg!(miri) { 1e-3 } else { 0.0 };
-    let eps_mul = if cfg!(miri) { 1e-1 } else { 0.0 };
-    let eps_div = if cfg!(miri) { 1e-4 } else { 0.0 };
-
-    assert_approx_eq!(a.algebraic_add(b), a + b, eps_add);
-    assert_approx_eq!(a.algebraic_sub(b), a - b, eps_add);
-    assert_approx_eq!(a.algebraic_mul(b), a * b, eps_mul);
-    assert_approx_eq!(a.algebraic_div(b), a / b, eps_div);
-    assert_approx_eq!(a.algebraic_rem(b), a % b, eps_div);
+    if cfg!(miri) {
+        assert_approx_eq!(a.algebraic_add(b), a + b, 1e-3);
+        assert_approx_eq!(a.algebraic_sub(b), a - b, 1e-3);
+        assert_approx_eq!(a.algebraic_mul(b), a * b, 1e-1);
+        assert_approx_eq!(a.algebraic_div(b), a / b, 1e-4);
+        assert_approx_eq!(a.algebraic_rem(b), a % b, 1e-4);
+    } else {
+        assert_eq!(a.algebraic_add(b), a + b);
+        assert_eq!(a.algebraic_sub(b), a - b);
+        assert_eq!(a.algebraic_mul(b), a * b);
+        assert_eq!(a.algebraic_div(b), a / b);
+        assert_eq!(a.algebraic_rem(b), a % b);
+    }
 }

@tgross35
Copy link
Contributor

tgross35 commented Apr 4, 2025

Latest changes lgtm. r=me with a squash and a passing try

@bors try

@tgross35 tgross35 assigned tgross35 and unassigned thomcc Apr 4, 2025
@bors
Copy link
Collaborator

bors commented Apr 4, 2025

⌛ Testing commit 8ff7052 with merge 994f291efdb7969b06dff1ade574e447dbfacd91...

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 4, 2025
Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
@bors
Copy link
Collaborator

bors commented Apr 5, 2025

☀️ Try build successful - checks-actions
Build commit: 994f291 (994f291efdb7969b06dff1ade574e447dbfacd91)

@tgross35
Copy link
Contributor

tgross35 commented Apr 5, 2025

@bors r+

@bors
Copy link
Collaborator

bors commented Apr 5, 2025

📌 Commit 8ff7052 has been approved by tgross35

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 5, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2025
Rollup of 11 pull requests

Successful merges:

 - rust-lang#136457 (Expose algebraic floating point intrinsics)
 - rust-lang#137880 (Autodiff batching)
 - rust-lang#137897 (fix pthread-based tls on apple targets)
 - rust-lang#138024 (Allow optimizing out `panic_bounds_check` in Unicode checks.)
 - rust-lang#138546 (Add integer to string formatting tests)
 - rust-lang#138826 (StableMIR: Add `associated_items`.)
 - rust-lang#138950 (replace extra_filename with strict version hash in metrics file names)
 - rust-lang#139274 (Rustdoc: typecheck settings.js)
 - rust-lang#139285 (use lower case to match other error messages)
 - rust-lang#139341 (Apply `Recovery::Forbidden` when reparsing pasted macro fragments.)
 - rust-lang#139389 (make `Arguments::as_statically_known_str` doc(hidden))

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 2e4e196 into rust-lang:master Apr 5, 2025
7 checks passed
@rustbot rustbot added this to the 1.88.0 milestone Apr 5, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2025
Rollup merge of rust-lang#136457 - calder:master, r=tgross35

Expose algebraic floating point intrinsics

# Problem

A stable Rust implementation of a simple dot product is 8x slower than C++ on modern x86-64 CPUs. The root cause is an inability to let the compiler reorder floating point operations for better vectorization.

See https://github.com/calder/dot-bench for benchmarks. Measurements below were performed on a i7-10875H.

### C++: 10us ✅

With Clang 18.1.3 and `-O2 -march=haswell`:
<table>
<tr>
    <th>C++</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="cc">
float dot(float *a, float *b, size_t len) {
    #pragma clang fp reassociate(on)
    float sum = 0.0;
    for (size_t i = 0; i < len; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/739573c0-380a-4d84-9fd9-141343ce7e68" />
</td>
</tr>
</table>

### Nightly Rust: 10us ✅

With rustc 1.86.0-nightly (8239a37) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum = fadd_algebraic(sum, fmul_algebraic(a[i], b[i]));
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/9dcf953a-2cd7-42f3-bc34-7117de4c5fb9" />
</td>
</tr>
</table>

### Stable Rust: 84us ❌

With rustc 1.84.1 (e71f9a9) and `-C opt-level=3 -C target-feature=+avx2,+fma`:
<table>
<tr>
    <th>Rust</th>
    <th>Assembly</th>
</tr>
<tr>
<td>
<pre lang="rust">
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
</pre>
</td>
<td>
<img src="https://github.com/user-attachments/assets/936a1f7e-33e4-4ff8-a732-c3cdfe068dca" />
</td>
</tr>
</table>

# Proposed Change

Add `core::intrinsics::f*_algebraic` wrappers to `f16`, `f32`, `f64`, and `f128` gated on a new `float_algebraic` feature.

# Alternatives Considered

rust-lang#21690 has a lot of good discussion of various options for supporting fast math in Rust, but is still open a decade later because any choice that opts in more than individual operations is ultimately contrary to Rust's design principles.

In the mean time, processors have evolved and we're leaving major performance on the table by not supporting vectorization. We shouldn't make users choose between an unstable compiler and an 8x performance hit.

# References

* rust-lang#21690
* rust-lang/libs-team#532
* rust-lang#136469
* https://github.com/calder/dot-bench
* https://www.felixcloutier.com/x86/vfmadd132ps:vfmadd213ps:vfmadd231ps

try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-aux
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.