Skip to content

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
MoSal opened this issue Apr 13, 2025 · 3 comments
Open
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-high High priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@MoSal
Copy link

MoSal commented Apr 13, 2025

Code

I tried this code:

use std::{iter::Peekable, str::CharIndices};
use std::sync::LazyLock as Lazy;

use memchr::memmem::Finder;
use itertools::Itertools;


struct Foo<'a> {
    txt: &'a str,
    ci: Peekable<CharIndices<'a>>,
}

impl<'a> Foo<'a> {
    fn new(txt: &'a str) -> Self {
        let ci = txt.char_indices().peekable();
        Self { txt, ci }
    }

    fn next(&mut self) -> Option<(usize, usize)> {
        static FI: Lazy<Finder> = Lazy::new(|| Finder::new(b"\n"));

        // XXX: switching between these two lines makes a difference
        let &(start, _) = self.ci.peek()?;
        //let start = self.ci.peek().map(|&(idx, _)| idx)?;
        
        let matches = FI
            .find_iter(&self.txt.as_bytes()[start..])
            .map(|idx| idx + start);

        for next_match in matches {
            if self.txt.is_char_boundary(next_match) {
                self.ci.by_ref()
                    //.take_while(|&(idx, _)| idx < next_match)
                    .peeking_take_while(|&(idx, _)| idx <= next_match)
                    .for_each(drop);
                return Some((start, next_match))
            }
        }

        self.ci.by_ref().for_each(drop);
        Some((start, self.txt.len()))
    }
}

fn main() {
    let s = "1".repeat(20_000) + "\n";
    let sn = s.repeat(200_000);
    let mut v = Foo::new(&sn);
    let mut p = Vec::with_capacity(200_000);
    while let Some(r) = v.next() {
        p.push(r);
    }
}

Cargo.toml

[package]
name = "tt-fluctuations"
version = "0.1.0"
edition = "2024"

[dependencies]
itertools = "0.14.0"
memchr = "2.7.4"

[profile.release]
codegen-units = 1
lto = true
strip = true
Cargo.lock

# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4

[[package]]
name = "either"
version = "1.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719"

[[package]]
name = "itertools"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b192c782037fadd9cfa75548310488aabdbf3d2da73885b31bd0abd03351285"
dependencies = [
 "either",
]

[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"

[[package]]
name = "tt-fluctuations"
version = "0.1.0"
dependencies = [
 "itertools",
 "memchr",
]

Commits used to test:

Besides the two direct regressions. The fact that alternating between these two lines may cause
a big difference in performance is curious:

let &(start, _) = self.ci.peek()?; // called without_map in bench
let start = self.ci.peek().map(|&(idx, _)| idx)?; // with_map

But maybe that's a different issue.

Summary of hyperfine run:

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

Full numbers

Benchmark 1: ./2162e9d_with_map
  Time (mean ± σ):      3.894 s ±  0.014 s    [User: 3.202 s, System: 0.682 s]
  Range (min … max):    3.876 s …  3.909 s    4 runs

Benchmark 2: ./2162e9d_without_map
  Time (mean ± σ):      4.380 s ±  0.034 s    [User: 3.686 s, System: 0.683 s]
  Range (min … max):    4.346 s …  4.419 s    4 runs

Benchmark 3: ./934880f586f_with_map
  Time (mean ± σ):      5.388 s ±  0.020 s    [User: 4.702 s, System: 0.671 s]
  Range (min … max):    5.369 s …  5.415 s    4 runs

Benchmark 4: ./934880f586f_without_map
  Time (mean ± σ):      7.448 s ±  0.018 s    [User: 6.758 s, System: 0.666 s]
  Range (min … max):    7.423 s …  7.465 s    4 runs

Benchmark 5: ./ce36a96_with_map
  Time (mean ± σ):      4.128 s ±  0.016 s    [User: 3.448 s, System: 0.666 s]
  Range (min … max):    4.111 s …  4.149 s    4 runs

Benchmark 6: ./ce36a96_without_map
  Time (mean ± σ):      6.448 s ±  0.007 s    [User: 5.734 s, System: 0.696 s]
  Range (min … max):    6.441 s …  6.457 s    4 runs

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

So building with current nightly and

let &(start, _) = self.ci.peek()?;

is almost twice as slow than building with stable and using

let start = self.ci.peek().map(|&(idx, _)| idx)?;

Version it worked on

It most recently worked on: 2162e9d

Version with regression

First regression

rustc --version --verbose:

rustc 1.87.0-nightly (ce36a966c 2025-02-17)
binary: rustc
commit-hash: ce36a966c79e109dabeef7a47fe68e5294c6d71e
commit-date: 2025-02-17
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.0

Second regression

rustc --version --verbose:

rustc 1.88.0-nightly (934880f58 2025-04-09)
binary: rustc
commit-hash: 934880f586f6ac1f952c7090e2a943fcd7775e7b
commit-date: 2025-04-09
host: x86_64-unknown-linux-gnu
release: 1.88.0-nightly
LLVM version: 20.1.2
@MoSal MoSal added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Apr 13, 2025
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Apr 13, 2025
@jieyouxu jieyouxu added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such and removed C-bug Category: This is a bug. labels Apr 13, 2025
@apiraino
Copy link
Contributor

Assigning priority (discussion on Zulip).

Possibly related to #139370

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Apr 14, 2025
@jieyouxu jieyouxu removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 14, 2025
@lincot
Copy link
Contributor

lincot commented Apr 19, 2025

A quick run on my x86_64 comparing nightly-04-09 (d4f880f) and nightly-04-10 (934880f) showed
no difference without using map, but curiously a 1.76x slowdown with map:

Summary
  ./stable_with_map ran
    1.00 ± 0.02 times faster than ./2025-04-09_with_map
    1.00 ± 0.03 times faster than ./stable_without_map
    1.21 ± 0.02 times faster than ./2025-04-10_without_map
    1.22 ± 0.02 times faster than ./2025-04-09_without_map
    1.77 ± 0.04 times faster than ./2025-04-10_with_map
Full results
Benchmark 1: ./stable_with_map
  Time (mean ± σ):      2.480 s ±  0.044 s    [User: 2.290 s, System: 0.182 s]
  Range (min … max):    2.430 s …  2.519 s    4 runs

Benchmark 2: ./stable_without_map
  Time (mean ± σ):      2.489 s ±  0.043 s    [User: 2.308 s, System: 0.173 s]
  Range (min … max):    2.437 s …  2.525 s    4 runs

Benchmark 3: ./2025-04-09_with_map
  Time (mean ± σ):      2.486 s ±  0.034 s    [User: 2.309 s, System: 0.170 s]
  Range (min … max):    2.442 s …  2.520 s    4 runs

Benchmark 4: ./2025-04-09_without_map
  Time (mean ± σ):      3.030 s ±  0.007 s    [User: 2.854 s, System: 0.167 s]
  Range (min … max):    3.019 s …  3.035 s    4 runs

Benchmark 5: ./2025-04-10_with_map
  Time (mean ± σ):      4.381 s ±  0.039 s    [User: 4.198 s, System: 0.169 s]
  Range (min … max):    4.335 s …  4.413 s    4 runs

Benchmark 6: ./2025-04-10_without_map
  Time (mean ± σ):      3.005 s ±  0.023 s    [User: 2.817 s, System: 0.178 s]
  Range (min … max):    2.981 s …  3.033 s    4 runs

I don't see how #124810 could cause this, though.

@MoSal
Copy link
Author

MoSal commented Apr 19, 2025

@lincot

Interesting. Maybe it's another case of phantom behavior from LLVM 20!

I just redid my original full hyperfine run which included 48f89e7:

  ./2162e9d_with_map ran
    1.04 ± 0.01 times faster than ./ce36a96_with_map
    1.05 ± 0.01 times faster than ./48f89e76596_with_map
    1.10 ± 0.01 times faster than ./2162e9d_without_map
    1.32 ± 0.01 times faster than ./934880f586f_with_map
    1.54 ± 0.01 times faster than ./ce36a96_without_map
    1.55 ± 0.01 times faster than ./48f89e76596_without_map
    1.76 ± 0.02 times faster than ./934880f586f_without_map

The difference between 48f89e7 and 934880f is what lead me to believe that #124810 is the (supposed) culprit.

For reference: Tests were run on a machine with i7-7700K CPU with frequency locked at 4GHz.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such P-high High priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants