A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

MoSal · 2025-04-13T02:25:22Z

Code

I tried this code:

use std::{iter::Peekable, str::CharIndices};
use std::sync::LazyLock as Lazy;

use memchr::memmem::Finder;
use itertools::Itertools;


struct Foo<'a> {
    txt: &'a str,
    ci: Peekable<CharIndices<'a>>,
}

impl<'a> Foo<'a> {
    fn new(txt: &'a str) -> Self {
        let ci = txt.char_indices().peekable();
        Self { txt, ci }
    }

    fn next(&mut self) -> Option<(usize, usize)> {
        static FI: Lazy<Finder> = Lazy::new(|| Finder::new(b"\n"));

        // XXX: switching between these two lines makes a difference
        let &(start, _) = self.ci.peek()?;
        //let start = self.ci.peek().map(|&(idx, _)| idx)?;
        
        let matches = FI
            .find_iter(&self.txt.as_bytes()[start..])
            .map(|idx| idx + start);

        for next_match in matches {
            if self.txt.is_char_boundary(next_match) {
                self.ci.by_ref()
                    //.take_while(|&(idx, _)| idx < next_match)
                    .peeking_take_while(|&(idx, _)| idx <= next_match)
                    .for_each(drop);
                return Some((start, next_match))
            }
        }

        self.ci.by_ref().for_each(drop);
        Some((start, self.txt.len()))
    }
}

fn main() {
    let s = "1".repeat(20_000) + "\n";
    let sn = s.repeat(200_000);
    let mut v = Foo::new(&sn);
    let mut p = Vec::with_capacity(200_000);
    while let Some(r) = v.next() {
        p.push(r);
    }
}

Cargo.toml

[package]
name = "tt-fluctuations"
version = "0.1.0"
edition = "2024"

[dependencies]
itertools = "0.14.0"
memchr = "2.7.4"

[profile.release]
codegen-units = 1
lto = true
strip = true

Cargo.lock

# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4

[[package]]
name = "either"
version = "1.15.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719"

[[package]]
name = "itertools"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2b192c782037fadd9cfa75548310488aabdbf3d2da73885b31bd0abd03351285"
dependencies = [
 "either",
]

[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"

[[package]]
name = "tt-fluctuations"
version = "0.1.0"
dependencies = [
 "itertools",
 "memchr",
]

Commits used to test:

2162e9d (or stable 1.86.0) => before LLVM 20
ce36a96 (Update to LLVM 20 #135763) => First regression
934880f (speed up String::push and String::insert #124810) => Second regression

Besides the two direct regressions. The fact that alternating between these two lines may cause
a big difference in performance is curious:

let &(start, _) = self.ci.peek()?; // called without_map in bench
let start = self.ci.peek().map(|&(idx, _)| idx)?; // with_map

But maybe that's a different issue.

Summary of hyperfine run:

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

Full numbers

Benchmark 1: ./2162e9d_with_map
  Time (mean ± σ):      3.894 s ±  0.014 s    [User: 3.202 s, System: 0.682 s]
  Range (min … max):    3.876 s …  3.909 s    4 runs

Benchmark 2: ./2162e9d_without_map
  Time (mean ± σ):      4.380 s ±  0.034 s    [User: 3.686 s, System: 0.683 s]
  Range (min … max):    4.346 s …  4.419 s    4 runs

Benchmark 3: ./934880f586f_with_map
  Time (mean ± σ):      5.388 s ±  0.020 s    [User: 4.702 s, System: 0.671 s]
  Range (min … max):    5.369 s …  5.415 s    4 runs

Benchmark 4: ./934880f586f_without_map
  Time (mean ± σ):      7.448 s ±  0.018 s    [User: 6.758 s, System: 0.666 s]
  Range (min … max):    7.423 s …  7.465 s    4 runs

Benchmark 5: ./ce36a96_with_map
  Time (mean ± σ):      4.128 s ±  0.016 s    [User: 3.448 s, System: 0.666 s]
  Range (min … max):    4.111 s …  4.149 s    4 runs

Benchmark 6: ./ce36a96_without_map
  Time (mean ± σ):      6.448 s ±  0.007 s    [User: 5.734 s, System: 0.696 s]
  Range (min … max):    6.441 s …  6.457 s    4 runs

Summary
  ./2162e9d_with_map ran
    1.06 ± 0.01 times faster than ./ce36a96_with_map
    1.12 ± 0.01 times faster than ./2162e9d_without_map
    1.38 ± 0.01 times faster than ./934880f586f_with_map
    1.66 ± 0.01 times faster than ./ce36a96_without_map
    1.91 ± 0.01 times faster than ./934880f586f_without_map

So building with current nightly and

let &(start, _) = self.ci.peek()?;

is almost twice as slow than building with stable and using

let start = self.ci.peek().map(|&(idx, _)| idx)?;

Version it worked on

It most recently worked on: 2162e9d

Version with regression

First regression

rustc --version --verbose:

rustc 1.87.0-nightly (ce36a966c 2025-02-17)
binary: rustc
commit-hash: ce36a966c79e109dabeef7a47fe68e5294c6d71e
commit-date: 2025-02-17
host: x86_64-unknown-linux-gnu
release: 1.87.0-nightly
LLVM version: 20.1.0

Second regression

rustc --version --verbose:

rustc 1.88.0-nightly (934880f58 2025-04-09)
binary: rustc
commit-hash: 934880f586f6ac1f952c7090e2a943fcd7775e7b
commit-date: 2025-04-09
host: x86_64-unknown-linux-gnu
release: 1.88.0-nightly
LLVM version: 20.1.2

The text was updated successfully, but these errors were encountered:

apiraino · 2025-04-14T09:35:08Z

Assigning priority (discussion on Zulip).

Possibly related to #139370

@rustbot label -I-prioritize +P-high

lincot · 2025-04-19T11:56:52Z

A quick run on my x86_64 comparing nightly-04-09 (d4f880f) and nightly-04-10 (934880f) showed
no difference without using map, but curiously a 1.76x slowdown with map:

Summary
  ./stable_with_map ran
    1.00 ± 0.02 times faster than ./2025-04-09_with_map
    1.00 ± 0.03 times faster than ./stable_without_map
    1.21 ± 0.02 times faster than ./2025-04-10_without_map
    1.22 ± 0.02 times faster than ./2025-04-09_without_map
    1.77 ± 0.04 times faster than ./2025-04-10_with_map

Full results

Benchmark 1: ./stable_with_map
  Time (mean ± σ):      2.480 s ±  0.044 s    [User: 2.290 s, System: 0.182 s]
  Range (min … max):    2.430 s …  2.519 s    4 runs

Benchmark 2: ./stable_without_map
  Time (mean ± σ):      2.489 s ±  0.043 s    [User: 2.308 s, System: 0.173 s]
  Range (min … max):    2.437 s …  2.525 s    4 runs

Benchmark 3: ./2025-04-09_with_map
  Time (mean ± σ):      2.486 s ±  0.034 s    [User: 2.309 s, System: 0.170 s]
  Range (min … max):    2.442 s …  2.520 s    4 runs

Benchmark 4: ./2025-04-09_without_map
  Time (mean ± σ):      3.030 s ±  0.007 s    [User: 2.854 s, System: 0.167 s]
  Range (min … max):    3.019 s …  3.035 s    4 runs

Benchmark 5: ./2025-04-10_with_map
  Time (mean ± σ):      4.381 s ±  0.039 s    [User: 4.198 s, System: 0.169 s]
  Range (min … max):    4.335 s …  4.413 s    4 runs

Benchmark 6: ./2025-04-10_without_map
  Time (mean ± σ):      3.005 s ±  0.023 s    [User: 2.817 s, System: 0.178 s]
  Range (min … max):    2.981 s …  3.033 s    4 runs

I don't see how #124810 could cause this, though.

MoSal · 2025-04-19T13:34:52Z

@lincot

Interesting. Maybe it's another case of phantom behavior from LLVM 20!

I just redid my original full hyperfine run which included 48f89e7:

  ./2162e9d_with_map ran
    1.04 ± 0.01 times faster than ./ce36a96_with_map
    1.05 ± 0.01 times faster than ./48f89e76596_with_map
    1.10 ± 0.01 times faster than ./2162e9d_without_map
    1.32 ± 0.01 times faster than ./934880f586f_with_map
    1.54 ± 0.01 times faster than ./ce36a96_without_map
    1.55 ± 0.01 times faster than ./48f89e76596_without_map
    1.76 ± 0.02 times faster than ./934880f586f_without_map

The difference between 48f89e7 and 934880f is what lead me to believe that #124810 is the (supposed) culprit.

For reference: Tests were run on a machine with i7-7700K CPU with frequency locked at 4GHz.

MoSal added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Apr 13, 2025

rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Apr 13, 2025

rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Apr 14, 2025

jieyouxu removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

MoSal commented Apr 13, 2025

apiraino commented Apr 14, 2025

lincot commented Apr 19, 2025

MoSal commented Apr 19, 2025

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

A case of compound x86_64 performance regression caused by LLVM 20 and #124810 #139730

Comments

MoSal commented Apr 13, 2025

Code

Version it worked on

Version with regression

First regression

Second regression

apiraino commented Apr 14, 2025

lincot commented Apr 19, 2025

MoSal commented Apr 19, 2025