Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic overflow in rustc-rayon-core causing CI failures #90227

Open
hkratz opened this issue Oct 24, 2021 · 11 comments
Open

Non-deterministic overflow in rustc-rayon-core causing CI failures #90227

hkratz opened this issue Oct 24, 2021 · 11 comments
Labels
A-parallel-queries Area: Parallel query execution T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-parallel Working group: Parallelizing the compiler

Comments

@hkratz
Copy link
Contributor

hkratz commented Oct 24, 2021

It seems that enabling overflow checks (#89776) uncovered non-deterministic overflows in rustc-rayon-core causing the build failures of #90042 and #90222:

Overflow location:
https://github.com/rust-lang/rustc-rayon/blob/c8ec88d8a2236d1fe19d65a4ab38834f76d256b7/rayon-core/src/sleep/mod.rs#L330

This code is not in upstream rayon but just in our fork, introduced here:
rust-lang/rustc-rayon@27911f7#diff-cde9d726ca2f32c319420be6c61d2e57ad9daff7f85a5f8f7be25474a15f9c4dR328

@hkratz hkratz changed the title Non-deterministic overflows in rustc-rayon-core causing CI failures Non-deterministic overflow in rustc-rayon-core causing CI failures Oct 24, 2021
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Oct 24, 2021
. rust-lang#89776 enabled overflow checks in CI but these lead to two failures already:

rust-lang#90042 (comment)
rust-lang#90222 (comment)

The (first?) problem has been identified: rust-lang#90227

This PR temporarily disables the overflow checks again so we don't have to deal with the "spurious" CI failures until rustc-rayon is fixed.

Zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/ci.20failed.20with.20.20.22attempt.20to.20subtract.20with.20overflow.22
@hkratz
Copy link
Contributor Author

hkratz commented Oct 24, 2021

@rustbot label +T-compiler +A-parallel-queries

@rustbot rustbot added A-parallel-queries Area: Parallel query execution T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 24, 2021
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Oct 24, 2021
…rflow-checks, r=Mark-Simulacrum

Temporarily turn overflow checks off for rustc-rayon-core

The rustc fork of Rayon has deadlock detection code which intermittently causes overflows in the CI (see rust-lang#90227). So, as a workaround, we unconditionally turn overflow checks off for this crate only.

This workaround should be removed once rust-lang#90227 is fixed.

r? `@Mark-Simulacrum`

cc `@matthiaskrgr`
@hkratz
Copy link
Contributor Author

hkratz commented Oct 24, 2021

Interestingly enough the errors only appeared on Windows, in rustdoc and in builds which don't have the parallel queries enabled. So I guess Rustdocs use of rustc-rayon triggers it?

if !self.sync_only && cfg!(windows) {
// A possible future enhancement after more detailed profiling would
// be to create the file sync so errors are reported eagerly.
let sender = self.errors.clone().expect("can't write after closing");
rayon::spawn(move || {
fs::write(&path, contents).unwrap_or_else(|e| {
sender.send(format!("\"{}\": {}", path.display(), e)).unwrap_or_else(|_| {
panic!("failed to send error on \"{}\"", path.display())
})
});
});
} else {

@camelid
Copy link
Member

camelid commented Oct 27, 2021

cc @jyn514 since IIRC you were looking into the DocFS code at some point recently

@jyn514
Copy link
Member

jyn514 commented Oct 27, 2021

I don't have any insight here, I think this just happens to be the only part of the compiler using rayon when cfg(parallel_compiler) is disabled.

@Mark-Simulacrum
Copy link
Member

Hm... I would suggest that rustdoc probably wants to use mainline rayon here, rather than rustc's fork of it, since I/O code likely doesn't want to interop with jobserver and such. I don't know whether that will fix this particular issue, but it seems like a good idea.

@jyn514
Copy link
Member

jyn514 commented Oct 27, 2021

@Mark-Simulacrum what jobserver are you referring to here? I don't know what the differences between mainline rayon and rustc's fork are.

@Mark-Simulacrum
Copy link
Member

The fork is primarily intended to interface with the jobserver that Cargo creates (though I don't know that rustdoc ever configures it as such) to ensure that rayon's worker threads are not using more than the allocated share of CPU resources. This is the same jobserver that manages e.g. LLVM parallelism within each rustc compilation, for example, to ensure there's at most "ncpu" threads running at the same time.

For I/O heavy work like this, I'd expect that you don't want that - your threads are mostly blocked on I/O completion which is not CPU-heavy for the most part.

camelid added a commit to camelid/rust that referenced this issue Oct 29, 2021
The rustc fork of rayon integrates with Cargo's jobserver to limit the
amount of parallelism. However, rustdoc's use case is concurrent I/O,
which is not CPU-heavy, so it should be able to use mainline rayon.

See this discussion [1] for more details.

[1]: rust-lang#90227 (comment)

Note: I chose rayon 1.3.1 so that the rayon version used elsewhere in
the workspace does not change.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 29, 2021
rustdoc: Switch to mainline rayon

The rustc fork of rayon integrates with Cargo's jobserver to limit the
amount of parallelism. However, rustdoc's use case is concurrent I/O,
which is not CPU-heavy, so it should be able to use mainline rayon.

See [this discussion][1] for more details.

[1]: rust-lang#90227 (comment)

Note: I chose rayon 1.3.1 so that the rayon version used elsewhere in
the workspace does not change.

r? `@Mark-Simulacrum`
cc `@jyn514`
@the8472
Copy link
Member

the8472 commented Oct 29, 2021

Hm... I would suggest that rustdoc probably wants to use mainline rayon here, rather than rustc's fork of it, since I/O code likely doesn't want to interop with jobserver and such. I don't know whether that will fix this particular issue, but it seems like a good idea.

Is that really the right solution? It seems like the overflow checks have found a bug in rustc_rayon which was exposed by the use in rustdoc. Removing the use in rustdoc does not remove the fix additions in the custom rayon version.

Isn't the goal of adding overflow checks to uncover latent bugs and then fix them rather than finding new ways to ignore them?

@Mark-Simulacrum
Copy link
Member

We should still try to fix the bug, absolutely, and investigate whether it manifests in upstream rayon as well. However, I don't think that changes the assessment that relying on upstream rayon is likely better than the fork in the long run for rustdoc's use case.

@camelid
Copy link
Member

camelid commented Oct 29, 2021

Is that really the right solution? It seems like the overflow checks have found a bug in rustc_rayon which was exposed by the use in rustdoc. Removing the use in rustdoc does not remove the fix additions in the custom rayon version.

Yeah, we switched to mainline rayon mostly for other reasons. That's why I left this issue open :)

@Mark-Simulacrum Mark-Simulacrum added the WG-compiler-parallel Working group: Parallelizing the compiler label Jan 6, 2024
@saethlin
Copy link
Member

@hkratz Please paste the contents of links in issues instead of just providing a URL. Those links to CI logs now dead, so the only hint now about how to reproduce this is

on Windows, in rustdoc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parallel-queries Area: Parallel query execution T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-parallel Working group: Parallelizing the compiler
Projects
None yet
Development

No branches or pull requests

7 participants