-
-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix release build crash on arch linux #5601
Conversation
This is extremely unlikely to be the actual cause of the crash. Multiplication and double-to-float conversion should not throw any floating-point exceptions and if there's a segfault it's not the fault of this specific line. Additionally, without a backtrace + log or other detailed error context, I cannot approve this PR. "Game crashes" is very nebulous and could be related to several different issues. |
I'm with you completely. I did check for nulls in case it's a race condition, but validating the design further is arduous in a new code base. The PR was mainly to suggest if this could be the fix. I'll look into why this actually fixed the issue when I have time for it as I found some references that debug and release builds might end up doing very different things with mixing floats and doubles together especially on the x86 architecture. Why that is, how it works beneath the bonnet and why would that be the culprit is a bunch of details I'm curious about. Is there a ratified reason to the double/float fragmentation in the types of the system? My first instinct is that the floats are a means of optimisation, which might have been premature, and gradually the code base became sprinkled with floats and doubles quite randomly. Now, I have no clue if this is so, but knowing if it is deliberate would help figuring out what to be changed and what should be left alone. |
This is puzzling. The only way getting around this bug in my case has been to use doubles. For example:
Using floats at any point inside that method will crash with SIGFPE invalid floating point operation. Tried valgrind and also returning just 0.0 and 1.0. First constant errored by div/0, but the second worked fine. This would imply, that there's not necessarily any CTOR/DTOR/delete/pointer mysticism happening somewhere nearby. Valgrind had the same (non-)issue as the debug build - for some reason it doesn't crash. Thus a backtrace isn't helping much. Attached below is the disassembly code with a marker at the presumed crash location. Info log:
Here's the output file: Here's the disassembly on SIGFPE:
|
Propulsion::GetThrust(...)
If you can, please attach GDB information regarding the value of function parameters+locals, XMM1/3 registers, and backtrace from the crash. x86 |
Here's the contents of xmm0,1,3 registers: The thruster value is within bounds. It is passed as enum from this method:
After logging a bunch of calls, it seems as the culprit thruster seems to be number 4 which is the
The weird part is, that the crash happens in the last
It almost seems as if something happens in optimisations, because if I set the I did check the GCC bug reporting board, but could not find any reference to a similar bug due to optimisation or any other reason. Just in case, my gcc version is Am I really the only one this crash has occurred to :D For completeness, here's the working method for comparison. Adding
|
I'm still running GCC 12.2.0, and binary versions of the game are built with an older GCC.
This is the first time I've seen this bug reported before, and the first time I've ever seen anything close to this callsite. I'm pretty sure this is a new bug, and unless this can be tracked down to an invalid input...
I'm going to go out a very short distance on a limb here and blame GCC 13 for this bug. The fact that simply adding I am fine with this PR for the semantic reason of dealing with thrust values in the giganewtons with more accuracy, and if it happens to have the nice side effect of working around a defect when compiling with GCC 13 I'll take it.
To finally answer this question, we by convention use doubles for anything requiring a high level of precision (e.g. planetary coordinates, velocity vectors in 10s of km/s, etc.), as floats begin to lose usable precision around values of 3,000-30,000 depending on context. There are some performance reasons for using floats in certain spots, as doubles are half-rate if not slower, and some of the code in question dates back to an era when Pioneer was expected to run at the lowest spec on computers without even SSE. Most game engines use floats for performance reasons (cache occupancy and execution rate), with some engines judiciously using doubles when needed for large-world support. |
Still not certain of this. I tried to compile with an older version ( And I was all excited about finding a bug in gcc :D Have to try a few builds back and forth checking if some of the dependencies around it and the project have an effect. |
To succinctly explain, I was able to reproduce the bug in question on Clang 15 - the bug is caused by GCC+Clang auto-vectorizing GetThrustMin() and loading the low-half of a double value (which happens to be a NaN bitpattern) into an "unused" SIMD lane that happens to survive all the way to the As a result, I'm going to close this PR and replace it with one that simply disables FPEs during game runtime. Thank you @d3rp for reporting this bug and all the help in tracking it down! The long-form from IRC: (doing @impaktor's job for him 😛) 03:50 <sturnclaw> hahahahahaha
03:50 <sturnclaw> I'm going insane
03:50 <sturnclaw> clang and GCC both decide they're going to read past the end of the m_linAccelerationCap array when copying the last value to the stack
03:50 <sturnclaw> (er, newer-than-my-system GCC)
03:51 <sturnclaw> so they copy the next three dwords, which happen to be m_fuelTankMass and the low half of m_thrusterFuel
03:53 <Gliese852> sturnclaw: leaning towards a bug in the compilers, or do we have a formal UB somewhere?
03:54 <sturnclaw> it just so happens that the low-half of 0.8345220088704485 is exactly the right bitpattern to encode a QNaN which triggers a SIGFPE (specifically FE_INVALID) if it survives long enough to reach a minps instruction
03:55 <sturnclaw> Gliese852: definitely a compiler bug
03:55 <sturnclaw> strange that it's in multiple compilers though
03:56 <impaktor> Weird
03:56 <sturnclaw> this is an optimizer issue - it should absolutely not be permitted to load a 128-bit value "over the end" of a known-range array under "UB"
03:57 <sturnclaw> This is happening because of auto-vectorization of Propulsion::GetThrustMin() - all three GetThrust calls are unwrapped into the same function, and three lanes of an SSE minps instruction used to evaluate them
03:58 <sturnclaw> the "bandaid" fix is to mark the intermediate mass value volatile, which prevents the function from being vectorized
03:58 <sturnclaw> (basically just turning off the optimizer)
03:58 <sturnclaw> technically we could just turn off all floating point exceptions and call it a day
03:58 <sturnclaw> I'm very very tempted to do so
03:59 <sturnclaw> SIGFPE is not exactly useful for us
03:59 <impaktor> Should this bug be reported somewhere?
03:59 <sturnclaw> none of the results of our calculations actually matter enough to try to catch the signal
04:00 <sturnclaw> impaktor: if a minimal example could be reproduced, it should be filed with both GCC and Clang
04:00 <sturnclaw> unfortunately, minimal examples take more of my time to make than I intend to spend on this bug
04:00 <impaktor> Roger that
04:01 <sturnclaw> if clang/gcc will take bug reports that are "compile my repo at this commit and look at disassembly" then it's worth reporting
04:01 <sturnclaw> ...I don't want to deal with that though
04:01 <impaktor> I can drop a copy of this conversation in #5601, and see if anyone feeling up for it.
04:01 <cmdr-jameson> pull request #5601: Fix release build crash on arch linux https://github.com/pioneerspacesim/pioneer/pull/5601
04:04 <Gliese852> impaktor: starting with sturnclaw's hysterical laughter please
04:04 <sturnclaw> feel free
04:05 <sturnclaw> found the bug report by the way: https://github.com/llvm/llvm-project/issues/63708
04:05 <sturnclaw> the "solution" is just fp-strict and disable FPEs
04:05 <sturnclaw> clang says "lol, we don't care about floating point exceptions"
04:06 <sturnclaw> technically we "should" be using fp-strict given we have floating-point ops involved in terrain generation
04:06 <impaktor> ha!
04:06 <sturnclaw> unsure about the performance cost
04:06 <sturnclaw> *system generation, not terrain generation
[...]
04:17 <sturnclaw> bahahahahahahaha
04:17 <sturnclaw> -ffp-model=strict moves the source line the exception is generated on
04:17 <sturnclaw> but it doesn't prevent the vectorization that's responsible for the bug
04:18 <sturnclaw> now instead of copying the value to the stack and shuffling it through a bunch of registers
04:19 <sturnclaw> it just loads the extra bytes straight into the register that it later compares |
Replication
Expected
Error
Other notes
Potential bug
min
-something at Propulsion::GetThrustMin()Mitigation
Apologies for the .gitignore commit messing up the actual fix. I'm short on time and it was a reflex. :)