Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix release build crash on arch linux #5601

Closed
wants to merge 2 commits into from
Closed

Conversation

d3rp
Copy link
Contributor

@d3rp d3rp commented Aug 14, 2023

Replication

  • Build release build on arch linux
  • Start pioneer
  • Start a new game in Mars

Expected

  • game starts in Mars

Error

  • Game crashes before loading the cockpit screen

Other notes

  • Debug build starts normally

Potential bug

  • Debugger would show assembler instruction min-something at Propulsion::GetThrustMin()
  • GetThrustMin would cast double to float, compare and cast as return type. This could be the culprit

Mitigation

  • changing the types from float to double fixed this
  • Game starts normally with the release build

Apologies for the .gitignore commit messing up the actual fix. I'm short on time and it was a reflex. :)

@sturnclaw
Copy link
Member

This is extremely unlikely to be the actual cause of the crash. Multiplication and double-to-float conversion should not throw any floating-point exceptions and if there's a segfault it's not the fault of this specific line.

Additionally, without a backtrace + log or other detailed error context, I cannot approve this PR. "Game crashes" is very nebulous and could be related to several different issues.

@d3rp
Copy link
Contributor Author

d3rp commented Aug 15, 2023

I'm with you completely. I did check for nulls in case it's a race condition, but validating the design further is arduous in a new code base. The PR was mainly to suggest if this could be the fix. I'll look into why this actually fixed the issue when I have time for it as I found some references that debug and release builds might end up doing very different things with mixing floats and doubles together especially on the x86 architecture. Why that is, how it works beneath the bonnet and why would that be the culprit is a bunch of details I'm curious about.

Is there a ratified reason to the double/float fragmentation in the types of the system? My first instinct is that the floats are a means of optimisation, which might have been premature, and gradually the code base became sprinkled with floats and doubles quite randomly. Now, I have no clue if this is so, but knowing if it is deliberate would help figuring out what to be changed and what should be left alone.

@d3rp
Copy link
Contributor Author

d3rp commented Aug 15, 2023

This is puzzling. The only way getting around this bug in my case has been to use doubles. For example:

double Propulsion::GetThrust(Thruster thruster) const
{
	// acceleration = thrust / mass
	// thrust = acceleration * mass
	const double mass = m_dBody->GetMass();
	return std::min(
		static_cast<double>(m_linThrust[thruster]),
		mass * m_linAccelerationCap[thruster]);
}

Using floats at any point inside that method will crash with SIGFPE invalid floating point operation.

Tried valgrind and also returning just 0.0 and 1.0. First constant errored by div/0, but the second worked fine. This would imply, that there's not necessarily any CTOR/DTOR/delete/pointer mysticism happening somewhere nearby.

Valgrind had the same (non-)issue as the debug build - for some reason it doesn't crash. Thus a backtrace isn't helping much. Attached below is the disassembly code with a marker at the presumed crash location.

Info log:

Info: ver 20230814 (fca7c071b) on: Linux
Info: System Name: Linux
Host Name: minilith
Release(Kernel) Version: 6.4.9-arch1-1
Kernel Build Timestamp: #1 SMP PREEMPT_DYNAMIC Tue, 08 Aug 2023 22:14:05 +0000
Machine Arch: x86_64
Domain Name: (none)
Info: --------------------
Info: SDL Version (build) 2.28.2
Info: SDL Version (dynamic) 2.28.2
Info: SDL Versions match
Info: SDL_image Version (build): 2.6.3
Info: SDL_image Version (dynamic): 2.6.3
Info: SDL_image Versions match
Info: Assimp Version: 5.2.0
Info: FreeType Version: 2.13.1
Info: GLEW dynamic version: 2.2.0
Info: --------------------
Info: 
Info: Initialized OpenGL 3.1, with extensions, renderer
Info: Created shader vtxColor (address=0x55d4b88673d0)
Info: Initializing joystick subsystem.
Info: Found joystick 'Sony Computer Entertainment Wireless Controller' (GUID: 0300e8fe4c050000c405000011810000)
Info:   - 6 axes, 13 buttons, 1 hats
Info:   - axis 0: deadzone 0.04, curve: 1.00, half-axis mode: 0
Info:   - axis 1: deadzone 0.04, curve: 1.00, half-axis mode: 0
Info:   - axis 2: deadzone 0.04, curve: 1.00, half-axis mode: 1
Info:   - axis 3: deadzone 0.04, curve: 1.00, half-axis mode: 0
Info:   - axis 4: deadzone 0.04, curve: 1.00, half-axis mode: 0
Info:   - axis 5: deadzone 0.10, curve: 1.00, half-axis mode: 1
Info: font orbiteer:
Info: - Orbiteer-Bold.ttf 1.000000
Info: - DejaVuSans.ttf 1.200000
Info: - wqy-microhei.ttc 1.000000
Info: font pionillium:
Info: - PionilliumText22L-Medium.ttf 1.000000
Info: - DejaVuSans.ttf 0.928571
Info: - wqy-microhei.ttc 1.000000
Info: Created shader ui (address=0x55d4b81775c0)
Info: Created shader gen_gas_giant_colour (address=0x55d4b8192500)
Info: Loaded economy info: 3 economies, 31 commodities (10 consumable)
Info: started 15 worker threads in 2.22ms
Info: 

Engine startup took 862.95ms
Info: StartupScreen::Start()
Info: ShipType::Init()
Info: Lua::Init()
Info: Lua 5.2
Info: Loading [00%]: Sound::Init started
Warning: Commodity rubbish has invalid producing economy ALL
Info: Loading [00%]: Sound::Init took 427.12ms
Info: Loading [08%]: Lua::InitModules() started
Info: Ship Def found: coronatrix
Info: Ship Def found: coronatrix_police
Info: Ship Def found: sinonatrix
Info: Ship Def found: wave
Info: Ship Def found: venturestar
Info: Ship Def found: lunarshuttle
Info: Ship Def found: varada
Info: Ship Def found: storeria
Info: Ship Def found: pumpkinseed
Info: Ship Def found: kanara
Info: Ship Def found: natrix
Info: Ship Def found: dsminer
Info: Ship Def found: bowfin
Info: Ship Def found: bluenose
Info: Ship Def found: kanara_civ
Info: Ship Def found: ac33
Info: Ship Def found: nerodia
Info: Ship Def found: sinonatrix_police
Info: Ship Def found: malabar
Info: Ship Def found: pumpkinseed_police
Info: Ship Def found: skipjack
Info: Ship Def found: vatakara
Info: Ship Def found: molamola
Info: Ship Def found: missile_unguided
Info: Ship Def found: missile_naval
Info: Ship Def found: lodos
Info: Ship Def found: xylophis
Info: Ship Def found: missile_guided
Info: Ship Def found: missile_smart
Info: Ship Def found: deneb
Info: Ship Def found: molaramsayi
Info: 31
Info: Loading [08%]: Lua::InitModules() took 91.18ms
Info: Loading [17%]: GalaxyGenerator::Init() started
Info: Creating new galaxy generator 'legacy' version 1
Info: Warning: 'mass' is 0.000000 for body 'Themisto'
Info: Warning: 'mass' is 0.000000 for body 'Lucksmall'
Info: Number of factions added: 103
Info: StarSystemCache: misses: 0, slave hits: 0, master hits: 0
Info: SectorCache: misses: 101, slave hits: 0, master hits: 2
Info: Loading [17%]: GalaxyGenerator::Init() took 13.06ms
Info: Loading [25%]: FaceParts::Init() started
Info: Face Generation source images loaded.
Info: Loading [25%]: FaceParts::Init() took 4.37ms
Info: Loading [33%]: new ModelCache started
Info: Loading [33%]: new ModelCache took 0.00ms
Info: Loading [42%]: Shields::Init started
Info: Created shader shield (address=0x55d4b97c8e30)
Info: Loading [42%]: Shields::Init took 0.45ms
Info: Loading [50%]: BaseSphere::Init started
Info: GenerateIndices: triangles count = 2312, mid indexes = 6528, hi edges = 102
Info: Loading [50%]: BaseSphere::Init took 48.06ms
Info: Loading [58%]: CityOnPlanet::Init started
Info: Loading city definition file 'configs/buildings/default.json'
Info: Created shader multi (address=0x55d4b96d97b0)
Info: 	Created city definition flavour #1
Info: Loading [58%]: CityOnPlanet::Init took 112.84ms
Info: Loading [67%]: SpaceStation::Init started
Info: Created shader label (address=0x55d4bd3f3420)
Info: ground_station has:
 14 entrances,
 14 pads,
 0 exits
Info: new_ground has:
 6 entrances,
 6 pads,
 0 exits
Info: orbital_station_2-10k has:
 5 entrances,
 64 pads,
 5 exits
Info: orbital_station_2-2k has:
 4 entrances,
 18 pads,
 4 exits
Info: orbital_station_2-5k has:
 5 entrances,
 64 pads,
 5 exits
Info: orbital_station_2-5k10k has:
 5 entrances,
 64 pads,
 5 exits
Info: Loading [67%]: SpaceStation::Init took 247.65ms
Info: Loading [75%]: NavLights::Init started
Info: Created shader billboards (address=0x55d4c35a6450)
Info: Loading [75%]: NavLights::Init took 0.34ms
Info: Loading [83%]: Sfx::Init started
Info: Loading [83%]: Sfx::Init took 2.34ms
Info: Loading [92%]: PostLoad started
Info: Loading [92%]: PostLoad took 0.02ms
Info: 

Pioneer loading took 1060.64ms
Info: Creating new galaxy generator 'legacy' version 1
Info: Clearing and re-using previous Galaxy object
Info: StarSystemCache: misses: 0, slave hits: 0, master hits: 0
Info: SectorCache: misses: 101, slave hits: 0, master hits: 1
Info: Created shader starfield (address=0x55d4be208380)
Info: Stars picked from galaxy: 106058
Info: Generating 18942 random stars
Info: Final stars number: 125000
Info: Created shader skybox (address=0x55d4bd3bb620)
Info: Created shader unlit (address=0x55d4cb83deb0)
Info: Creating new galaxy generator 'legacy' version 1
Info: Clearing and re-using previous Galaxy object
Info: StarSystemCache: misses: 0, slave hits: 0, master hits: 0
Info: SectorCache: misses: 0, slave hits: 0, master hits: 0
Info: Stars picked from galaxy: 106058
Info: Generating 18942 random stars
Info: Final stars number: 125000
Info: Creating new galaxy generator 'legacy' version 1
Info: Clearing and re-using previous Galaxy object
Info: StarSystemCache: misses: 0, slave hits: 0, master hits: 0
Info: SectorCache: misses: 0, slave hits: 0, master hits: 0
Info: Stars picked from galaxy: 106058
Info: Generating 18942 random stars
Info: Final stars number: 125000
Info: Warning: Lua custom Systems definition: Surface starport has been automatically relocated. This is in order to place it on flatter ground to reduce the chance of landing pads being buried. This is not an error as such and you may attempt to move the starport to another location by changing latitude and longitude fields.
      Surface starport name: Cydonia, Body name: Mars, In sector: x = 0, y = 0, z = 0.
Info: Warning: Lua custom Systems definition: Surface starport has been automatically relocated. This is in order to place it on flatter ground to reduce the chance of landing pads being buried. This is not an error as such and you may attempt to move the starport to another location by changing latitude and longitude fields.
      Surface starport name: Tomm's Sanctuary, Body name: Deimos, In sector: x = 0, y = 0, z = 0.
Info: Warning: Lua custom Systems definition: Surface starport has been automatically relocated. This is in order to place it on flatter ground to reduce the chance of landing pads being buried. This is not an error as such and you may attempt to move the starport to another location by changing latitude and longitude fields.
      Surface starport name: Thebe Gas Refinery, Body name: Thebe, In sector: x = 0, y = 0, z = 0.
Info: Warning: Lua custom Systems definition: Surface starport has been automatically relocated. This is in order to place it on flatter ground to reduce the chance of landing pads being buried. This is not an error as such and you may attempt to move the starport to another location by changing latitude and longitude fields.
      Surface starport name: Pluto Research Base, Body name: Pluto, In sector: x = 0, y = 0, z = 0.
Info: Created shader sphereimpostor (address=0x55d4e2857c90)
Info: Listening for headtracking packets on 127.0.0.1:4242
Info: WARNING: ship varada discarded, because there are no hyperspace routes available for it
Info: WARNING: lowering ships flow by 0.69 for performance reasons
Signal: SIGFPE (signal SIGFPE: invalid floating point operation)

Here's the output file:

output.txt

Here's the disassembly on SIGFPE:

Propulsion::GetThrustMin() const:
	pushq  %rbx
	subq   $0x30, %rsp
	movq   %rdi, %rbx
	movq   0x98(%rdi), %rdi
	movq   (%rdi), %rax
	callq  *0x58(%rax)
	movaps %xmm0, 0x10(%rsp)
	movq   0x98(%rbx), %rdi
	movq   (%rdi), %rax
	movss  0x14(%rbx), %xmm0          ; xmm0 = mem[0],zero,zero,zero
	movaps %xmm0, 0x20(%rsp)
	movss  0x60(%rbx), %xmm0          ; xmm0 = mem[0],zero,zero,zero
	movapd %xmm0, (%rsp)
	callq  *0x58(%rax)
	unpcklpd 0x10(%rsp), %xmm0        ; xmm0 = xmm0[0],mem[0]
	cvtpd2ps %xmm0, %xmm0
	movups 0x20(%rbx), %xmm1
	movups 0x6c(%rbx), %xmm2
	movaps (%rsp), %xmm3
	movlhps %xmm2, %xmm3              ; xmm3 = xmm3[0],xmm2[0]
	shufps $0xe2, %xmm2, %xmm3        ; xmm3 = xmm3[2,0],xmm2[2,3]
	mulps  %xmm0, %xmm3
	movaps 0x20(%rsp), %xmm0
	movlhps %xmm1, %xmm0              ; xmm0 = xmm0[0],xmm1[0]
	shufps $0xe2, %xmm1, %xmm0        ; xmm0 = xmm0[2,0],xmm1[2,3]
>>>	minps  %xmm0, %xmm3
	movaps %xmm3, %xmm0
	shufps $0x55, %xmm3, %xmm0        ; xmm0 = xmm0[1,1],xmm3[1,1]
	minss  %xmm0, %xmm3
	movaps %xmm3, (%rsp)
	movq   0x98(%rbx), %rdi
	movq   (%rdi), %rax
	callq  *0x58(%rax)
	cvtsd2ss %xmm0, %xmm0
	mulss  0x68(%rbx), %xmm0
	minss  0x1c(%rbx), %xmm0
	minss  (%rsp), %xmm0
	cvtss2sd %xmm0, %xmm0
	addq   $0x30, %rsp
	popq   %rbx
	retq

@sturnclaw
Copy link
Member

If you can, please attach GDB information regarding the value of function parameters+locals, XMM1/3 registers, and backtrace from the crash. x86 minps indicates that it only raises a FPE on NANs and denormals, so either the thruster parameter is out-of-range and there's addressing beyond the array bounds (something that valgrind wouldn't necessarily catch in this case) or there's something weirder going on.

@d3rp
Copy link
Contributor Author

d3rp commented Aug 16, 2023

Here's the contents of xmm0,1,3 registers:
gdb.txt

The thruster value is within bounds. It is passed as enum from this method:

double Propulsion::GetThrustMin() const
{
	// These are the weakest thrusters in a ship
	double val = GetThrust(THRUSTER_UP);
	val = std::min(val, GetThrust(THRUSTER_RIGHT));
	val = std::min(val, GetThrust(THRUSTER_LEFT));
	return val;
}

After logging a bunch of calls, it seems as the culprit thruster seems to be number 4 which is the THRUSTER_LEFT below. However, the GetThrustMin gets called multiple times in the same order before it fails. Presumably these are other ships, but I couldn't catch which specific one was the last one before crashing.

enum Thruster { // <enum scope='Thruster' name=ShipTypeThruster prefix=THRUSTER_ public>
	THRUSTER_REVERSE,
	THRUSTER_FORWARD,
	THRUSTER_UP,
	THRUSTER_DOWN,
	THRUSTER_LEFT,
	THRUSTER_RIGHT,
	THRUSTER_MAX // <enum skip>
};

The weird part is, that the crash happens in the last val assignment of GetThrustMin which itself calls GetThrust. These calls are not shown in the previous assembler listing as presumably the optimised code holds no clear references to the individual GetThrust calls.

double Propulsion::GetThrust(Thruster thruster) const
{
	// acceleration = thrust / mass
	// thrust = acceleration * mass
	const float mass = static_cast<float>(m_dBody->GetMass());
	return std::min(
		m_linThrust[thruster],
		m_linAccelerationCap[thruster] * mass);
}

It almost seems as if something happens in optimisations, because if I set the mass variable as volatile, it does not crash. Also it is used in std::min and passed to the callee's std::min so could it be the optimiser is too greedy on this and takes too big of a bite? Sounds weird. I found this out when I used the mass variable to log-print and suddenly it did not crash again.

I did check the GCC bug reporting board, but could not find any reference to a similar bug due to optimisation or any other reason.

Just in case, my gcc version is gcc (GCC) 13.2.1 20230801

Am I really the only one this crash has occurred to :D

For completeness, here's the working method for comparison. Adding volatile makes the release build run normally - without it the game crashes.

double Propulsion::GetThrust(Thruster thruster) const
{
	// acceleration = thrust / mass
	// thrust = acceleration * mass
	volatile const float mass = static_cast<float>(m_dBody->GetMass());
	return std::min(
		m_linThrust[thruster],
		m_linAccelerationCap[thruster] * mass);
}

@sturnclaw
Copy link
Member

Just in case, my gcc version is gcc (GCC) 13.2.1 20230801

I'm still running GCC 12.2.0, and binary versions of the game are built with an older GCC.

Am I really the only one this crash has occurred to :D

This is the first time I've seen this bug reported before, and the first time I've ever seen anything close to this callsite. I'm pretty sure this is a new bug, and unless this can be tracked down to an invalid input...

For completeness, here's the working method for comparison. Adding volatile makes the release build run normally - without it the game crashes.

I'm going to go out a very short distance on a limb here and blame GCC 13 for this bug. The fact that simply adding volatile generates code which doesn't raise an FPE indicates to me there's an incorrect optimization taking place which is either leaving a variable uninitialized/invalid after conversion, or exposing a deeper bug with an uninitialized variable as one of the inputs to the calculation taking place. Without being able to replicate it myself, I cannot offer much more help with tracking down the reason at this time.

I am fine with this PR for the semantic reason of dealing with thrust values in the giganewtons with more accuracy, and if it happens to have the nice side effect of working around a defect when compiling with GCC 13 I'll take it.

Is there a ratified reason to the double/float fragmentation in the types of the system? My first instinct is that the floats are a means of optimisation, which might have been premature, and gradually the code base became sprinkled with floats and doubles quite randomly. Now, I have no clue if this is so, but knowing if it is deliberate would help figuring out what to be changed and what should be left alone.

To finally answer this question, we by convention use doubles for anything requiring a high level of precision (e.g. planetary coordinates, velocity vectors in 10s of km/s, etc.), as floats begin to lose usable precision around values of 3,000-30,000 depending on context. There are some performance reasons for using floats in certain spots, as doubles are half-rate if not slower, and some of the code in question dates back to an era when Pioneer was expected to run at the lowest spec on computers without even SSE. Most game engines use floats for performance reasons (cache occupancy and execution rate), with some engines judiciously using doubles when needed for large-world support.

@d3rp
Copy link
Contributor Author

d3rp commented Aug 18, 2023

Still not certain of this. I tried to compile with an older version (g++-12 (GCC) 12.3.0). Though that version did run without crashing, so did the next sanity check built again with the g++13. Last time I tried, the builds would consistently crash - now they consistently work.

And I was all excited about finding a bug in gcc :D Have to try a few builds back and forth checking if some of the dependencies around it and the project have an effect.

@sturnclaw
Copy link
Member

To succinctly explain, I was able to reproduce the bug in question on Clang 15 - the bug is caused by GCC+Clang auto-vectorizing GetThrustMin() and loading the low-half of a double value (which happens to be a NaN bitpattern) into an "unused" SIMD lane that happens to survive all the way to the minps instruction. There is no fix for this, -ffp-model=strict only changes the amount of indirection but doesn't affect the load-past-end. As a result, our only option is to run with FPEs disabled.

As a result, I'm going to close this PR and replace it with one that simply disables FPEs during game runtime. Thank you @d3rp for reporting this bug and all the help in tracking it down!

The long-form from IRC: (doing @impaktor's job for him 😛)

03:50 <sturnclaw> hahahahahaha
03:50 <sturnclaw> I'm going insane
03:50 <sturnclaw> clang and GCC both decide they're going to read past the end of the m_linAccelerationCap array when copying the last value to the stack
03:50 <sturnclaw> (er, newer-than-my-system GCC)
03:51 <sturnclaw> so they copy the next three dwords, which happen to be m_fuelTankMass and the low half of m_thrusterFuel
03:53 <Gliese852> sturnclaw: leaning towards a bug in the compilers, or do we have a formal UB somewhere?
03:54 <sturnclaw> it just so happens that the low-half of 0.8345220088704485 is exactly the right bitpattern to encode a QNaN which triggers a SIGFPE (specifically FE_INVALID) if it survives long enough to reach a minps instruction
03:55 <sturnclaw> Gliese852: definitely a compiler bug
03:55 <sturnclaw> strange that it's in multiple compilers though
03:56 <impaktor> Weird
03:56 <sturnclaw> this is an optimizer issue - it should absolutely not be permitted to load a 128-bit value "over the end" of a known-range array under "UB"
03:57 <sturnclaw> This is happening because of auto-vectorization of Propulsion::GetThrustMin() - all three GetThrust calls are unwrapped into the same function, and three lanes of an SSE minps instruction used to evaluate them
03:58 <sturnclaw> the "bandaid" fix is to mark the intermediate mass value volatile, which prevents the function from being vectorized
03:58 <sturnclaw> (basically just turning off the optimizer)
03:58 <sturnclaw> technically we could just turn off all floating point exceptions and call it a day
03:58 <sturnclaw> I'm very very tempted to do so
03:59 <sturnclaw> SIGFPE is not exactly useful for us
03:59 <impaktor> Should this bug be reported somewhere? 
03:59 <sturnclaw> none of the results of our calculations actually matter enough to try to catch the signal
04:00 <sturnclaw> impaktor: if a minimal example could be reproduced, it should be filed with both GCC and Clang
04:00 <sturnclaw> unfortunately, minimal examples take more of my time to make than I intend to spend on this bug
04:00 <impaktor> Roger that
04:01 <sturnclaw> if clang/gcc will take bug reports that are "compile my repo at this commit and look at disassembly" then it's worth reporting 
04:01 <sturnclaw> ...I don't want to deal with that though
04:01 <impaktor> I can drop a copy of this conversation in #5601, and see if anyone feeling up for it.
04:01 <cmdr-jameson> pull request #5601: Fix release build crash on arch linux https://github.com/pioneerspacesim/pioneer/pull/5601
04:04 <Gliese852> impaktor: starting with sturnclaw's hysterical laughter please
04:04 <sturnclaw> feel free
04:05 <sturnclaw> found the bug report by the way: https://github.com/llvm/llvm-project/issues/63708
04:05 <sturnclaw> the "solution" is just fp-strict and disable FPEs
04:05 <sturnclaw> clang says "lol, we don't care about floating point exceptions"
04:06 <sturnclaw> technically we "should" be using fp-strict given we have floating-point ops involved in terrain generation
04:06 <impaktor> ha!
04:06 <sturnclaw> unsure about the performance cost
04:06 <sturnclaw> *system generation, not terrain generation
[...]
04:17 <sturnclaw> bahahahahahahaha
04:17 <sturnclaw> -ffp-model=strict moves the source line the exception is generated on
04:17 <sturnclaw> but it doesn't prevent the vectorization that's responsible for the bug
04:18 <sturnclaw> now instead of copying the value to the stack and shuffling it through a bunch of registers
04:19 <sturnclaw> it just loads the extra bytes straight into the register that it later compares

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants