Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests fail on x86 #2691

Closed
theMarix opened this issue Jun 23, 2023 · 25 comments · Fixed by #2737, #2740 or #2741
Closed

Tests fail on x86 #2691

theMarix opened this issue Jun 23, 2023 · 25 comments · Fixed by #2737, #2740 or #2741
Assignees
Milestone

Comments

@theMarix
Copy link
Contributor

Bug Report

Current behavior

When building on x86 (32 bit, not x86_64), the test suit fails with "117 failures are detected in the test module "Gridcoin Test Suite". This happens both on master as well as on the current development branch. Running the tests on x86_64 works as expected.

Expected behavior

Expected behaviour would be for the test suite to succeed on x86.

Steps to reproduce:

I run the following on an x86 Linux:

./autogen.sh
./configure --disable-bench --with-miniupnpc --with-qrencode --with-gui=qt5
make -j8 && VERBOSE=1 make check

This results in the following (shortened) output:

============================================================================
Testsuite summary for Gridcoin 5.4.5
============================================================================
# TOTAL: 2
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See src/test-suite.log
Please report to https://github.com/gridcoin/Gridcoin-Research/issues
============================================================================

I have attached a copy of the test-suite.log

You can reproduce the issue on an x86_64 host by using the attached Dockerfile and running the build with the x86 arch explicitly selected:

podman build --arch 386  -f Dockerfile .

Gridcoin version

  • master @ 5.4.5.0
  • development @ d614a0fa

Machine specs

  • OS: Linux
  • CPU: x86
@theMarix theMarix added the bug label Jun 23, 2023
@jamescowens
Copy link
Member

This is very curious. The failures are of the type...

test/wallet_tests.cpp(245): error: in "wallet_tests/coin_selection_tests": check nValueRet == 1.0105 * COIN has failed [101049999 != 101050000]

Which looks like a floating point rounding error.

Note That CI tests 32 bit linux for every PR merge and it is passing. See https://github.com/gridcoin-community/Gridcoin-Research/actions/runs/5292472929/jobs/9579392573 for example.

@jamescowens jamescowens added this to the Miss Piggy milestone Jun 24, 2023
@jamescowens
Copy link
Member

Can you provide more specifics of the Linux distribution/kernel version/etc you are running? My guess is that it is a version of openSUSE :).

@theMarix
Copy link
Contributor Author

` Note That CI tests 32 bit linux for every PR merge and it is passing. See https://github.com/gridcoin-community/Gridcoin-Research/actions/runs/5292472929/jobs/9579392573 for example.

This is good to know, as I wasn't sure if 32-bit x86 is actually even something that's meant to be supported.

Can you provide more specifics of the Linux distribution/kernel version/etc you are running? My guess is that it is a version of openSUSE :).

Yes. This is one openSUSE. Specifically, I ran into this with the 32-bit package for openSUSE Tumbleweed. To be fair, that already started a while ago but I only now had the time to look into it. The kernel in that case is Linux version: 6.3.9-1-default #1 SMP PREEMPT_DYNAMIC Thu Jun 22 03:53:43 UTC 2023 (0df701d). You can see the build log for those builds at https://build.opensuse.org/public/build/home:theMarix/openSUSE_Tumbleweed/i586/gridcoin/_log.

When reproducing via Podman, the kernel was 5.14.21-150500.53-default #1 SMP PREEMPT_DYNAMIC Wed May 10 07:56:26 UTC 2023 (b630043) x86_64 x86_64 x86_64 GNU/Linux which is quite a bit older, as I was running on Leap 15.5 in that case.

As in both cases the kernels are actually 64 bit ones (you can also see that from the build log on the OBS where it states machine type: x86_64) and the 64 bit builds on the same machines are fine, I suspect this is more related to the gcc version or some build flags than to the kernel.

@theMarix
Copy link
Contributor Author

So it seems this might be related to the combination of GCC13 and 32bit x86. Based on our discussion I just retried my reproducing example but explicitly selected GCC12 and in that case the tests pass.

@jamescowens
Copy link
Member

Hmm... do you think this is a compiler bug?

@theMarix
Copy link
Contributor Author

Might be a compiler bug. But might also be some edge-case that only this specific compiler exposes. A calculation like 1.0105 * COIN might strongly depend on how and in which order types are cast and I am not 100% sure on what the C and C++ specs say in that case.

@jamescowens
Copy link
Member

It very well could be an implicit casting issue. 1.0105 as a literal is normally a float. COIN is CAmount which is really int64_t underneath. Let me put casting statements to ensure the casting is being done the way we want it and we can recheck.

@jamescowens
Copy link
Member

jamescowens commented Jun 26, 2023

No dice even with explicit casting. I really think this is a compiler problem.

@jamescowens
Copy link
Member

See jamescowens@32da816 and the log:

output_fix_wallet_tests_gcc13_3.log

Notice that the changeover in integer arithmetic in that particular spot cleared the error. There are remaining ones.

All of these are conversions to/from floating point, and these are properly coded.

I think we really must be looking at a compiler issue.

@theMarix
Copy link
Contributor Author

Thanks for looking into this. I have to admit, there's two hearts within me regarding this issue. As a passionate software engineer, I'd love to get this down to a minimal example, report it upstream, and have the world improved for everybody. But as a packager with limited (time) resources, "just don't build on GCC 13 when on 32-bit x86" is what I'll be going for.

How far you want to carry this as the project: just relying on users not running builds for which the tests fail, adding an explicit warning, fixing it within your codebase, or reporting this upstream, is your call.

@theMarix
Copy link
Contributor Author

One more thing I noticed, it seems that on Fedora 32-bit x86 the build using GCC 13 works fine. Thus, this seems to be specific to either the GCC 13 shipped by SUSE or some build flags used by SUSE but not by Fedora.

@div72
Copy link
Member

div72 commented Jun 27, 2023

I don't exactly understand why you think the compiler has a bug in this case. In float literal * COIN, the CAmount would be promoted to a double and a floating point multiplication would be done. The fact that 1.0105 * COIN gets optimized to 101050000 during compile time should be implemention-defined behaviour. I feel like our code is wrong here.

@jamescowens
Copy link
Member

@div72, that is actually not where the failure is. It is on the other side of the test equality and stems from add_coin(0.0005 * COIN); at 239.

There are also failures like: this: test/gridcoin/researcher_tests.cpp(237): error: in "MiningProject/it_parses_a_project_xml_string": check project.m_rac == 123.45 has failed.

An inspection of this is that MiningProject::Parse(const std::string& xml) is not returning a float of exactly 123.45. This function calls bool ParseDouble(const std::string& str, double *out)
on the extracted RAC, which in this case is the string "123.45". I am no expert in IEEE spec and compiler specs, but 123.45 converted to float should not be anything other than 123.45, not 123.499999999...

You can argue all you want that this is implementation specific behavior, and I will counter-argue that this same code has not ever failed across many platforms and compilers.

Something is wrong with GCC 13 in 32 bit x86.

@jamescowens
Copy link
Member

jamescowens commented Jul 5, 2023

Ok. I just looked up the IEEE 754 spec.

For double precision, a number is represented by (−1)^(sign bit) * (1+fraction) * 2^(exponent - bias), where sign bit is 0 for positive and 1 for negative, the fraction is a binary fraction that can have up to 52 bits, and the exponent up to 11 bits, with a bias of 1023.

To convert the string 123.45 to an equivalent binary IEEE 754 double...

First the sign is positive, so the sign bit is 0.

Second, to get the exponent we do

floor(ln(123.45) / ln(2)), so the exponent is 6.
Including the bias of 1023, we have 1023 + 6 = 1029 = 10000000101 (binary) for the exponent

Third, to get the fraction, we do

fraction = 123.45 / 2^6 - 1 = 0.92890625 (exact in decimal)

In binary: to 52 bits we have 1110110111001100110011001100110011001100110011001101

Notice there is a repeating cell 0011 that repeats forever.

Given that 52 bits in this instance is an approximation the maximum error due to a change of 1 bit at the 52nd place:

(1 + 2^(-52)) * 2^6 - 1 * 2^6 = 2^(-52) * 2^6 = 2^(-52) = 2.22E-16. Since this is smaller than the last decimal in original number (123.45) by far, the error induced by truncating the repeating binary does not result in an error in representation, so effectively 123.45 can be represented exactly.

I would say this is compiler error. It isn't following the IEEE 754 standard.

@jamescowens
Copy link
Member

@theMarix are we still seeing this?

@theMarix
Copy link
Contributor Author

theMarix commented Dec 6, 2023

I'll need to check as I had pinned the package build to GCC 12 to avoid not having up-to-date packages on x86.

@theMarix
Copy link
Contributor Author

theMarix commented Dec 7, 2023

I can still reproduce with the latest development branch on current openSUSE Tumbleweed x86.

@jamescowens
Copy link
Member

jamescowens commented Feb 5, 2024

This jamescowens@fb06203 should fix it. Here I make use of the augmented Fraction class/Allocation class that I did for PR #2704.

@jamescowens
Copy link
Member

@theMarix can you try #2737 and verify it clears the issue?

@theMarix
Copy link
Contributor Author

I'll give it a look as soon as I can. I currently have to deal with some issues that prevent me from spending a lot of time at my workstation. So this might take until the next weekend.

@theMarix
Copy link
Contributor Author

@jamescowens , Just tested on the latest development branch and on your commit. Both cases still show the error.

@jamescowens
Copy link
Member

jamescowens commented Feb 17, 2024 via email

@jamescowens jamescowens reopened this Feb 18, 2024
@jamescowens
Copy link
Member

@theMarix I just tried the podman script you gave and am getting several errors. I had to add gawk and got farther, but then am getting errors with the bdb53 compile indicating a mixed arch problem. Did you update the Dockerfile script since the earlier runs above?

@theMarix
Copy link
Contributor Author

@jamescowens , it's weird that you had to update the file. But it is possible that some updates were needed over time. I probably adjusted those on the fly when updating the spec for the actual packaging so I might have forgotten to update it here. Attached the latest I've been using.
Dockerfile.suse.txt

Further, the following is the complete build and test log from running podman build --arch 386 -f Dockerfile.suse.txt . on commit 871aad2:
suse-x86.log

@jamescowens
Copy link
Member

jamescowens commented Feb 22, 2024

You had definitely made tweaks. The one above worked. :)

Here is a grepped version of the test output on the current dev branch. The file which I made changes is not generating any errors. We have errors coming from the magnitude to float conversion which has not failed anywhere else, and quite frankly should not fail. This is definitely a compiler problem. The compiler is out of spec. I have to think carefully about what to do here, because I do not want to weaken the tests simply to make this special case pass.

OpenSUSE i386 test errors.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment