Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Add runner with Ubuntu on ARM64 #634

Open
wants to merge 2 commits into
base: devel
Choose a base branch
from

Conversation

mmuetzel
Copy link
Contributor

GitHub started hosting runners with Ubuntu on ARM64 processors for open source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

Add one configuration that is using these runners to the build matrix.

According to their blog post, the arm64 runners are more efficient and potentially faster than the x86_64 runners. (But it is still in a preview phase and maybe it will take some time for them to better scale and balance the load.) If that turns out to be true, it should be easy to switch more configurations in that workflow to the arm64 runners (and maybe keep only one or two running on x86_64).

@mmuetzel
Copy link
Contributor Author

Oof. More tests are failing on that runner than I would have hoped.
At least some of the failing tests (H1BasisEvaluation, SD_H1BasisEvaluation, pointload2) also fail on macOS (Apple Silicon). Maybe, there are some assumptions somewhere that only hold for Intel/AMD processors?

I don't have physical access to ARM64 hardware. Not sure if I could help track down any of that.
Maybe, valgrind or some sanitizers would be able to find something that is odd also on Intel/AMD?

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

Have you checked it's not a stack size problem ?

I might be wrong. But wouldn't that crash the program?
It looks like the numerics don't behave as expected. E.g., the last time steps before the test ConstantParamFunc failed:

  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 34/50:   6.800E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ComputeChange: NS (ITER=1) (NRM,RELC): ( 0.16009565     0.62252546E-01 ) :: heat equation
  HeatSolve: iter:    1 Assembly: (s)    0.00    0.00
  HeatSolve: iter:    1 Solve:    (s)    0.00    0.00
  HeatSolve:  Result Norm   :   0.16009565005742657
  HeatSolve:  Relative Change :    6.2252546152800785E-002
  MAIN: 
  MAIN: -------------------------------------
  MAIN: Time: 35/50:   7.000E+00
  MAIN: -------------------------------------
  MAIN: 
  HeatSolver: Solving the energy equation for temperature
  HeatSolve: 
  HeatSolve: 
  HeatSolve: -------------------------------------
  HeatSolve:  TEMPERATURE ITERATION           1
  HeatSolve: -------------------------------------
  HeatSolve: 
  HeatSolve: Starting Assembly...
  HeatSolve: Assembly:
  HeatSolve: Assembly done
  ERROR:: IterSolve: Numerical Error: System diverged over maximum tolerance.

I don't know what could be causing that though...

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

I switched the iterator on the ConstParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeater and fsi_box tests (in devel now).

Thank you for looking into this.

I rebased this PR on top of your latest changes. Let's see if this will make a difference.

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 23, 2025

That doesn't seem to have made much difference.
Still the following test failures:

The following tests FAILED:
	126 - ConstantParamFunc (Failed)                        serial
	155 - ConvergenceControl (Failed)                       serial transient
	292 - H1BasisEvaluation (Failed)                        benchmark serial
	301 - HeatControlExplicit (Failed)                      control quick serial
	433 - OptimizeSimplexFourHeaters (Failed)               control serial
	434 - OptimizeSimplexFourHeatersInt (Failed)            control serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	680 - TransientCostFourHeaters (Failed)                 serial
	801 - fsi_box (Failed)                                  elasticsolve fsi serial transient
	920 - pointload2 (Failed)                               serial

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 23, 2025 via email

@mmuetzel
Copy link
Contributor Author

Can I somehow misuse the git "push" triggered "Action" to do debugging (without waiting for too long) ?

Debugging using only GitHub actions is pretty tedious unfortunately. Unfortunately, I'm not aware of any option to log into the runners and run commands manually.
I would be very interested if someone finds out how to do that.

The next "best" thing (but still pretty bad) what I mostly resort to is:

  • Fork the repository to my own user account.
  • Enable running actions on that fork.
  • Optionally, disable any workflows that I'm currently not interested in. (This can be done on the "three-dot-menu" after selecting the respective workflow on the "Actions" tab.)
  • Optionally, modify the "configure" step of the workflow file in that fork to disable some build options (e.g., the GUI) to reduce the build time.
  • Optionally, modify the "check" step of the workflow file in that fork to reduce the number of tests that are being run.
  • Optionally, modify the source files in that fork to output more intermediate results.
  • Then iterate until there is some useful information.

That is still quite tedious and time consuming. It would be much easier if I knew how to stop in a debugger or anything like that in a GitHub action.

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 24, 2025

An observation about the linear system iterator fails on ARM64 platform: It seems that the DNRM2() - function used by all iterative methods fails at random intervals. I don't know the reason. I think this is linked in from the openblas() - library here ?

Good finding!

Afaict, Ubuntu 24.04 distributes OpenBLAS version 0.3.26.

I found the following commit for upstream OpenBLAS that could be related:
OpenMathLib/OpenBLAS@3345007

I.e., they changed the implementation of DNRM2 on the Neoverse N2 due to inaccuracies. That processor is used on the GitHub runners.
IIUC, the first version with that change is OpenBLAS 0.3.29.

I'll try to switch to using the reference BLAS implementation on the ARM64 runner. Maybe, that will make a difference.

GitHub started hosting runners with Ubuntu on ARM64 processors for open
source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/

Add one configuration that is using these runners to the build matrix.
@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 24, 2025

Switching to the reference BLAS implementation reduced the number of failing tests from 10 to four:

The following tests FAILED:
	292 - H1BasisEvaluation (Failed)                        benchmark serial
	540 - SD_H1BasisEvaluation (Failed)                     benchmark serendipity serial
	549 - SD_P2ndDerivatives (Failed)                       quick serial
	920 - pointload2 (Failed)                               serial
Error: Process completed with exit code 8.

That's a huge step forward. 🎉

The remaining test failures are mainly the same ones that are also failing on macOS. (Not sure if that is coincidental.)

The only other (and new) one might be because numerical results are slightly different using the reference implementation compared to an optimized implementation (like OpenBLAS). I haven't checked though...

@mmuetzel
Copy link
Contributor Author

Afaict, SD_P2ndDerivatives (the only new one with respect to the previously failing tests) is failing with an accuracy issue:

 Testing 2nd derivatives with p(8)...
(202)...PASSED (303)...PASSED (404)...PASSED (504)... dfyz:  0.85232585644089298       0.85232542765237440        4.2878851858052514E-007 >   4.2616292834281920E-007

@juharu
Copy link
Contributor

juharu commented Jan 24, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 25, 2025 via email

@juharu
Copy link
Contributor

juharu commented Jan 25, 2025 via email

@mmuetzel
Copy link
Contributor Author

mmuetzel commented Jan 26, 2025

Good finding.
Maybe, a miscompilation by gfortran of some OpenMP constructs for that CPU (aarch64) and/or for that OS (macOS)? It's probably not an issue in the OpenMP implementation itself because Ubuntu uses OpenMP as implemented by GCC (libopenmp) while macOS uses the OpenMP implementation by LLVM (libomp).
It's probably also possible that there is an issue with how OpenMP is used somewhere in the sources (and it just "happens" to work correctly for some combinations of operating system and used CPU).

Do you know which OpenMP constructs are causing the test errors? It's hard to find a bug report or potential workarounds with only using these broad keywords...

Is a combination of OpenMP and -O3 needed to reproduce the issue? Or is one of them enough?

Anyway: If I understand correctly, OpenMP is only used in parts of elmerfem. And for most use cases, it is not crucial whether OpenMP is used or not. OpenMP is deactivated by default in the build rules.
Would it be ok to just stop testing OpenMP on platforms/CPUs where it turned out that it is causing errors?

While looking through the code, I found a couple of places in NavierStokes.F90 and SParIterSolver.F90 where !omp is used instead of the (documented) !$omp.
I don't know if those are just typos or whether those instructions are meant to be disabled.
I'll open a separate PR with fixes for these potential typos. (Not sure if that helps with the issues here though.)
Edit: See #636

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants