CI: Add runner with Ubuntu on ARM64 #634
base: devel
Oof. More tests are failing on that runner than I would have hoped. I don't have physical access to ARM64 hardware. Not sure if I could help track down any of that. |
Hi
Thanks again for your work!
Certainly shouldn't be anything processor specific here. I can also try running with valgrind.
Have you checked that it's not a stack size problem?
Br, Juha
From: "Markus Mützel"
Sent: Thursday, 23 January, 2025 10:22:02
Subject: Re: [ElmerCSC/elmerfem] CI: Add runner with Ubuntu on ARM64 (PR #634)
Oof. More tests are failing on that runner than I would have hoped.
At least some of the failing tests (H1BasisEvaluation, SD_H1BasisEvaluation, pointload2) also fail on macOS (Apple Silicon). Maybe there are some assumptions somewhere that only hold for Intel/AMD processors?
I don't have physical access to ARM64 hardware. Not sure if I could help track down any of that.
Maybe valgrind or some sanitizers would be able to find something that is odd on Intel/AMD as well?
|
Valgrind runs are clean on my Ubuntu laptop ...
|
I might be wrong. But wouldn't that crash the program?
I don't know what could be causing that though... |
Right, somewhat harder to figure out then, I guess, at least with nothing to test on ...
|
The following tests FAILED:
All of the following tests suddenly break the linear system iterator. All use the same BiCGStab scheme.
I don't know if there is something wrong with it: maybe it is miscompiled, or maybe the numerical scheme is
at fault (it's known to have some issues, the specifics escape me atm). The remedy might be to change the
iterator scheme, e.g. BiCGStab -> BiCGStabL?
126 - ConstantParamFunc (Failed) serial
155 - ConvergenceControl (Failed) serial transient
301 - HeatControlExplicit (Failed) control quick serial
433 - OptimizeSimplexFourHeaters (Failed) control serial
434 - OptimizeSimplexFourHeatersInt (Failed) control serial
680 - TransientCostFourHeaters (Failed) serial
801 - fsi_box (Failed) elasticsolve fsi serial transient
I'll try to look at these, but mostly these might be inconsequential ...
540 - SD_H1BasisEvaluation (292 -
292 - H1BasisEvaluation (Failed) benchmark serial
Slightly different result norm. Someone should have a look at whether this is OK anyway ...
920 - pointload2 (Failed) serial
|
I switched the iterator on the ConstantParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeaters and fsi_box tests (in devel now).
I left out the OptimizeSimplex tests for the time being. The result norms of these tests seem to be somewhat unstable with respect to changes of:
1) Iterator scheme
2) Convergence Tolerance
3) Preconditioning
Br, Juha
|
Thank you for looking into this. I rebased this PR on top of your latest changes. Let's see if this will make a difference. |
That doesn't seem to have made much difference.
|
OK, thanks, I'll have a look at the log.
|
So it seems to be something more complicated than the iterator failure. Can I somehow misuse the "Action" triggered by git "push"
to do debugging (without waiting too long)?
|
Debugging using only GitHub Actions is pretty tedious, unfortunately. I'm not aware of any option to log into the runners and run commands manually. The next "best" thing (but still pretty bad) that I mostly resort to is:
That is still quite tedious and time consuming. It would be much easier if I knew how to stop in a debugger or anything like that in a GitHub action. |
An observation about the linear system iterator failures on the ARM64 platform:
It seems that the DNRM2() function used by all iterative methods fails at random intervals. I don't know
the reason. I think this is linked in from the OpenBLAS library here?
If I use my own norm function to replace it, everything starts to work. Other OpenBLAS functions don't
seem to have problems.
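For illustration only: the replacement norm function mentioned here is not shown in this thread. The following is a minimal sketch of what a hand-written stand-in for DNRM2 could look like, written in Python rather than Elmer's Fortran. The names `scaled_norm2` and `count_mismatches` are hypothetical, and the scaling approach is the classic textbook one, not necessarily what was used in Elmer.

```python
import math
from typing import Callable, Sequence


def scaled_norm2(x: Sequence[float]) -> float:
    """Euclidean norm with running scaling, so squares cannot overflow."""
    scale = 0.0  # largest |x_i| seen so far
    ssq = 1.0    # sum of squares of x_i / scale
    for value in x:
        a = abs(value)
        if a == 0.0:
            continue
        if a > scale:
            # Rescale the accumulated sum to the new, larger scale.
            ssq = 1.0 + ssq * (scale / a) ** 2
            scale = a
        else:
            ssq += (a / scale) ** 2
    return scale * math.sqrt(ssq)


def count_mismatches(candidate: Callable[[Sequence[float]], float],
                     x: Sequence[float],
                     repeats: int = 100,
                     rtol: float = 1e-12) -> int:
    """Call a candidate norm repeatedly and count results that deviate
    from scaled_norm2 (hypothetical helper for spotting sporadic failures)."""
    reference = scaled_norm2(x)
    bad = 0
    for _ in range(repeats):
        if not math.isclose(candidate(x), reference, rel_tol=rtol):
            bad += 1
    return bad


print(scaled_norm2([3.0, 4.0]))  # prints 5.0
```

A check like `count_mismatches` is only useful for failures that show up "at random intervals" as described above; a result of zero does not prove that a norm routine is correct.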
|
Tests
ConstantParamFunc, ConvergenceControl, HeatControlExplicit, TransientCostFourHeaters, fsi_box, OptimizeSimplexFourHeaters,
OptimizeSimplexFourHeatersInt
all work out of the box after this change.
This is something not seen on any other platform.
|
Good finding! Afaict, Ubuntu 24.04 distributes OpenBLAS version 0.3.26. I found the following commit for upstream OpenBLAS that could be related: I.e., they changed the implementation of DNRM2 on the Neoverse N2 due to inaccuracies. That processor is used on the GitHub runners. I'll try to switch to using the reference BLAS implementation on the ARM64 runner. Maybe, that will make a difference. |
Switching to the reference BLAS implementation reduced the number of failing tests from 10 to four:
That's a huge step forward. 🎉 The remaining test failures are mainly the same ones that are also failing on macOS. (Not sure if that is coincidental.) The only other (and new) one might be because numerical results are slightly different using the reference implementation compared to an optimized implementation (like OpenBLAS). I haven't checked though... |
Afaict,
|
OK, that doesn't seem too bad, I'll have a look ...
From: "Markus Mützel"
Sent: Friday, 24 January, 2025 18:10:44
Subject: Re: [ElmerCSC/elmerfem] CI: Add runner with Ubuntu on ARM64 (PR #634)
Afaict, SD_P2ndDerivatives (the only new one with respect to the previously failing tests) is failing with an accuracy issue:
Testing 2nd derivatives with p(8)...
(202)...PASSED (303)...PASSED (404)...PASSED (504)... dfyz: 0.85232585644089298 0.85232542765237440 4.2878851858052514E-007 > 4.2616292834281920E-007
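To put the size of this failure in perspective, the numbers from the dfyz line can be checked by hand. This is a small Python sketch that only reuses the four values printed above; which of the two values is the computed one and which the reference, and how the test derives its tolerance, are not stated in the log, so the variable names are assumptions.

```python
# Values copied from the dfyz line of the test log.
value_a = 0.85232585644089298
value_b = 0.85232542765237440
tolerance = 4.2616292834281920e-07

# The test fails because the absolute difference exceeds the tolerance.
difference = abs(value_a - value_b)
exceeded = difference > tolerance

# How far the difference overshoots the tolerance, in percent.
overshoot_percent = 100.0 * (difference / tolerance - 1.0)

print(f"difference = {difference:.6e}, tolerance = {tolerance:.6e}")
print(f"exceeded: {exceeded}, overshoot: {overshoot_percent:.2f} %")
```

The difference exceeds the tolerance by well under one percent, which fits the reading that this is an accuracy issue rather than a wrong result.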
|
Another observation about the H1BasisEvaluation tests: they only fail on ARM64 if OpenMP is activated.
|
So it takes -O3 + OpenMP to trigger the failure on ARM64 (and probably on macos-14), even when running with 1 task; otherwise and elsewhere it is OK.
|
Good finding. Do you know which OpenMP constructs are causing the test errors? It's hard to find a bug report or potential workarounds with only using these broad keywords... Is a combination of OpenMP and Anyway: If I understand correctly, OpenMP is only used in parts of elmerfem. And for most use cases, it is not crucial whether OpenMP is used or not. OpenMP is deactivated by default in the build rules. While looking through the code, I found a couple of places in |
GitHub started hosting runners with Ubuntu on ARM64 processors for open source projects for free:
https://github.blog/changelog/2025-01-16-linux-arm64-hosted-runners-now-available-for-free-in-public-repositories-public-preview/
Add one configuration that is using these runners to the build matrix.
According to their blog post, the arm64 runners are more efficient and potentially faster than the x86_64 runners. (But it is still in a preview phase and maybe it will take some time for them to better scale and balance the load.) If that turns out to be true, it should be easy to switch more configurations in that workflow to the arm64 runners (and maybe keep only one or two running on x86_64).