
Shrink aarch64 wheels #170

Open
mattip opened this issue Jul 29, 2024 · 16 comments

mattip (Collaborator) commented Jul 29, 2024

I wonder if the problem with the aarch64 builds on Travis CI is that we are running out of memory and the build process is killed (on manylinux/glibc); Travis has a 3 GB limit. Similar to issue #144 and PR #166, we should benchmark aarch64 on a high-end aarch64 machine.

@ev-br is this something you could do? Is the AWS m7g instance (with a graviton3 processor) advanced enough to use the THUNDERX3T110 kernels or is that targeting some other processor?


Mousius commented Jul 29, 2024

The THUNDERX3T110 target uses AdvSIMD only, whereas the NEOVERSEV1 target on the AWS M7g can use SVE. Mostly the SVE targets remap back to NEOVERSEV1 at the moment, so removing that would be pretty bad for performance.


Mousius commented Jul 29, 2024

I remapped any common targets back together in OpenMathLib/OpenBLAS#4389; unsure how to tell which targets are less used and could be removed 🤔

Also ref: https://github.com/OpenMathLib/OpenBLAS/blob/develop/Makefile.system#L686-L700

mattip (Collaborator, Author) commented Jul 29, 2024

Thanks. Is NEOVERSEV1 active when using GCC (like in the build here)?


ev-br commented Jul 29, 2024

BLAS-benchmarks runs on a c7g.large instance (https://aws.amazon.com/ec2/instance-types/c7g/) via https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.cirun.yml
Would this be enough?

Also, does @czgdp1807's benchmarking machinery handle aarch64 architectures?


Mousius commented Jul 29, 2024

> Thanks. Is NEOVERSEV1 active when using GCC (like in the build here)?

In manylinux2014 with GCC 10.2 you should get the SVE targets.

For certain toolchains, such as builds constrained by MACOSX_DEPLOYMENT_TARGET, there isn't full SVE support, so it's disabled.

mattip (Collaborator, Author) commented Jul 29, 2024

> In manylinux2014 with GCC 10.2 you should get the SVE targets.

Cool, thanks

> BLAS-benchmarks runs on a c7g.large

That is Graviton3, so it should be as good as it gets.

> does @czgdp1807's benchmarking machinery handle aarch64 architectures?

I think so; you need to specify a different set of kernels. You can see which ones in the Makefile.system linked from this comment.
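
For reference, a minimal sketch of how such a per-kernel run can work (assuming a DYNAMIC_ARCH build of OpenBLAS, where the OPENBLAS_CORETYPE environment variable overrides kernel selection at runtime; the benchmark below is illustrative, not the actual scripts used here):

```sh
# Time a matmul once per kernel target; OPENBLAS_CORETYPE forces the
# kernel choice in a runtime-dispatched (DYNAMIC_ARCH) build.
for core in NEOVERSEV1 ARMV8SVE ARMV8 CORTEXA57 THUNDERX; do
  echo "=== $core ==="
  OPENBLAS_CORETYPE=$core python -m timeit \
    -s "import numpy as np; a = np.ones((1024, 1024))" "a @ a"
done
```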


ev-br commented Jul 31, 2024

OK, one benchmark: this is Linux on arm64 (not macOS), on a c7g.large machine on AWS:

{'arch': 'aarch64', 'cpu': '', 'machine': 'ip-172-31-6-241', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '3899308', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}
bench_linalg.Eindot.time_matmul_a_b
| arch          |     mean |   spread |   perf_ratios |
|:--------------|---------:|---------:|--------------:|
| NEOVERSEV1    | 0.10003  | 0.000357 |       1       |
| ARMV8SVE      | 0.106404 | 0.000465 |       1.06372 |
| CORTEXA73     | 0.122021 | 0.00047  |       1.21984 |
| ARMV8         | 0.12206  | 0.0002   |       1.22023 |
| CORTEXA710    | 0.122363 | 0.000195 |       1.22326 |
| TSV110        | 0.122464 | 0.000285 |       1.22427 |
| CORTEXA510    | 0.122549 | 0.000155 |       1.22512 |
| NEOVERSEN1    | 0.122552 | 0.00038  |       1.22515 |
| FALKOR        | 0.122615 | 0.000345 |       1.22578 |
| CORTEXA72     | 0.122624 | 0.000125 |       1.22587 |
| A64FX         | 0.122658 | 0.000415 |       1.22621 |
| CORTEXX2      | 0.122666 | 0.00016  |       1.22628 |
| EMAG8180      | 0.122683 | 0.00029  |       1.22645 |
| CORTEXA76     | 0.122714 | 0.000335 |       1.22676 |
| CORTEXX1      | 0.122719 | 0.00028  |       1.22682 |
| FT2000        | 0.122807 | 0.00014  |       1.2277  |
| CORTEXA57     | 0.122884 | 0.00027  |       1.22847 |
| VORTEX        | 0.122974 | 0.00038  |       1.22937 |
| NEOVERSEN2    | 0.123136 | 0.00039  |       1.23099 |
| THUNDERX3T110 | 0.125751 | 0.000185 |       1.25713 |
| THUNDERX2T99  | 0.127315 | 0.00061  |       1.27276 |
| CORTEXA55     | 0.152537 | 0.000585 |       1.5249  |
| CORTEXA53     | 0.153044 | 0.000605 |       1.52998 |
| THUNDERX      | 0.241916 | 0.00081  |       2.41843 |
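
For reading the table: perf_ratios looks like each target's mean time divided by the fastest mean (NEOVERSEV1's), e.g. for THUNDERX:

```sh
python -c "print(0.241916 / 0.10003)"   # ~2.41843, matching the last row
```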

The rest of the benchmarks are running; we'll see how different they look.


Mousius commented Jul 31, 2024

It'd be good to test these on an r8g instance as well, as that has 128-bit SVE; with the c7g you have 256-bit SVE, so the SVE kernels can perform differently. It's also worth noting that the A64FX target would benefit from being run on that specific core, as it has 512-bit SVE and slightly different kernels.
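
(For anyone reproducing this: on Linux/aarch64 the SVE vector length an instance exposes can be checked via procfs, assuming the kernel was built with SVE support; the value is in bytes, so 32 means 256-bit SVE.)

```sh
# 16 = 128-bit SVE (e.g. r8g), 32 = 256-bit (c7g), 64 = 512-bit (A64FX)
cat /proc/sys/abi/sve_default_vector_length
```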

mattip (Collaborator, Author) commented Jul 31, 2024

@Mousius could you weigh in on a possible set of kernels that makes sense? Over at #166 I suggested ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX, but had to use ARMV8 CORTEXA57 THUNDERX on the EOL musllinux_1_1 build since the GCC there (9.2) does not support SVE.
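
For reference, a hedged sketch of how a reduced kernel set could be passed to an OpenBLAS build (assuming DYNAMIC_LIST is honored for arm64 DYNAMIC_ARCH builds as it is on x86-64; not the exact invocation this repo uses):

```sh
# Runtime-dispatched OpenBLAS restricted to a short list of arm64 kernels.
make -j"$(nproc)" DYNAMIC_ARCH=1 TARGET=ARMV8 \
     DYNAMIC_LIST="ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX"
```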


ev-br commented Jul 31, 2024


Mousius commented Aug 1, 2024

I've tried tweaking some constants in OpenMathLib/OpenBLAS#4833; if we do this, we could potentially ship just ARMV8 and ARMV8SVE without losing too much 🤔

Do you mind benchmarking these changes @ev-br ?


ev-br commented Aug 1, 2024

TL;DR: not easily, sadly.
Unless your changes are visible on codspeed, or will be visible on blas-benchmarks next Wednesday after your PR merges. Or if you have a suggestion for how to extend either the codspeed or blas-benchmarks setups to probe your changes.

There are two ways OpenBLAS benchmarks currently run: codspeed, and the weekly blas-benchmarks runs on AWS.

Both were set up as part of an STF project co-PI-ed by @martin-frbg and @rgommers. The AWS costs for the blas-benchmarks weekly runs are also picked up by Quansight (I believe).

I'm happy to help extend the set of benchmarks these two services run: do you have suggestions for what would be useful to add? I'm also happy to work on large-scale restructurings, but those will have to be cleared with Quansight first.

Neither of these has per-kernel granularity, though.
My one-off per-kernel runs in #144 (comment) and #170 (comment) are a bit different: those are numpy benchmarks, and they also rely on scipy-openblas32 wheels. Also worth noting that these runs rely on benchmarking scripts by Matti and Gagan, developed as part of some other Quansight-funded effort, not sure which one.

I was only able to run those one-off experiments because a) Matti and Gagan had the benchmarking scripts, b) I have the AWS setup ready from the blas-benchmarks work, and c) Quansight basically shrugged off the cost of a couple of hours of CPU and engineering time. I'm definitely happy to evolve either set of benchmarks or set up some other strategy, once it's cleared with Quansight.


So possible concrete steps:

Easy ones:

Needs some design:

  • do we want to automate benchmarking runs per kernel? If so:
    • which benchmarks: pure BLAS/LAPACK (similar to codspeed), or numpy benchmarks?
    • nightly scipy-openblas wheels, or from-source OpenBLAS builds?
    • on AWS? Who picks up the bill then?
    • how to trigger the runs? The numpy suite with per-kernel runs takes roughly 1-2 hours; that might be too much for each OpenBLAS PR, so a weekly/nightly run? Or a manual trigger (similar to scipy wheel builds; but then who triggers the runs)?

rgommers (Collaborator) commented Aug 1, 2024

I think this is fairly low-prio? To address the CI problem, I'd move from Travis CI to Cirrus CI and be done with it. The gain in binary size is much more limited than for x86-64, plus download numbers are way lower. So I don't think this is worth spending a lot of time on at the moment.


Mousius commented Aug 2, 2024

Hi @ev-br,

I meant the benchmarks in #170 (comment) only 😸

If those one-shot benchmarks show the ARMV8 target getting close enough to the NEOVERSEN1 target, and the ARMV8SVE target getting close enough to the NEOVERSEV1 target, then that's a good indication it'll work for a number of modern cores.

@mattip is it easy to use the infra in this repo to build from my branch of OpenBLAS? It'd be easier than trying to recreate the build parameters you've used 😸

@rgommers understood, hopefully this minimal step is enough 😸


ev-br commented Aug 2, 2024

> I meant the benchmarks in #170 (comment) only

Yeah, a technical hurdle here is that the numpy benchmarks need a Python wheel, and I'm not sure how to generate one from a local OpenBLAS build.

martin-frbg commented

🐸 only do flywheels, but perhaps it would be sufficient to replace the libscipy-openblas in numpy.libs with your identically named own build after installing the stock numpy wheel?
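
A rough sketch of that suggestion (the paths and the hashed file name are illustrative; the actual name of the bundled library under numpy.libs will differ):

```sh
# Install the stock numpy wheel, then swap in your own OpenBLAS build under
# the same file name so numpy's bundled-library lookup still resolves it.
pip install numpy
libsdir=$(python -c "import numpy, os; print(os.path.normpath(os.path.join(os.path.dirname(numpy.__file__), '..', 'numpy.libs')))")
ls "$libsdir"   # note the exact libscipy_openblas*.so name (it has a hash suffix)
cp /path/to/your/libscipy_openblas.so "$libsdir/libscipy_openblas64_-abc123.so"  # hypothetical name
```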
