
Shrink aarch64 wheels #170

Open
mattip opened this issue Jul 29, 2024 · 16 comments

mattip (Collaborator) commented Jul 29, 2024

I wonder if the problem with the aarch64 builds on Travis CI is that we are running out of memory and the build process is killed (on manylinux/glibc); Travis has a 3 GB limit. Similar to issue #144 and PR #166, we should benchmark aarch64 on a high-end aarch64 machine.

@ev-br is this something you could do? Is the AWS m7g instance (with a graviton3 processor) advanced enough to use the THUNDERX3T110 kernels or is that targeting some other processor?


Mousius commented Jul 29, 2024

The THUNDERX3T110 target uses AdvSIMD only, whereas the NEOVERSEV1 target on the AWS M7g can use SVE. Mostly the SVE targets remap back to NEOVERSEV1 at the moment, so removing that would be pretty bad for performance.


Mousius commented Jul 29, 2024

I remapped any common targets back together in OpenMathLib/OpenBLAS#4389; unsure how to tell which targets are less used and could be removed 🤔

Also ref: https://github.com/OpenMathLib/OpenBLAS/blob/develop/Makefile.system#L686-L700

mattip (Collaborator, Author) commented Jul 29, 2024

Thanks. Is NEOVERSEV1 active when using GCC (like in the build here)?


ev-br commented Jul 29, 2024

BLAS-benchmarks runs on a c7g.large instance (https://aws.amazon.com/ec2/instance-types/c7g/) via https://github.com/OpenMathLib/BLAS-Benchmarks/blob/main/.cirun.yml
Would this be enough?

Also, does @czgdp1807's benchmarking machinery handle aarch64 architectures?


Mousius commented Jul 29, 2024

> Thanks. Is NEOVERSEV1 active when using GCC (like in the build here)?

In manylinux2014 with GCC 10.2 you should get the SVE targets.

For certain toolchains, such as builds constrained by MACOSX_DEPLOYMENT_TARGET, there isn't full SVE support, so it's disabled.

mattip (Collaborator, Author) commented Jul 29, 2024

> In manylinux2014 with GCC 10.2 you should get the SVE targets.

Cool, thanks

> BLAS-benchmarks runs on a c7g.large

That is Graviton3, so it should be as good as it gets.

> does @czgdp1807's benchmarking machinery handle aarch64 architectures?

I think so; you need to specify a different set of kernels. You can see which ones in the Makefile.system linked from this comment.
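
For reference, a minimal sketch of how such a per-kernel run can work (assuming a DYNAMIC_ARCH build of OpenBLAS, where the OPENBLAS_CORETYPE environment variable overrides kernel selection at runtime; the benchmark below is illustrative, not the actual scripts used here):

```sh
# Time a matmul once per kernel target; OPENBLAS_CORETYPE forces the
# kernel choice in a runtime-dispatched (DYNAMIC_ARCH) build.
for core in NEOVERSEV1 ARMV8SVE ARMV8 CORTEXA57 THUNDERX; do
  echo "=== $core ==="
  OPENBLAS_CORETYPE=$core python -m timeit \
    -s "import numpy as np; a = np.ones((1024, 1024))" "a @ a"
done
```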


ev-br commented Jul 31, 2024

OK, one benchmark: this is Linux on arm64 (not macOS), on a c7g.large machine on AWS:

{'arch': 'aarch64', 'cpu': '', 'machine': 'ip-172-31-6-241', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '3899308', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}
bench_linalg.Eindot.time_matmul_a_b
| arch          |     mean |   spread |   perf_ratios |
|:--------------|---------:|---------:|--------------:|
| NEOVERSEV1    | 0.10003  | 0.000357 |       1       |
| ARMV8SVE      | 0.106404 | 0.000465 |       1.06372 |
| CORTEXA73     | 0.122021 | 0.00047  |       1.21984 |
| ARMV8         | 0.12206  | 0.0002   |       1.22023 |
| CORTEXA710    | 0.122363 | 0.000195 |       1.22326 |
| TSV110        | 0.122464 | 0.000285 |       1.22427 |
| CORTEXA510    | 0.122549 | 0.000155 |       1.22512 |
| NEOVERSEN1    | 0.122552 | 0.00038  |       1.22515 |
| FALKOR        | 0.122615 | 0.000345 |       1.22578 |
| CORTEXA72     | 0.122624 | 0.000125 |       1.22587 |
| A64FX         | 0.122658 | 0.000415 |       1.22621 |
| CORTEXX2      | 0.122666 | 0.00016  |       1.22628 |
| EMAG8180      | 0.122683 | 0.00029  |       1.22645 |
| CORTEXA76     | 0.122714 | 0.000335 |       1.22676 |
| CORTEXX1      | 0.122719 | 0.00028  |       1.22682 |
| FT2000        | 0.122807 | 0.00014  |       1.2277  |
| CORTEXA57     | 0.122884 | 0.00027  |       1.22847 |
| VORTEX        | 0.122974 | 0.00038  |       1.22937 |
| NEOVERSEN2    | 0.123136 | 0.00039  |       1.23099 |
| THUNDERX3T110 | 0.125751 | 0.000185 |       1.25713 |
| THUNDERX2T99  | 0.127315 | 0.00061  |       1.27276 |
| CORTEXA55     | 0.152537 | 0.000585 |       1.5249  |
| CORTEXA53     | 0.153044 | 0.000605 |       1.52998 |
| THUNDERX      | 0.241916 | 0.00081  |       2.41843 |
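
For reading the table: perf_ratios looks like each target's mean time divided by the fastest mean (NEOVERSEV1's), e.g. for THUNDERX:

```sh
python -c "print(0.241916 / 0.10003)"   # ~2.41843, matching the last row
```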

The rest of the benchmarks are running; we'll see how different they look.


Mousius commented Jul 31, 2024

It'd be good to test these on an r8g instance as well, as that has 128-bit SVE; with the c7g you have 256-bit SVE, so the SVE kernels can perform differently. It's also worth noting that the A64FX target would benefit from being run on that specific core, as it has 512-bit SVE and slightly different kernels.
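
(For anyone reproducing this: on Linux/aarch64 the SVE vector length an instance exposes can be checked via procfs, assuming the kernel was built with SVE support; the value is in bytes, so 32 means 256-bit SVE.)

```sh
# 16 = 128-bit SVE (e.g. r8g), 32 = 256-bit (c7g), 64 = 512-bit (A64FX)
cat /proc/sys/abi/sve_default_vector_length
```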

mattip (Collaborator, Author) commented Jul 31, 2024

@Mousius could you weigh in on a possible set of kernels that makes sense? Over at #166 I suggested ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX, but had to use ARMV8 CORTEXA57 THUNDERX on the EOL musllinux_1_1 build since the GCC there (9.2) does not support SVE.
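
For reference, a hedged sketch of how a reduced kernel set could be passed to an OpenBLAS build (assuming DYNAMIC_LIST is honored for arm64 DYNAMIC_ARCH builds as it is on x86-64; not the exact invocation this repo uses):

```sh
# Runtime-dispatched OpenBLAS restricted to a short list of arm64 kernels.
make -j"$(nproc)" DYNAMIC_ARCH=1 TARGET=ARMV8 \
     DYNAMIC_LIST="ARMV8 CORTEXA57 NEOVERSEV1 THUNDERX"
```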


ev-br commented Jul 31, 2024


Mousius commented Aug 1, 2024

I've tried tweaking some constants in OpenMathLib/OpenBLAS#4833; if we do this, we could potentially ship just ARMV8 and ARMV8SVE without losing too much 🤔

Do you mind benchmarking these changes @ev-br ?


ev-br commented Aug 1, 2024

TL;DR: not easily, sadly.
Unless your changes are visible on codspeed, or will be visible on blas-benchmarks next Wednesday after your PR merges. Or if you have a suggestion for how to extend either the codspeed or blas-benchmarks setups to probe your changes.

There are two ways OpenBLAS benchmarks currently run: codspeed, and the weekly blas-benchmarks runs on AWS.

Both were set up as part of an STF project co-PI-ed by @martin-frbg and @rgommers. The AWS costs for the blas-benchmarks weekly runs are also picked up by Quansight (I believe).

I'm happy to help extend the set of benchmarks these two services run: do you have suggestions for what would be useful to add? I'm also happy to work on large-scale restructurings, but those will have to be cleared with Quansight first.

Neither of these has per-kernel granularity, though.
My one-off per-kernel runs in #144 (comment) and #170 (comment) are a bit different: those are numpy benchmarks, and they also rely on scipy-openblas32 wheels. Also worth noting that these runs rely on benchmarking scripts by Matti and Gagan, developed as part of some other Quansight-funded effort, not sure which one.

I was only able to run those one-off experiments because a) Matti and Gagan had the benchmarking scripts, b) I have the AWS setup ready from the blas-benchmarks work, and c) Quansight basically shrugged off the cost of a couple of hours of CPU and engineering time. I'm definitely happy to evolve either set of benchmarks or set up some other strategy, once it's cleared with Quansight.


So possible concrete steps:

Easy ones:

Needs some design:

  • do we want to automate benchmarking runs per kernel? If so:
    • which benchmarks: pure BLAS/LAPACK (similar to codspeed), or numpy benchmarks?
    • nightly scipy-openblas wheels, or from-source OpenBLAS builds?
    • on AWS? Who picks up the bill then?
    • how to trigger the runs? The numpy suite with per-kernel runs takes roughly 1-2 hours; that might be too much for each OpenBLAS PR, so a weekly/nightly run? Or a manual trigger (similar to scipy wheel builds; but then who triggers the runs)?

rgommers (Collaborator) commented Aug 1, 2024

I think this is fairly low-prio? To address the CI problem, I'd move from Travis CI to Cirrus CI and be done with it. The gain in binary size is much more limited than for x86-64, plus download numbers are way lower. So I don't think this is worth spending a lot of time on at the moment.


Mousius commented Aug 2, 2024

Hi @ev-br,

I meant the benchmarks in #170 (comment) only 😸

If those one-shot benchmarks show the ARMV8 target getting close enough to the NEOVERSEN1 target, and the ARMV8SVE target getting close enough to the NEOVERSEV1 target, then that's a good indication it'll work for a number of modern cores.

@mattip is it easy to use the infra in this repo to build from my branch of OpenBLAS? It'd be easier than trying to recreate the build parameters you've used 😸

@rgommers understood, hopefully this minimal step is enough 😸


ev-br commented Aug 2, 2024

> I meant the benchmarks in #170 (comment) only

Yeah, a technical hurdle here is that the numpy benchmarks need a Python wheel, and I'm not sure how to generate one from a local OpenBLAS build.

martin-frbg commented

🐸 only do flywheels, but perhaps it would be sufficient to replace the libscipy-openblas in numpy.libs with your identically named own build after installing the stock numpy wheel?
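
A rough sketch of that suggestion (the paths and the hashed file name are illustrative; the actual name of the bundled library under numpy.libs will differ):

```sh
# Install the stock numpy wheel, then swap in your own OpenBLAS build under
# the same file name so numpy's bundled-library lookup still resolves it.
pip install numpy
libsdir=$(python -c "import numpy, os; print(os.path.normpath(os.path.join(os.path.dirname(numpy.__file__), '..', 'numpy.libs')))")
ls "$libsdir"   # note the exact libscipy_openblas*.so name (it has a hash suffix)
cp /path/to/your/libscipy_openblas.so "$libsdir/libscipy_openblas64_-abc123.so"  # hypothetical name
```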
