Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite LAPACK so it's not in F77 #7

Open
itamarst opened this issue Aug 13, 2024 · 10 comments
Open

Rewrite LAPACK so it's not in F77 #7

itamarst opened this issue Aug 13, 2024 · 10 comments

Comments

@itamarst
Copy link

itamarst commented Aug 13, 2024

From @ilayn, who can hopefully expand:

We must (full emphasis on must) port LAPACK out of F77 for fast computing as scientific community (of all languages). LAPACK codebase is treated as the scripture, and for a good reason because the folks who wrote the algorithms have immense insight and expertise. But that does not mean they coded the most optimal code and also F77 does not lend itself to anything modern hardware offers. You get only what gfortran offers for the loop optimization. While the algorithms are rock solid, they have no way other than writing for loops as if there is no tomorrow. I know this because I spent reading similar code for the good chunk of last year due to META: FORTRAN Code inventory · Issue #18566 · scipy/scipy · GitHub. LAPACK routines are often older than SciPy’s *PACK libraries.

@ilayn
Copy link

ilayn commented Aug 13, 2024

Indeed this needs some explanation because 99% of the responses will be "yeah right, sure" which is not the audience I care about so I don't feel discouraged by any of such responses hence feel free if you feel like it.

Acknowledgements

I think current and past authors and contributors of BLAS and LAPACK should be knighted and placed in chateaus and mansions with 4 lambos and 2 yachts each. There is no way to measure their contributions to the scientific community. It is in fact make me feel a bit uneasy given how much money is poured into AI and not to this fundamental library.

So I will make some remarks below but this is purely from future-proof perspective and nothing related to past decisions. It is what it is. And in fact I know maintainers of Reference LAPACK already contemplate about similar points. But if I summarize them here, it is going to open up a different discussion which I want to skip completely.

Meat of the argument

The goal is to store LAPACK in a format that is both

  • Accessible to be compiled in any platform with no significant toolchain problems,
  • Readable to be ported to any platform with no significant preliminary knowledge in that language.

and currently it is neither with fortran77.

Very brief and opinionated personal summary;

Currently, LAPACK is the fundamental reason why we need fortran compilers or even we need to care about fortran in general if we are not using fortran.

BLAS has been ported to C/Assembly multiple times in various forms and except the reference implementation, nobody uses it in its native form. If BLAS is to NumPy, LAPACK is to SciPy in our case; LAPACK heavily uses BLAS operations but essentially it is just a never-ending stream of for loops and index olympics to do clever stuff with arrays. Thus it has no special property that blocks such porting to whichever language we would be using.

Therefore fundamentally, F77 is the gatekeeper of taking that code and implementing anywhere that has a suitable compiler of any other language. In other words, if the reference implementation was in C, it would have been substantially easier to port it to almost all platforms under the blue sky. And I would speculate that it would have been translated to many languages.

In fact the totality of the numerical community is forced to either find a fortran compiler or use a compiled library but still forced to have a Fortran layer on top to prevent ABI compatibility issues. As every SciPy maintainer by now knows, it is just an endless pain point. Plus, initiatives such as Pyodide are always getting stuck at wrapping fortran code and anytime SciPy rewrites anything from F77 to C they remove a good chunk of unnecessary code that was meant for handling f77.

Besides, Fortran community is dwarfed by C community and also when it comes to low-level optimizations, lot of folks who can perform SSE optimizations cannot do anything about the LAPACK code since it is in Fortran. So it is a loss of capacity and skill. Anything to be done is already being done to BLAS such as BLIS or GotoBLAS/OpenBLAS, since it is a smaller codebase and also simple to follow the functionality goals, but for LAPACK very little has been done, say, OpenBLAS having a custom parallelized ?getrf written in C and so on. Therefore to be able to increase the attention that LAPACK gets from the international community it has to be archived in a proper common and accessible language. And I argue here that F77 is not that. Therefore a port is sorely missed. And similarly that's why modern fortran is not relevant here. Since if a port will take place, we don't need to modernize fortran we need to get it out to a more common language.

So far there have been only two, battle tested, and proven, software languages that has true archival properties. First one is obvious, Fortran77 and the other one is C (I'm talking about multi-decades). Hence we are left with C. Now, it would be unfair to call C as a better fortran alternative when it comes to numerical array computing. Because it absolutely is not. It has less unergonomic parts however it is just tedious to have straightforward code. But a C array is just a pacifier to hide pointer arithmetic. And we are talking about a huge array library. Some might say well what is wrong with A[z + m*y + m*n*x] or A[x][y][z] instead of A[x, y, z] and we don't need to have a debate. I take these syntax as self-evident. You would too if you spent your days in writing array code. And this is the reason C++ is not even considered. It is very difficult to write unabstracted and sound numerical code in C++ without using some strange new features that have questionable future such as C++2017 std::vector all the way to C++ 2023 std::mdspan. It seems like every year another fancy container is discovered in the standard. In C++ 2026 even LAPACK is supposed to be coming in the std::linalg https://www.youtube.com/watch?v=-UXHMlAMXNk. No offense to C++ devs but even the examples look worse than F77 versions. Hence I have a strong opinion that this is not something we should port to.

The most problematic issue for C is Microsoft and its relentless insistence on not conforming to C99 standards. Therefore, it is always some stupid boilerplate code that needs to be included at the top of the project structure and can hurt in the long run. However, it has been done countless times and can stay as a necessary evil. Or can be left for the folks who want to use MSVC. If open source C compilers become widely adopted in Windows ecosystem then probably the best strategy might be is to drop MSVC completely and be done with it. But until then, it would hurt majority of the users. This might sound bad but it is much worse for Windows if Fortran is used already which is why we are here.

An ode to fortran though

Fortran, old or new, regardless of its age, has superior array ergonomics, hands down, no competition. It is fantastic in that regard, so much so that, NumPy borrowed almost all identically. But that's where good things about F77 end for us. Because it is typically 1-indexed and column major memory layout, it does not play well with anything else (except julia and matlab). It is possible to change the indexing in fortran however LAPACK code did not. Hence one of the biggest annoyances of such rewrite will be 2D array index bonanza if C to be selected but the advantages so far outweigh this very important annoyance.

Language designers completely miss the number crunching community and we are always piggybacking on regular langs such as Python's @ operator and ndarray silliness of all C inspired languages and VLAs and countless other ergonomics killing details and so on.

In summary, fortran is a very good language to stick with, if you don't interface with other languages (ctypes are not a thing in this context) or you are not packaging anything to be run on all platforms. Unfortunately, the majority of the contemporary community does and are. Also its future is bound to a very particular audience and to a very small committee skewed towards HPC use cases.

Obligatory Rust and Zig section

Why not Zig or Rust? Rust already has a couple of attempts, a very nice one is https://github.com/sarah-ek/faer-rs and a there are few others via ndalgebra crate. But its impossible looking syntax is killing the starting rationale of this rewrite.

Zig has multidimensional arrays but it is no different than C and also does not come with complex numbers which is in my opinion two very important trains missed. It is now available in std::math but I can't make heads or tails from this issue ziglang/zig#16278. Typically, folks who don't use complex numbers dismiss this as "It's just a struct what about it?!?" and an endless debate ensues.

Why not others langs?

I don't know them. Put it below if you want to make a case. But we might as well use C to get all the middle man out despite all the problems given above.

#IFNDEF HORROR, it's here

So what kind of a beast are we talking about? Well, as of 20240813 using cloc tool

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Fortran 77                    3548           3236         922596         622695
C                             2870          15422          90590         166352
C/C++ Header                    12           3144            393          51340
Fortran 90                      30            364           8594           6040
CMake                           40            520            537           5594
make                            19            445            492           4837
TeX                              1            242            178           1264
YAML                             5             66             44            413
Python                           1             33             23            271
Markdown                         7             53              0            266
XML                              2             17              8            251
-------------------------------------------------------------------------------
SUM:                          6535          23542        1023455         859323
-------------------------------------------------------------------------------

This is a utterly spectacularly annoying problem. However before you close the tab, notice that this is the totality of BLAS/LAPACK tests and other stuff multiple versions etc. If we just look at the SRC directory of LAPACK (no BLAS no nothing else)

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Fortran 77                    2111           2641         573661         314089
Fortran 90                      18             38           6919           3302
C/C++ Header                     1              2              0           2275
make                             2             55             72            584
CMake                            1             27             39            526
-------------------------------------------------------------------------------
SUM:                          2133           2763         580691         320776
-------------------------------------------------------------------------------

Now it is still at least spectacularly annoying but halved. However note that most LAPACK code is in 4 flavors. That is the same routine for <s, d, c, z> (single, double, single complex, double complex) hence this is actually around 90k SLOC of code or original code. It does not reduce the pain but it definitely reduce the suffering.

Now that is just very annoying but it is precisely how much fortran77 code we had in SciPy too and we are currently passed the halfway point nearing 70%. It took me 16 months to get to here mostly 1/8 FTE or 5 hours/week on average. To be honest that's pretty much doable in my book given my track record in SciPy rewrites.

Also note that not all LAPACK code is essential. Many parts of it has deprecated things or historical warts.

Last points

There is no more important code in the scientific ecosystem than LAPACK. Your blackhole research or cure for cancer can resume in a much much better state if we collectively do this and be done with it because then we can in fact handle this unspoken embarrassing situation.

As I have written in my acknowledgements, we must support and lift the original reference library to the next level.

@ilayn
Copy link

ilayn commented Aug 13, 2024

In case it will avoid pointless debates for the future, I don't like working with C especially after using C3, Zig and bunch of other langs which fix "a different part of C and not the parts I need". I am just forced into it because of the lack of another specialized array language. My unicorn wish is to have a truly "arrays are at the center" language. Think of the ever so wonderful Lua and tables, but instead <insert lang> and arrays.

@rgommers
Copy link

A few small point to summarize the pages of thoughts I could have written about this topic (not enough time, sorry):

(1) For SciPy, once we get rid of all our own Fortran code, we have no fundamental problem with Reference LAPACK being written in F77. The problem we have with BLAS and LAPACK is an ABI problem - multiple problems actually, from LAPACKE not being well-enough supported to the routines like cdotc with two incompatible ABIs, to ILP64 symbol names. That the reference implementation is in Fortran doesn't matter so much, since we never want to use it at runtime (except maybe for debugging).

(2) I'm honestly not sure if it would help for LAPACK to be rewritten. Whether some vendor or open source effort writes an accelerated implementation is also based on need. The payoff for an individual routine to be written with SIMD/assembly/etc. is much smaller for LAPACK than it is for BLAS, so the chances of that effort being done are also much lower.

@ilayn
Copy link

ilayn commented Aug 16, 2024

A few small point to summarize the pages of thoughts I could have written about this topic (not enough time, sorry):

We should get to a retreat in a cabin somewhere so we can compare pages.

(1) I can also go the other way, the reason why you have ABI problems is actually f77. The very reason for LAPACKE existing is also f77. I can give you another word that hides a saga; LSAME. Or you can think of NaN/INF handling, if you think of all the ABI issues and which language they are originating from, there is a pattern there. The main issue is the absence of a formal fortran ABI. I understand your view since this is a SciPy pain and in your view if the ABI was OK, why bother. However, this is not a SciPy problem and much larger. And f77 has blocked all the possible progress with the exact argument so far spelled out who knows how many times.

(2) This one I don't follow yet. Whatever makes BLIS fast can be applied to intermediate steps in LAPACK too which I think what https://github.com/flame/libflame was supposed to be in the first place. It is just not as tidy as BLAS routines. If you can unroll things in BLAS, you can still unroll in LAPACK too. Also this assumes that there is no room for improvement in LAPACK which is quite not true. Even LU can do much better. That's what Julia/Rust folks are trying to generalize in many different codebases. But the code cannot be modified as is and has to be rewritten. The main problem is that it is just not feasible to decode all that old crust and reorganize the new code and squeeze it back into F77 without causing mayhem. After quite some thousands of line of code translation, I can guarantee you, there are low-hanging fruits left and right everywhere. You can do the same exercise, just look at our translation of minpack or quadpack C code. There are many places I know that it can be 10 lines long and 10 times faster instead of bare looping with insane-level index tracking across nested functions.

The point I wanted to make is that it is possible to do optimizations without knowing the actual algorithm if you are using C if you encounter a for loop. In f77 you need a surgery. Much easier in modern fortran but not in f77. Also it is not "practically" possible to do it in F77 unless you know everything about the algorithm because the algorithm is baked into indices such as if if (mod(j, m).eq.1) GOTO 888. Why 1? Why mod? Because f77 does not have proper data structures so every flow is guided by counting with single letter indices hiding massive opportunities of performance.

We were holding the same idea about our fortran codebase being not feasible to translate, needs unreasonable human hours, they are battle-tested, crazily optimized yada yada. And I am not that blind or that stupid, I saw them everywhere too. We had all the reasons not to do it but now we are where we are. If we do not do this step, regardless of the bugs that will seep through, we are not going to go anywhere and still be discussing ABI issues in the decades to come which is a big waste of time to treat the symptom. To be honest, I am starting to think that everybody is making too much of a deal about these old algorithms. They are not magical code written in Ocaml or Assembly but just flat out, for loops. There are just too many of them.

@alonme
Copy link

alonme commented Sep 23, 2024

@ilayn What do you think should be the next step?

A couple more questions:

  1. How good is the test coverage in LAPACK?
  2. What is the coverage of "functions" that we actually need from LAPACK which is already implemented in the rust libraries (just as an example) - it might be simpler to translate from rust to C

@ilayn
Copy link

ilayn commented Sep 23, 2024

I don't have a concrete plan yet to start the work but the following are something that would help understanding the situation;

  • Opinion of actual maintainers of these libs would be really nice. If they say this is stupid, that's also a data point. I just want to know what the current feeling is. Because this is always thought as impossible task and I tend to agree. I'm just trying to understand how much impossible it is.
  • If C is the right solution or not.
  • Should the lwork query be eliminated or not.
  • Most algorithm is written with column-major in mind. So there will be some changes needed (loop orders to start with).

and many more questions. However, before all this a few examples should be done I think such as ?gesv, ?getrf, ?potrf then by swapping the underlying BLAS vendor such OpenBLAS, BLIS and so on would give some idea of what we are dealing with.

How good is the test coverage in LAPACK?

Not sure, but there is some. They are quite strict (up to a few ULP) hence when there is a test I'd expect it to have good coverage. But this is never guaranteed.

What is the coverage of "functions" that we actually need from LAPACK which is already implemented in the rust libraries

I don't know any direct Rust versions. But many rewrites are providing the same functionality by re-coding (not translating). But I am not sure, I did not really sit down for a comprehensive research.

Note that NumPy used to avoid LAPACK by using the so-called "LAPACK LITE" in a similar fashion but it was just f2c'd LAPACK.

@itamarst
Copy link
Author

Keeping in mind that I am not a linear algebra person: I tend to think of C as legacy language at this point, and more broadly any language that lacks memory safety by default. So personally I'd go with Rust for any new infrastructure projects; syntax can be learnt, a C ABI can be exposed... My impression is that https://docs.rs/num/latest/num/complex/struct.Complex.html is the standardized Rust-specific API/ABI for complex numbers, it's used by ndarray, nalgebra, faer, and sprs crates. The matrixmultiply crate (inspired by BLIS) appears to use an experimental respresentation that is ABI-compatible with the previous one. (C ABI compat for complexs seems like more of a a pain though, because it sounds like there's no cross-architecture ABI: rust-lang/rfcs#793).

A random thought: https://github.com/immunant/c2rust transpiler is apparently successful enough to be used in some real world projects, e.g. as the starting point for a video codec (https://www.memorysafety.org/blog/porting-c-to-rust-for-av1/). One can imagine the equivalent for Fortran 77.

@ilayn
Copy link

ilayn commented Sep 24, 2024

As I tried to argue above, C is indeed the legacy language however memory safety is not an issue on these decades old code. Hence it is a matter of whether all the necessary details are included in the language. The complex number issue is one, like you point out, but also the 2D array libraries are still competing so it seems to me that there is a very high cost of choosing the wrong one.

One can imagine the equivalent for Fortran 77.

This only exists for F77->Modern Fortran with partial success so far and I don't think there is any help coming anytime soon.

@fedebenelli
Copy link

Have you guys seen that there is a Modern Fortran implementation? https://github.com/perazz/fortran-lapack

It is already included on the Fortran stdlib with preprocessing for different types: https://github.com/fortran-lang/stdlib/blob/master/src/stdlib_linalg_lapack.fypp

@ilayn
Copy link

ilayn commented Sep 24, 2024

I do and have discussed this on fortran-lang discourse. It is a reformulation of the modern function declaration with little or no code rewrite (their parsing python script is also available in case interested) hence it is not solving the problems I have mentioned above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants