
"Fast" solvers are slow for dense, complex matrix #159

Closed
simonp0420 opened this issue Jul 17, 2022 · 26 comments

Comments

@simonp0420 (Contributor)

According to the documentation, RFLUFactorization is supposed to be faster than backslash for matrices on the order of a few hundred. I did some quick benchmarking and I'm not finding this to be the case. Am I doing something wrong?

using LinearSolve, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
b = rand(ComplexF64, n)

@btime $A\$b ; # 474 μs

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, LUFactorization()); # 477 μs. As expected, similar to `\`

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob); # 20.4 ms.  Not sure what method is used here.

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, RFLUFactorization()); # 21.8 ms.  Forty-some times slower than BLAS LU

prob = LinearProblem(copy(A), copy(b))
method = LinearSolve.FastLUFactorization()
@btime solve($prob, $method); # 14.9 ms .  Better but still way slower than BLAS

using LinearAlgebra
BLAS.get_config()
#=
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.so
=#

I'm running Julia v1.7.3 and LinearSolve v1.22.1.

Also, I found that FastLUFactorization is not exported, so I had to invoke it as LinearSolve.FastLUFactorization().

@ChrisRackauckas (Member)

How many Julia threads are open on your system? Show versioninfo()

@ChrisRackauckas (Member)

@chriselrod

@rayegun (Collaborator) commented Jul 17, 2022

LUFact and FastLUFact should have approximately the same time for the initial solve I think? They're both getrf right?
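
For reference, here's a minimal sketch of timing the raw LAPACK call that both solvers should reduce to (plain LinearAlgebra, nothing package-specific), to get a baseline for the first factorization:

```julia
# Direct LAPACK baseline: the first factorization in both LUFactorization and
# FastLUFactorization should cost roughly this much, since both bottom out in
# getrf (zgetrf for ComplexF64).
using LinearAlgebra, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
@btime LinearAlgebra.LAPACK.getrf!(B) setup=(B = copy($A)) evals=1;
```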

@chriselrod (Contributor)

Those are complex matrices.
You could make a PR to add actual complex support to RecursiveFactorization.

Or wait for the LoopVectorization rewrite to be finished.
Development is active there, but it will still be a while.

@ChrisRackauckas (Member)

Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?
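
A minimal sketch of what that specialization could look like (a hypothetical helper, not LinearSolve's actual default-algorithm code): gate RFLUFactorization on the element types RecursiveFactorization.jl is optimized for, and fall back to BLAS LU otherwise.

```julia
using LinearSolve

# Hypothetical eltype-specialized default choice. RecursiveFactorization.jl
# is only fast for real hardware floats, so everything else goes to LAPACK.
function choose_lu(A::AbstractMatrix{T}) where {T}
    if T <: Union{Float32, Float64}
        return RFLUFactorization()
    else
        return LUFactorization()   # BLAS/LAPACK handles ComplexF64 well
    end
end
```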

@simonp0420 (Contributor, Author)

8 threads on my system:

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8

ChrisRackauckas added a commit that referenced this issue Jul 17, 2022
Summary:

- RecursiveFactorization choice looks for Float32|Float64 and Julia threads. This should fix the main issue in #159 that RFLUFactorization is not optimized on complex matrices
- CuArrays now supports LU factorizations
- If a standard matrix, then use FastLUFactorization

That last bit still has a question mark (#159 (comment))
@ChrisRackauckas (Member)

> LUFact and FastLUFact should have approximately the same time for the initial solve I think? They're both getrf right?

Yes, they should. I mean, FastLUFact is just an alternative interface to BLAS that skips rebuilding the workspace each time, so it's not just that it should be as fast as BLAS, it is a BLAS call 😅. So that's a pretty confusing result here that needs to get explored a bit more.
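
For context, the "skips rebuilding the workspace" part means preallocating the pivot/workspace storage and reusing it across factorizations. A minimal sketch of that idea using FastLapackInterface.jl (which FastLUFactorization builds on), assuming its LUWs workspace API:

```julia
using LinearAlgebra, FastLapackInterface

n = 200
A = rand(ComplexF64, n, n)
ws = LUWs(A)                            # allocate pivot/workspace storage once
F1 = LU(LAPACK.getrf!(ws, copy(A))...)  # factor without allocating a new ipiv
F2 = LU(LAPACK.getrf!(ws, copy(A))...)  # subsequent same-size factorizations reuse ws
```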

I opened a PR that tweaks the default algorithms a bit more: #160 . Though I think we should figure out what's going on with FastLUFactorization.

@chriselrod (Contributor) commented Jul 18, 2022

> Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?

LoopVectorization.jl currently only supports Union{Bool,Base.HWReal}.

If we have a special need for it somewhere, I could take the time to hack on support to LoopVectorization.jl, otherwise it's better to wait for the rewrite, where it will come for free.

@simonp0420 (Contributor, Author)

I'm happy to wait for the rewrite. However, I'm also not seeing the advertised speedup on Float64:

| Algorithm | n = 200 | n = 500 | n = 2000 |
| --- | --- | --- | --- |
| Backslash | 0.2 | 1.3 | 29 |
| (Default) | 0.2 | 1.2 | 29 |
| LUFactorization | 0.2 | 1.3 | 29 |
| RFLUFactorization | 0.2 | 1.2 | 48 |
| FastLUFactorization | 0.3 | 2.1 | 53.7 |

n is the order of the matrix and the table entries are times in msec reported by @btime.
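
For reproducibility, here's a sketch of the kind of loop that produces these numbers (solver list and sizes as in the table; the exact set of algorithms timed is an assumption):

```julia
using LinearSolve, BenchmarkTools

for n in (200, 500, 2000)
    A = rand(n, n)
    b = rand(n)
    t_bs = @belapsed $A \ $b                       # backslash baseline
    println("n = $n, backslash: $(round(1000t_bs; digits = 1)) ms")
    for alg in (LUFactorization(), RFLUFactorization())
        prob = LinearProblem(copy(A), copy(b))
        t = @belapsed solve($prob, $alg)
        println("n = $n, $(typeof(alg)): $(round(1000t; digits = 1)) ms")
    end
end
```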

@simonp0420 (Contributor, Author)

simonp0420 commented Jul 18, 2022

So the default algorithm was selected to be RFLUFactorization for the n = 200 and n = 500 cases, and it shouldn't be expected to be competitive for the n = 2000 case? Sounds good. But why isn't it any faster than backslash in the two smaller cases?

@rayegun (Collaborator) commented Jul 18, 2022

I need to investigate what's happening in FastLUFactorization. This should be taking exactly as long as Backslash, until you do a re-solve where it should be faster. Should be easy to find. I wonder if this is also true for Float64.
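
The re-solve pattern where the cached workspace and factorization should pay off looks like this (a sketch using LinearSolve's init/solve! caching interface, as in the current docs):

```julia
using LinearSolve

n = 200
A = rand(ComplexF64, n, n)
b1 = rand(ComplexF64, n)
b2 = rand(ComplexF64, n)

prob = LinearProblem(A, b1)
cache = init(prob, LinearSolve.FastLUFactorization())
sol1 = solve!(cache)   # first solve: pays for the getrf factorization
cache.b = b2
sol2 = solve!(cache)   # re-solve: should reuse the factorization (just getrs)
```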

@ChrisRackauckas (Member)

> But why isn't it any faster than backslash in the two smaller cases?

Can you run the benchmarks? https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/master/perf/lu.jl

And share your CPU? It was at least about 2x faster for the things we tested: JuliaLinearAlgebra/RecursiveFactorization.jl#28

ChrisRackauckas added a commit that referenced this issue Jul 18, 2022
Summary:

- RecursiveFactorization choice looks for Float32|Float64 and Julia threads. This should fix the main issue in #159 that RFLUFactorization is not optimized on complex matrices
- CuArrays now supports LU factorizations
- If a standard matrix, then use FastLUFactorization

That last bit still has a question mark (#159 (comment))
@simonp0420 (Contributor, Author) commented Jul 18, 2022

julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "Manjaro Linux"
  uname: Linux 5.10.126-1-MANJARO #1 SMP PREEMPT Mon Jun 27 10:02:42 UTC 2022 x86_64 unknown
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz: 
              speed         user         nice          sys         idle          irq
       #1  4507 MHz     318248 s        206 s     296408 s    9844998 s       8758 s
       #2  4513 MHz     320355 s        109 s     297091 s    9845778 s       9752 s
       #3  4497 MHz     323075 s        112 s     298174 s    9845949 s       5462 s
       #4  4592 MHz     318678 s        205 s     301067 s    9844768 s       5241 s
       #5  4555 MHz     319502 s        390 s     298411 s    9850884 s       5040 s
       #6  4574 MHz     315547 s        477 s     299586 s    9854946 s       5172 s
       #7  4499 MHz     321453 s         81 s     295622 s    9854642 s       4808 s
       #8  4571 MHz     322751 s        583 s     295157 s    9853703 s       4787 s
       
  Memory: 62.5269775390625 GB (46829.6015625 MB free)
  Uptime: 1.04880138e6 sec
  Load Avg:  1.15  1.33  0.77
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  DRAWHOME = /usr/share/opencascade/resources/DrawResources
  WINDOWPATH = 2
  HOME = /home/simonp
  PATH = /home/simonp/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
  TERM = xterm-256color

julia> include("perf/lu.jl")
[ Info: 4 × 4
...
[ Info: 500 × 500

The created file name is "lu_float64_1.7.3_skylake_8cores_OpenBLAS.png":
[plot: lu_float64_1.7.3_skylake_8cores_OpenBLAS.png]

Looks like the big advantage of fully recursive factorization is only for order n <= 180 or so. Still, it's really impressive that a pure Julia factorization is competitive with or better than BLAS throughout the full range of the tests.

@ChrisRackauckas (Member)

We should make a version of that which just does all of the LinearSolve.jl options.

I updated a few of the defaults in #160. I'll keep the cutoff at 500 for now since RF is never "bad", though I did change it to move away from RF if the user doesn't have Julia threads enabled. OpenBLAS performs badly on some CPUs, so that is a bit safer until sizes get large. But FastLUFactorization will stay out of the defaults until we figure out what's going on there (it's just using BLAS, so how is it slower 😅).

@simonp0420 (Contributor, Author) commented Jul 18, 2022

Thanks for your fast response on this issue. Looking forward to using this for complex matrices (ubiquitous in my work) in the future.

@chriselrod (Contributor)

How many threads did you use in your benchmark?

@simonp0420 (Contributor, Author)

Julia was started with 8 threads. BLAS threading was set by the script:

using LinearAlgebra, VectorizationBase

nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)

which looks like it would be 8 as well.

@simonp0420 (Contributor, Author)

For completeness, here is the result of benchmarking after using MKL:
[plot: lu_float64_1.7.3_skylake_8cores_MKL.png]

@ChrisRackauckas (Member)

Yup so that fits the heuristic of the default algorithm:

(length(b) <= 100 || (isopenblas() && length(b) <= 500))

If n <= 100, or if the backend is OpenBLAS and n <= 500, use RecursiveFactorization; otherwise use the provided BLAS. So I think that's pretty close to what you measured.
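
Spelled out as code (a sketch, not the package's exact source; isopenblas was an internal LinearSolve helper at the time):

```julia
# Default LU choice per the quoted heuristic.
use_rf = length(b) <= 100 || (LinearSolve.isopenblas() && length(b) <= 500)
alg = use_rf ? RFLUFactorization() : LUFactorization()
```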

I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

@chriselrod (Contributor)

Maybe we could even raise the 100; MKL only beat it at 160.
MKL wins because it is much better at multithreading. I had a plot somewhere where my computer (Cascade Lake) matched MKL up to 500.

I'd also be interested in seeing single-threaded plots of both, which would be relevant for any code trying to do solves in parallel and thus single-threading the LU.

> I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

We could, by depending on MKL_jll and ccalling directly.
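
A rough sketch of that approach (assumptions: MKL_jll's libmkl_rt exposes the LP64 Fortran symbol dgetrf_, and mkl_getrf! is a hypothetical name; a real implementation would also need to handle MKL's ILP64/threading interface layers):

```julia
using MKL_jll

# Call MKL's dgetrf directly, bypassing libblastrampoline entirely.
function mkl_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Cint}(undef, min(m, n))
    info = Ref{Cint}(0)
    ccall((:dgetrf_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{Cint}, Ref{Cint}, Ptr{Float64}, Ref{Cint}, Ptr{Cint}, Ptr{Cint}),
          m, n, A, max(1, stride(A, 2)), ipiv, info)
    info[] == 0 || error("dgetrf failed with info = $(info[])")
    return A, ipiv
end
```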

@simonp0420 (Contributor, Author)

> I'd also be interested in seeing single threaded plots of both

Here is the plot for single-threaded vs OpenBLAS:
[plot: lu_float64_1.7.3_skylake_1cores_OpenBLAS.png]
and here is the corresponding plot for MKL:
[plot: lu_float64_1.7.3_skylake_1cores_MKL.png]

@ChrisRackauckas (Member)

Let's take benchmarking discussions to #166 and MKL to #164. This issue is for FastLUFactorization being slow.

@simonp0420 (Contributor, Author)

I'm almost done with some similar runs on Windows. Want me to put the plots in a comment in #166?

@ChrisRackauckas (Member)

Yes please!

@ChrisRackauckas (Member)

FastLUFactorization in v2 is optimized to be non-allocating and all; see #313. But it doesn't seem to make a real difference, since that workspace is not the main cost and the copyto! can be slower. If you're solving multiple times then maybe it's faster, but at least on v2 it doesn't look to be an implementation issue.
