
"Fast" solvers are slow for dense, complex matrix #159

Closed
simonp0420 opened this issue Jul 17, 2022 · 26 comments

Comments

@simonp0420 (Contributor)

According to the documentation, RFLUFactorization is supposed to be faster than backslash for matrices on the order of a few hundred. I did some quick benchmarking and I'm not finding this to be the case. Am I doing something wrong?

using LinearSolve, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
b = rand(ComplexF64, n)

@btime $A\$b ; # 474 μs

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, LUFactorization()); # 477 μs. As expected, similar to `\`

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob); # 20.4 ms.  Not sure what method is used here.

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, RFLUFactorization()); # 21.8 ms.  Forty-some times slower than BLAS LU

prob = LinearProblem(copy(A), copy(b))
method = LinearSolve.FastLUFactorization()
@btime solve($prob, $method); # 14.9 ms .  Better but still way slower than BLAS

using LinearAlgebra
BLAS.get_config()
#=
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.so
=#

I'm running Julia v1.7.3 and LinearSolve v1.22.1.

Also, I found that FastLUFactorization is not exported, so I had to invoke it as LinearSolve.FastLUFactorization().

@ChrisRackauckas (Member)

How many Julia threads are open on your system? Show versioninfo()

@ChrisRackauckas (Member)

@chriselrod

@rayegun (Collaborator) commented Jul 17, 2022

LUFact and FastLUFact should have approximately the same time for the initial solve I think? They're both getrf right?
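
For reference, here's a minimal sketch of timing the raw LAPACK call that both solvers should reduce to (plain LinearAlgebra, nothing package-specific), to get a baseline for the first factorization:

```julia
# Direct LAPACK baseline: the first factorization in both LUFactorization and
# FastLUFactorization should cost roughly this much, since both bottom out in
# getrf (zgetrf for ComplexF64).
using LinearAlgebra, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
@btime LinearAlgebra.LAPACK.getrf!(B) setup=(B = copy($A)) evals=1;
```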

@chriselrod (Contributor)

Those are complex matrices.
You could make a PR to add actual complex support to RecursiveFactorization.

Or wait for the LoopVectorization rewrite to be finished.
Development is active there, but it will still be a while.

@ChrisRackauckas (Member)

Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?
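
A minimal sketch of what that specialization could look like (a hypothetical helper, not LinearSolve's actual default-algorithm code): gate RFLUFactorization on the element types RecursiveFactorization.jl is optimized for, and fall back to BLAS LU otherwise.

```julia
using LinearSolve

# Hypothetical eltype-specialized default choice. RecursiveFactorization.jl
# is only fast for real hardware floats, so everything else goes to LAPACK.
function choose_lu(A::AbstractMatrix{T}) where {T}
    if T <: Union{Float32, Float64}
        return RFLUFactorization()
    else
        return LUFactorization()   # BLAS/LAPACK handles ComplexF64 well
    end
end
```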

@simonp0420 (Contributor, Author)

8 threads on my system:

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8

ChrisRackauckas added a commit that referenced this issue Jul 17, 2022
Summary:

- RecursiveFactorization choice looks for Float32|Float64 and Julia threads. This should fix the main issue in #159 that RFLUFactorization is not optimized on complex matrices
- CuArrays now supports LU factorizations
- If a standard matrix, then use FastLUFactorization

That last bit still has a question mark (#159 (comment))
@ChrisRackauckas (Member)

> LUFact and FastLUFact should have approximately the same time for the initial solve I think? They're both getrf right?

Yes, they should. I mean, FastLUFact is just an alternative interface to BLAS that skips rebuilding the workspace each time, so it's not just that it should be as fast as BLAS, it is a BLAS call 😅. So that's a pretty confusing result here that needs to get explored a bit more.
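
For context, the "skips rebuilding the workspace" part means preallocating the pivot/workspace storage and reusing it across factorizations. A minimal sketch of that idea using FastLapackInterface.jl (which FastLUFactorization builds on), assuming its LUWs workspace API:

```julia
using LinearAlgebra, FastLapackInterface

n = 200
A = rand(ComplexF64, n, n)
ws = LUWs(A)                            # allocate pivot/workspace storage once
F1 = LU(LAPACK.getrf!(ws, copy(A))...)  # factor without allocating a new ipiv
F2 = LU(LAPACK.getrf!(ws, copy(A))...)  # subsequent same-size factorizations reuse ws
```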

I opened a PR that tweaks the default algorithms a bit more: #160 . Though I think we should figure out what's going on with FastLUFactorization.

@chriselrod (Contributor) commented Jul 18, 2022

> Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?

LoopVectorization.jl currently only supports Union{Bool,Base.HWReal}.

If we have a special need for it somewhere, I could take the time to hack on support to LoopVectorization.jl, otherwise it's better to wait for the rewrite, where it will come for free.

@simonp0420 (Contributor, Author)

I'm happy to wait for the rewrite. However, I'm also not seeing the advertised speedup on Float64:

| Algorithm | n = 200 | n = 500 | n = 2000 |
| --- | --- | --- | --- |
| Backslash | 0.2 | 1.3 | 29 |
| (Default) | 0.2 | 1.2 | 29 |
| LUFactorization | 0.2 | 1.3 | 29 |
| RFLUFactorization | 0.2 | 1.2 | 48 |
| FastLUFactorization | 0.3 | 2.1 | 53.7 |

n is the order of the matrix and the table entries are times in msec reported by @btime.
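
For reproducibility, here's a sketch of the kind of loop that produces these numbers (solver list and sizes as in the table; the exact set of algorithms timed is an assumption):

```julia
using LinearSolve, BenchmarkTools

for n in (200, 500, 2000)
    A = rand(n, n)
    b = rand(n)
    t_bs = @belapsed $A \ $b                       # backslash baseline
    println("n = $n, backslash: $(round(1000t_bs; digits = 1)) ms")
    for alg in (LUFactorization(), RFLUFactorization())
        prob = LinearProblem(copy(A), copy(b))
        t = @belapsed solve($prob, $alg)
        println("n = $n, $(typeof(alg)): $(round(1000t; digits = 1)) ms")
    end
end
```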

@simonp0420 (Contributor, Author)

simonp0420 commented Jul 18, 2022

So the default algorithm was selected to be RFLUFactorization for the n = 200 and n = 500 cases, and it shouldn't be expected to be competitive for the n = 2000 case? Sounds good. But why isn't it any faster than backslash in the two smaller cases?

@rayegun (Collaborator) commented Jul 18, 2022

I need to investigate what's happening in FastLUFactorization. This should be taking exactly as long as Backslash, until you do a re-solve where it should be faster. Should be easy to find. I wonder if this is also true for Float64.
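
The re-solve pattern where the cached workspace and factorization should pay off looks like this (a sketch using LinearSolve's init/solve! caching interface, as in the current docs):

```julia
using LinearSolve

n = 200
A = rand(ComplexF64, n, n)
b1 = rand(ComplexF64, n)
b2 = rand(ComplexF64, n)

prob = LinearProblem(A, b1)
cache = init(prob, LinearSolve.FastLUFactorization())
sol1 = solve!(cache)   # first solve: pays for the getrf factorization
cache.b = b2
sol2 = solve!(cache)   # re-solve: should reuse the factorization (just getrs)
```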

@ChrisRackauckas (Member)

> But why isn't it any faster than backslash in the two smaller cases?

Can you run the benchmarks? https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/master/perf/lu.jl

And share your CPU? It was at least about 2x faster for the things we tested: JuliaLinearAlgebra/RecursiveFactorization.jl#28

ChrisRackauckas added a commit that referenced this issue Jul 18, 2022
Summary:

- RecursiveFactorization choice looks for Float32|Float64 and Julia threads. This should fix the main issue in #159 that RFLUFactorization is not optimized on complex matrices
- CuArrays now supports LU factorizations
- If a standard matrix, then use FastLUFactorization

That last bit still has a question mark (#159 (comment))
@simonp0420 (Contributor, Author) commented Jul 18, 2022

julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "Manjaro Linux"
  uname: Linux 5.10.126-1-MANJARO #1 SMP PREEMPT Mon Jun 27 10:02:42 UTC 2022 x86_64 unknown
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz: 
              speed         user         nice          sys         idle          irq
       #1  4507 MHz     318248 s        206 s     296408 s    9844998 s       8758 s
       #2  4513 MHz     320355 s        109 s     297091 s    9845778 s       9752 s
       #3  4497 MHz     323075 s        112 s     298174 s    9845949 s       5462 s
       #4  4592 MHz     318678 s        205 s     301067 s    9844768 s       5241 s
       #5  4555 MHz     319502 s        390 s     298411 s    9850884 s       5040 s
       #6  4574 MHz     315547 s        477 s     299586 s    9854946 s       5172 s
       #7  4499 MHz     321453 s         81 s     295622 s    9854642 s       4808 s
       #8  4571 MHz     322751 s        583 s     295157 s    9853703 s       4787 s
       
  Memory: 62.5269775390625 GB (46829.6015625 MB free)
  Uptime: 1.04880138e6 sec
  Load Avg:  1.15  1.33  0.77
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  DRAWHOME = /usr/share/opencascade/resources/DrawResources
  WINDOWPATH = 2
  HOME = /home/simonp
  PATH = /home/simonp/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
  TERM = xterm-256color

julia> include("perf/lu.jl")
[ Info: 4 × 4
...
[ Info: 500 × 500

The created file name is "lu_float64_1.7.3_skylake_8cores_OpenBLAS.png":
[plot: lu_float64_1.7.3_skylake_8cores_OpenBLAS.png]

Looks like the big advantage of fully recursive factorization is only for order n <= 180 or so. Still, it's really impressive that a pure Julia factorization is competitive with or better than BLAS throughout the full range of the tests.

@ChrisRackauckas (Member)

We should make a version of that which just does all of the LinearSolve.jl options.

I updated a few of the defaults in #160. I'll keep the cutoff at 500 for now since RF is never "bad", though I did change it to move away from RF if the user doesn't have Julia threads enabled. OpenBLAS performs badly on some CPUs, so that is a bit safer until sizes get large. But FastLUFactorization will stay out of the defaults until we figure out what's going on there (it's just using BLAS, so how is it slower 😅).

@simonp0420 (Contributor, Author) commented Jul 18, 2022

Thanks for your fast response on this issue. Looking forward to using this for complex matrices (ubiquitous in my work) in the future.

@chriselrod (Contributor)

How many threads did you use in your benchmark?

@simonp0420 (Contributor, Author)

Julia was started with 8 threads. BLAS threading was set by the script:

using LinearAlgebra, VectorizationBase

nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)

which looks like it would be 8 as well.

@simonp0420 (Contributor, Author)

For completeness, here is the result of benchmarking after using MKL:
[plot: lu_float64_1.7.3_skylake_8cores_MKL.png]

@ChrisRackauckas (Member)

Yup so that fits the heuristic of the default algorithm:

(length(b) <= 100 || (isopenblas() && length(b) <= 500))

If n <= 100, or if the backend is OpenBLAS and n <= 500, use RecursiveFactorization; otherwise use the provided BLAS. So I think that's pretty close to what you measured.
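
Spelled out as code (a sketch, not the package's exact source; isopenblas was an internal LinearSolve helper at the time):

```julia
# Default LU choice per the quoted heuristic.
use_rf = length(b) <= 100 || (LinearSolve.isopenblas() && length(b) <= 500)
alg = use_rf ? RFLUFactorization() : LUFactorization()
```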

I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

@chriselrod (Contributor)

Maybe we could even raise the 100; MKL only beat it at 160.
MKL wins because it is much better at multithreading. I had a plot somewhere where my computer (Cascade Lake) matched MKL up to 500.

I'd also be interested in seeing single-threaded plots of both, which would be relevant for any code trying to do solves in parallel and thus single-threading the LU.

> I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

We could, by depending on MKL_jll and ccalling directly.
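
A rough sketch of that approach (assumptions: MKL_jll's libmkl_rt exposes the LP64 Fortran symbol dgetrf_, and mkl_getrf! is a hypothetical name; a real implementation would also need to handle MKL's ILP64/threading interface layers):

```julia
using MKL_jll

# Call MKL's dgetrf directly, bypassing libblastrampoline entirely.
function mkl_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Cint}(undef, min(m, n))
    info = Ref{Cint}(0)
    ccall((:dgetrf_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{Cint}, Ref{Cint}, Ptr{Float64}, Ref{Cint}, Ptr{Cint}, Ptr{Cint}),
          m, n, A, max(1, stride(A, 2)), ipiv, info)
    info[] == 0 || error("dgetrf failed with info = $(info[])")
    return A, ipiv
end
```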

@simonp0420 (Contributor, Author)

> I'd also be interested in seeing single threaded plots of both

Here is the plot for single-threaded vs OpenBLAS:
[plot: lu_float64_1.7.3_skylake_1cores_OpenBLAS.png]
and here is the corresponding plot for MKL:
[plot: lu_float64_1.7.3_skylake_1cores_MKL.png]

@ChrisRackauckas (Member)

Let's take benchmarking discussions to #166 and MKL to #164. This issue is for FastLUFactorization being slow.

@simonp0420 (Contributor, Author)

I'm almost done with some similar runs on Windows. Want me to put the plots in a comment in #166?

@ChrisRackauckas (Member)

Yes please!

@ChrisRackauckas (Member)

FastLUFactorization in v2 is optimized to be non-allocating and all; see #313. But it doesn't seem to make a real difference, since that workspace is not the main cost and the copyto! can be slower. If you're solving multiple times then maybe it's faster, but at least on v2 it doesn't look to be an implementation issue.
