[cuSolver] Avoid repeated ctxCreate/Destroy for all Lapack API calls. #298
Description
Improve LAPACK performance on the CUDA backend by avoiding repeated cuCtxCreate/cuCtxDestroy calls.
Apply the same placedContext_ logic already used for cuBLAS to cuSolver.
This avoids calling cuCtxCreate and cuCtxDestroy on every call when several LAPACK APIs are used in sequence.
For example, solving Ax=b with Cholesky factorization requires both lapack::potrf and lapack::potrs.
cuCtxCreate/cuCtxDestroy take much longer than most GPU LAPACK kernels; see the images below from nvvp diagnostics:
Before modification:
After modification:
Fix deprecation warnings from cuda.hpp for cuSolver ([BLAS] fix deprecation warnings from cuda.hpp #295).
Fix the bug in dft (mklgpu => mklcpu).
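For reviewers, the context-reuse idea behind the cuSolver change can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `MockContext`, `ContextCache`, `naive_call`, and `cached_call` are hypothetical names, and the mock only counts creations and destructions where a real backend would call cuCtxCreate/cuCtxDestroy or create a cuSolver handle. It assumes one cached context per device per thread.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Counters so the cost of each pattern can be observed without a GPU.
static int g_created = 0;
static int g_destroyed = 0;

// Hypothetical stand-in for a driver context. A real backend would wrap
// cuCtxCreate / cuCtxDestroy (or a cuSolver handle) here instead.
struct MockContext {
    int device;
    explicit MockContext(int dev) : device(dev) { ++g_created; }
    ~MockContext() { ++g_destroyed; }
};

// Naive pattern: every LAPACK-style call builds and tears down its own context.
inline void naive_call(int device) {
    MockContext ctx(device);
    (void)ctx;  // a real call would launch its kernel using ctx
}

// Cached pattern: a context is created on first use and then reused.
class ContextCache {
public:
    MockContext& get(int device) {
        auto it = cache_.find(device);
        if (it == cache_.end()) {
            it = cache_.emplace(device, std::make_unique<MockContext>(device)).first;
        }
        return *it->second;
    }

private:
    std::unordered_map<int, std::unique_ptr<MockContext>> cache_;
};

// One cache per thread (an assumption of this sketch), destroyed at thread exit.
inline ContextCache& thread_cache() {
    thread_local ContextCache cache;
    return cache;
}

inline void cached_call(int device) {
    MockContext& ctx = thread_cache().get(device);
    (void)ctx;  // a real call would launch its kernel using ctx
}
```

With this sketch, two consecutive `naive_call(0)` invocations (standing in for potrf followed by potrs) create and destroy two contexts, while two `cached_call(0)` invocations create only one and destroy none until the thread exits.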
Checklist
All Submissions
Do all unit tests pass locally? Attach a log.
unit_test_lapack.txt
unit_test_rand.txt
unit_test_blas.txt
Have you formatted the code using clang-format?