
Iterative solver on OpenCL (GPU) devices #199

Open
GoogleCodeExporter opened this issue Aug 12, 2015 · 7 comments

Labels: comp-Logic (Related to internal code logic), OpenCL (Running on GPUs and similar devices), performance (Simulation speed, memory consumption), pri-Medium (Worth assigning to a milestone)


On recent GPU devices the matrix-vector multiplication in adda is as fast as the preparation of the next argument vector within the iterative solver (currently done by the CPU). Therefore, the iterative solver should also run on the GPU, both to avoid transferring vectors between host and device on each iteration and to speed up the computation. Since most of the functions executed by the iterative solvers in adda are level-1 (vector) basic linear algebra functions, the clAmdBlas library can potentially be employed to improve the execution speed further. This would mainly improve computation speed for larger grids and high dipole counts.



Original issue reported on code.google.com by [email protected] on 31 May 2014 at 3:36
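
For illustration, a minimal sketch of what such a device-side vector update could look like, assuming the clBLAS C API (the current name of clAmdBlas); the buffer and queue names are hypothetical and error handling is omitted:

```c
#include <clBLAS.h>  /* requires clblasSetup() once at program start */

/* x <- x + alpha*p, performed entirely in device memory, so no host-device
 * vector transfer is needed between the matvec and the solver update */
static cl_int update_argvec(cl_command_queue queue, cl_mem buf_x, cl_mem buf_p,
                            cl_double2 alpha, size_t nlocal)
{
	cl_event done;
	clblasStatus st = clblasZaxpy(nlocal, alpha,
	                              buf_p, 0, 1,   /* source vector p */
	                              buf_x, 0, 1,   /* destination vector x */
	                              1, &queue, 0, NULL, &done);
	if (st != clblasSuccess) return (cl_int)st;
	return clWaitForEvents(1, &done);
}
```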

GoogleCodeExporter added the OpSys-All, comp-Logic, performance, pri-Medium, and OpenCL labels on Aug 12, 2015
myurkin added the feature (Allows new functionality) label and removed the Type-Enhancement label on Aug 13, 2015
myurkin added this to the 1.5 milestone on Jul 10, 2018

myurkin commented Nov 30, 2020

This has already been using clBLAS for some time; see #204.

myurkin modified the milestones: 1.5 → 1.6 on Apr 24, 2021
myurkin removed the feature (Allows new functionality) label on Apr 24, 2021

myurkin commented Jun 2, 2024

clBLAS is no longer developed, which has recently caused an unusual bug (#331). This adds motivation to switch to another library. For instance, CLBlast is actively (and better) developed and has an API similar to that of clBLAS.
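
As a rough, untested sketch of what the switch could look like for one level-1 routine (buffer names are hypothetical), the calls should mostly map one-to-one:

```c
#include <clBLAS.h>     /* current dependency */
#include <clblast_c.h>  /* CLBlast C API */

/* clBLAS (current code path): y <- alpha*x + y; needs clblasSetup()/clblasTeardown() */
clblasZaxpy(n, alpha, buf_x, 0, 1, buf_y, 0, 1, 1, &queue, 0, NULL, NULL);

/* CLBlast equivalent: same operation, single queue, no global setup call needed */
CLBlastZaxpy(n, alpha, buf_x, 0, 1, buf_y, 0, 1, &queue, NULL);
```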


myurkin commented Jun 4, 2024

Another application of clBLAS is computing the inner product inside matvec. It is used only by a few iterative solvers (in particular, not by BiCG, the only one currently using clBLAS), but it involves transferring a large buffer from GPU memory. The latter can become a bottleneck once other optimizations are implemented.
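
A hedged sketch (assuming the clBLAS C API; buffer names hypothetical, error handling omitted) of computing that inner product on the device, so that only one complex scalar is copied back instead of the whole vector:

```c
#include <clBLAS.h>

/* conj(x).y computed on the device; only sizeof(cl_double2) bytes are read back */
static cl_double2 device_dotc(cl_context ctx, cl_command_queue queue,
                              cl_mem buf_x, cl_mem buf_y, size_t n)
{
	cl_int err;
	cl_double2 res;
	cl_event done;
	cl_mem buf_dot = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(cl_double2), NULL, &err);
	/* clBLAS dot routines require a scratch buffer of at least n elements */
	cl_mem scratch = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n*sizeof(cl_double2), NULL, &err);

	clblasZdotc(n, buf_dot, 0, buf_x, 0, 1, buf_y, 0, 1, scratch, 1, &queue, 0, NULL, &done);
	clEnqueueReadBuffer(queue, buf_dot, CL_TRUE, 0, sizeof(cl_double2), &res, 1, &done, NULL);

	clReleaseMemObject(scratch);
	clReleaseMemObject(buf_dot);
	return res;
}
```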


myurkin commented Jun 4, 2024

Another issue arises with clBLAS or, more generally, when the whole iteration is executed on the GPU. The only natural synchronization point is when the residual is updated (or some other scalar coefficients are computed), so the timing of the matrix-vector product becomes completely inadequate. The only ways to fix this are either to measure timing inside kernels (but I am not sure whether that is possible) or to add some ad hoc synchronization points. The latter may affect performance, but probably not significantly (still, this can be tested). There have been similar considerations for the MPI timing, but I could not find any discussion in the issues (maybe there is some in the source code).
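
On the first option: standard OpenCL event profiling can time individual commands on the device without extra host synchronization, provided the queue is created with CL_QUEUE_PROFILING_ENABLE; whether this fits ADDA's existing timing infrastructure is untested. A minimal sketch:

```c
#include <CL/cl.h>

/* Device-side execution time of one kernel launch, in milliseconds.
 * The wait here is only for the sketch; in practice the event can be
 * queried later, after a natural synchronization point. */
static double kernel_time_ms(cl_command_queue queue, cl_kernel kern, size_t gsize)
{
	cl_event ev;
	cl_ulong t0, t1;

	clEnqueueNDRangeKernel(queue, kern, 1, NULL, &gsize, NULL, 0, NULL, &ev);
	clWaitForEvents(1, &ev);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
	clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
	clReleaseEvent(ev);

	return (t1 - t0)*1e-6;  /* profiling counters are in nanoseconds */
}
```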


myurkin commented Jul 3, 2024

This actually applies to many OpenCL issues, but here are tests of the current ocl mode in ADDA (including with OCL_BLAS) on various GPUs (versus the seq mode on different CPUs). This was performed together with Michel Gross. See the file comparison.pdf for details; the conclusions so far are:

  1. the main bottleneck is the 3D FFT rather than moving memory to/from the GPU;
  2. OCL_BLAS helps a lot on fast GPUs because it accelerates the BLAS operations (not because it removes the memory transfers);
  3. for fast GPUs, the bottleneck is memory bandwidth (for the 3D FFT) rather than raw computational power (TFLOPS); thus, switching to single precision (Add compile option to use single precision #119) is not expected to provide huge gains (factors of up to 64 based on TFLOPS values for some GPUs) but rather close to a two-fold acceleration (based on memory bandwidth);
  4. there are other issues (Port scattered-fields calculation to the GPU #226, Port Fourier transform of the interaction matrix (D-matrix) to GPU #248) that may cause a major drop in performance for some problems, so ADDA is far from being mature in this respect.

As a side note, we have never seriously considered CUDA, so as not to be limited to Nvidia GPUs. Still, in a limited number of tests, the CUDA FFT routines showed themselves to be about 1.5 times faster than clFFT. I guess a systematic comparison of the two has been performed by others.
