This commit is a GPU port of module_bl_mynn.F90. OpenACC was used for… #1005
… the port. The code was run with IM, the number of columns, equal to 10240. For 128 levels the GPU is 19X faster than one CPU core. For 256 levels the GPU is 26X faster than one CPU core.

An OpenACC directive was added to bl_mynn_common.f90. While OpenACC directives are ignored by CPU compilations, extensive changes to module_bl_mynn.F90 were required to optimize for the GPU. Consequently, the GPU port of module_bl_mynn.F90, while producing bit-for-bit CPU results, runs 20% slower on the CPU. The GPU run produces results that are within roundoff of the original CPU result.

The porting method was to create a stand-alone driver for testing on the GPU. A kernels directive was applied to the outer I loop over columns so iterations of the outer loop are processed simultaneously. Inner loops are vectorized where possible. Some of the GPU optimizations were:

Allocation is slow on the GPU. Automatic arrays are allocated upon subroutine entry, so they are costly on the GPU. Consequently, automatic arrays were changed to arrays passed in as arguments and promoted to arrays indexed to the outer I loop so allocation happens only once.

Variables in vector loops must be private to prevent conflicts, which means allocation at the beginning of the kernel. To prevent allocation each time the I loop runs, large private arrays were promoted to arrays indexed to the outer I loop so allocation happens only once outside the kernel.

Speedup is limited by DO loops containing dependencies, which cannot be vectorized and instead run on one GPU thread. The predominant dependency type is the loop-carried dependency, which occurs when an iteration of a loop depends on values calculated in an earlier iteration. Many of these loops search for a value and then exit. There are many calls to tridiagonal solvers, which have loop-carried dependencies. After other optimizations, tridiagonal solvers use 29% of the total GPU runtime. Some value-searching loops were rearranged to allow vectorization.
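To make the loop-carried dependency in a tridiagonal solve concrete, here is a minimal sketch. It is a textbook Thomas algorithm written in Python purely for illustration, not the solver in module_bl_mynn.F90, and the names `thomas_solve` and `solve_columns` are hypothetical. Both sweeps over the level index `k` are serial, while the loop over columns has no dependency between iterations.

```python
def thomas_solve(a, b, c, d):
    """Solve one tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    The inputs are left unmodified.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: iteration k reads cp[k-1] and dp[k-1], so it cannot
    # start before iteration k-1 has finished (loop-carried dependency).
    for k in range(1, n):
        denom = b[k] - a[k] * cp[k - 1]
        cp[k] = c[k] / denom
        dp[k] = (d[k] - a[k] * dp[k - 1]) / denom
    # Back substitution: x[k] depends on x[k+1], so it is serial in k too.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = dp[k] - cp[k] * x[k + 1]
    return x


def solve_columns(columns):
    # Each column is independent of the others, so this outer loop is the
    # one that can be spread across GPU threads.
    return [thomas_solve(a, b, c, d) for (a, b, c, d) in columns]
```

With one thread per column, the work inside each column stays sequential in `k`, which is consistent with the solvers taking a large share of the GPU runtime.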
Further speedup could be achieved by restructuring more of the value-searching loops so they would vectorize. Parallel tridiagonal solvers exist but would not be bit-for-bit with the current solvers and so should be implemented in cooperation with a physics expert. As currently implemented, the routine module_bl_mynn.F90 does not appear to be a good candidate for one version running efficiently on both the GPU and CPU. Routines changed are module_bl_mynn.F90 and bl_mynn_common.f90. The stand-alone driver is not included.
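As a rough illustration of what restructuring a value-searching loop can look like, the sketch below contrasts an early-exit search with an equivalent minimum reduction. It is in Python only to show the loop structure. The function names and the threshold test are hypothetical, and the description does not say that the rearranged loops in the port take this form.

```python
def first_level_above_exit(values, threshold):
    """Early-exit search: index of the first value > threshold, or -1."""
    for k, v in enumerate(values):
        if v > threshold:
            # Whether later levels run at all depends on earlier ones.
            return k
    return -1


def first_level_above_reduce(values, threshold):
    """Same result, written as a min reduction over all levels."""
    n = len(values)
    # Every level is tested independently. Levels that fail the test
    # contribute the sentinel n, so no iteration needs the outcome of an
    # earlier one and the tests can run in parallel.
    candidates = [k if values[k] > threshold else n for k in range(n)]
    kmin = min(candidates, default=n)
    return kmin if kmin < n else -1
```

The reduction form always visits every level, so it does more work than the early-exit form on a serial processor. That is one example of how a loop shaped for the GPU can cost more on the CPU.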
This is fascinating but the timing is bad. The MYNN is (probably) within hours of being updated. We'll need to work on merging these changes into the updated version. Also, I'm a bit worried about the slowdown for CPUs if I read that right.
@joeolson42 I didn't want to say this, but yeah, this will need some updating after the MYNN stuff is merged into the NCAR authoritative, which will happen after the MYNN changes are merged into the UWM fork.
I wish someone could provide some background information about this work and describe the overall strategy for converting the entire CCPP physics package to run on GPU chips. Is this project only for making MYNN EDMF GPU compliant, or is it part of a larger project for GPU applications?
@yangfanglin, I think this was funded by a GSL DDRF (Director's Something Research Funding) project, way back when we still had Dom. It's only a small pot of money, ~$100K, for small self-contained projects. As far as I know, there is no funding for this kind of work for all of CCPP, which highlights how NOAA's patchwork funding leaves us scrambling for crumbs.
@joeolson42 thanks. I agree that NOAA needs to invest more in NWP model code development for GPU applications.
You read it right: running on the CPU, this version is 20% slower. I
understand that may not be acceptable. As it's written, this routine,
module_bl_mynn.F90, may not be well suited for one version running well on
both the GPU and CPU.
Jacques
…On Thu, Mar 23, 2023 at 9:57 PM Fanglin Yang ***@***.***> wrote:
@joeolson42 <https://github.com/joeolson42> thanks. I agree that NOAA
needs to invest more in NWP model code development for GPU applications.
@yangfanglin I can provide a little bit of background on this work. Jacques's porting of the MYNN PBL to GPUs is related to a larger effort funded by the NOAA Software Environments for Novel Architectures (SENA) program. As a part of this effort, the Thompson microphysics, the GF convective scheme, and the MYNN surface layer scheme have also been ported to GPUs. These three schemes showed notable improvement in performance on GPUs without degradation in performance on CPUs. Basically, we are targeting a full physics suite port to GPUs. This is also a collaborative project with the CCPP team, which is working on making CCPP GPU compliant to allow for comprehensive testing of a "GPU physics suite". Based on Jacques's results, work on GPU-izing the MYNN PBL scheme will have to be further evaluated, but I also think it is important to document the progress. I hope this helps.
Just a clarification: while the CCPP team thinks it is important to evolve the CCPP Framework to be able to distribute physics to both CPU and GPU, we do not currently have a project/funding to work on this. Depending on what priorities emerge from the upcoming CCPP Visioning Workshop, we may be able to pursue this actively.
Sorry @yangfanglin, I guess I was way off on my guess of the funding source. Clearly, I have not been involved in this process.
Good to know all the facts and activities, but this discussion about GPUs probably needs to move to a different place. A more coordinated effort would benefit all parties who are involved in developing and/or using the CCPP. GFDL, NASA/GSFC and DOE/E3SM are also working on converting their codes but are taking different approaches.