Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing divide-by-zero runtime error when compiled with oneAPI #33

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

manodeep
Copy link

@manodeep manodeep commented Mar 11, 2025

When compiled with oneAPI, @dougiesquire found that there was a divide-by-zero - as noted in this comment

Numerically this is the appropriate fix, but someone with knowledge of the code should confirm/clarify why the other check was there in the first place (as in, why %kmt > 0 check was being used instead of %ht > 0 when calculating the reciprocal of %ht)

@manodeep manodeep changed the title Fixing divide-by-zero error when compiled with oneAPI Fixing divide-by-zero runtime error when compiled with oneAPI Mar 11, 2025
@manodeep manodeep self-assigned this Mar 11, 2025
@manodeep manodeep requested a review from dougiesquire March 11, 2025 01:51
@manodeep
Copy link
Author

Fixes #34

@dougiesquire
Copy link

I've tested this change with the ACCESS-OM2 1deg_jra55_ryf configuration and it doesn't change answers.

However, I'd like to understand how this division by zero comes about at all? I've output the kmt and ht fields using the write_topog namelist option. There are no locations in these outputs where kmt > 0 and ht is close to zero. The minimum ht where kmt > 0 is 11.8 m, which is consistent with what is reported in the logs:

>>> import xarray as xr
>>>
>>> ds = xr.open_dataset(
...     "/g/data/tm70/ds0092/model/config/om2_1deg_jra55_ryf/archive/output000/ocean/ocean_grid_check.nc"
... )
>>> kmt = ds["kmt"]
>>> ht = ds["ht"]
>>>
>>> # Mask unprocessed blocks
>>> kmt = kmt.where(ht < 1e36)
>>>
>>> ht.where(kmt > 0).min().item()
11.805749893188477

@manodeep
Copy link
Author

@dougiesquire and I just had a zoom meeting and we are both confused about what might be going on.

To summarise:

  • With classic ifort, there is no divide-by-zero error
  • With oneAPI, there is a divide-by-zero without this patch
  • With this patch and oneAPI, the divide-by-zero disappears. However, in the code block immediately following my patch, there is a bit to determine the overall min-depth and then prints that min-depth. Searching for shallowest in the log file shows that the min-depth across all PEs is ~11.805 that Dougie noted above.

The confusion is because for a divide-by-zero to occur, %kmt > 0 and %ht==0.0 for at least one pair of i,j. Then, with the patch applied, the min-depth should have been 0.0 (based on the %ht values) instead of the ~11.8 noted.

  ! find shallowest water and provide caveat for overly shallow regions
  imin= 0 ; jmin=0 ; kmin=0 ; min_depth = 1.e10
  do j=jsc,jec
    do i=isc,iec
      if(Grid%kmt(i,j) > 0 .and. Grid%ht(i,j) < min_depth) then 
        imin      = i
        jmin      = j
        kmin      = Grid%kmt(i,j)
        min_depth = Grid%ht(i,j)
      endif 
    enddo
  enddo 
  fudge      = 1 + 1.e-12*mpp_pe() ! to distinguish processors when min_depth is independent of processor
  min_depth  = min_depth*fudge 
  min_depth0 = min_depth
  call mpp_min(min_depth)  


  if(min_depth0 == min_depth) then 
      write(writeunit,'(/a,f12.5)')' The shallowest wet ocean model grid cell has depth (meters) ', min_depth  
      write(writeunit,'(a,i3,a,i3,a,i3,a)')' and this occurs at (i,j,k) = (',&
                      imin+Domain%ioff,',',jmin+Domain%joff ,',',kmin,')'  
      write(writeunit,'(a,f10.4,a,f10.4,a,f12.5,a/)')' which has (long,lat,depth) =  (' &
           ,Grid%xt(imin,jmin),',',Grid%yt(imin,jmin), ',', Grid%ht(imin,jmin),')'
      if(min_depth < 50.0) then 
          write(writeunit,'(a)')' Beware that shallow regions (e.g., those shallower than 50m) may be subject' 
          write(writeunit,'(a)')' to numerical problems if strong surface forcing is not mixed vertically.' 
          write(writeunit,'(a)')' Such problems may occur especially in shallow regions with kmt==2.'
          write(writeunit,'(a)')' Current speeds and/or tracer deviations may become large due to the deposition'
          write(writeunit,'(a)')' of wind and/or buoyancy over just a small upper ocean region. Such problems'
          write(writeunit,'(a)')' can be resolved by adding sufficient vertical mixing in these regions.'
          write(writeunit,'(a)')' Such happens in Nature due to tides and breaking surface waves.'
      endif
  endif

@dougiesquire
Copy link

Okay, this is pretty odd. As we discussed, I added some print statements to the offending code to explicitly check the value of Grid%ht as follows:

      do i=isd,ied
         if(Grid%kmt(i,j) > 0) then
+            if(Grid%ht(i,j) < 0.1) then
+               write(writeunit,'(a,f10.4,a,f10.4,a)')' Grid%ht < 0.1 at (lon,lat)=(',Grid%xt(i,j),',',Grid%yt(i,j),')'
+               write(writeunit,'(a,i4,a,f10.4/)')' Grid%kmt = ',Grid%kmt(i,j),', Grid%ht = ',Grid%ht(i,j)
+            endif
             Grid%htr(i,j) = 1.0/Grid%ht(i,j)
         endif
      enddo

Nothing is printed by these print statements when the model is run, but the stacktrace for the divide by zero error changes when they are included:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000154D9CD28D10  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    000000000072B1A4  thickness_initial        1362  ocean_thickness.F90
fms_ACCESS-OM.x    00000000007114A1  ocean_thickness_i         630  ocean_thickness.F90
fms_ACCESS-OM.x    000000000045DDB2  ocean_model_init         1269  ocean_model.F90
fms_ACCESS-OM.x    00000000004317FC  main                      366  ocean_solo.F90
fms_ACCESS-OM.x    0000000000412A0D  Unknown               Unknown  Unknown
libc-2.28.so       0000154D9C97A7E5  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM.x    000000000041292E  Unknown               Unknown  Unknown

The new stacktrace refers to another section of code where division by zero could occur. Again, I add print statements as follows:

     do i=isd,ied
        if(Grid%kmt(i,j) > 1) then
+           if(Thickness%pbot0(i,j) < 0.1) then
+               write(unit,'(a,f10.4,a,f10.4,a)')' Thickness%pbot0 < 0.1 at (lon,lat)=(',Grid%xt(i,j),',',Grid%yt(i,j),')'
+               write(unit,'(a,i4,a,f10.4/)')' Grid%kmt = ',Grid%kmt(i,j),', Thickness%pbot0 = ',Thickness%pbot0(i,j)
+           endif
            Thickness%pbot0r(i,j) = 1.0/Thickness%pbot0(i,j)
        endif
     enddo

Again, nothing is printed by these print statements but the stacktrace changes yet again to refer to another section where division by zero could occur:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.28.s  00001465706B6D10  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    00000000007358EF  ocean_thickness_i        1527  ocean_thickness.F90
fms_ACCESS-OM.x    000000000045E1C4  ocean_model_init         1301  ocean_model.F90
fms_ACCESS-OM.x    0000000000422EDC  main                      366  ocean_solo.F90
fms_ACCESS-OM.x    0000000000412A0D  Unknown               Unknown  Unknown
libc-2.28.so       00001465703087E5  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM.x    000000000041292E  Unknown               Unknown  Unknown

Again, I added print statements to the offending code as follows:

              do i=isd,ied
                 if(Grid%kmt(i,j) > 1) then
+                    if(Thickness%pbot0(i,j) < 0.1) then
+                        write(unit,'(a,f10.4,a,f10.4,a)')' Thickness%pbot0 < 0.1 at (lon,lat)=(',Grid%xt(i,j),',',Grid%yt(i,j),')'
+                        write(unit,'(a,i4,a,f10.4/)')' Grid%kmt = ',Grid%kmt(i,j),', Thickness%pbot0 = ',Thickness%pbot0(i,j)
+                    endif
                     Thickness%pbot0r(i,j) = 1.0/Thickness%pbot0(i,j)
                 endif
              endd

And we now get a floating invalid error somewhere different:

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000149F10860D10  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    0000000000C4B4DE  ocean_vert_tidal_         817  ocean_vert_tidal.F90
fms_ACCESS-OM.x    000000000047772B  ocean_vert_mix_in         889  ocean_vert_mix.F90
fms_ACCESS-OM.x    000000000045E3FB  ocean_model_init         1322  ocean_model.F90
fms_ACCESS-OM.x    0000000000422EDC  main                      366  ocean_solo.F90
fms_ACCESS-OM.x    0000000000412A0D  Unknown               Unknown  Unknown
libc-2.28.so       0000149F104B27E5  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM.x    000000000041292E  Unknown               Unknown  Unknown

Are we dealing with a compiler bug?

@dougiesquire
Copy link

When (unmodified) MOM5 is compiled with-O0 (via the +deterministic spack variant) we no longer get the floating divide by zero errors, but the floating invalid error is still thrown.

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
libpthread-2.28.s  000014A638B4ED10  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    0000000000C1E2FE  ocean_vert_tidal_         817  ocean_vert_tidal.F90
fms_ACCESS-OM.x    0000000000475A8B  ocean_vert_mix_in         889  ocean_vert_mix.F90
fms_ACCESS-OM.x    000000000045C796  Unknown               Unknown  Unknown
fms_ACCESS-OM.x    000000000042F67C  main                      366  ocean_solo.F90
fms_ACCESS-OM.x    000000000041088D  Unknown               Unknown  Unknown
libc-2.28.so       000014A6387A07E5  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM.x    00000000004107AE  Unknown               Unknown  Unknown

The offending line also divides by Grid%ht:

                  temporary = wrk1_2d(i,j)/(epsln+Grd%ht(i,j))

I'm starting to suspect we may have some bad value in ht.

@manodeep
Copy link
Author

@dougiesquire NaN's?

@dougiesquire
Copy link

Yeah exactly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants