
Landuse_ml problems #29

Closed
JohnJohanssonChalmers opened this issue Oct 23, 2017 · 30 comments

@JohnJohanssonChalmers

I seem to have no problem running EMEP on 1, 2 or 3 processes (using for instance mpiexec -np 3 Unimod), but when I try to use 4 processes or more, EMEP crashes with the following output:

 InitLanduse: nFluxVegs=            3
 Inputs.Landuse not found
 InitLanduse: Into CDF 
 RdLanduseCDF: Starting           2           1
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/Landuse_PS_5km_LC.nc
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/glc2000mCLM.nc
RdLanduseCDF:LANDUSE: found  1 .../Landuse_PS_5km_LC.nc
RdLanduseCDF:LANDUSE: found  2 .../glc2000mCLM.nc
 LandDefs DONE           33                       0.99989318987354636       -9.9000000000000000E+019
 CDFLAND_CODES:           32  :
CF                  DF                  NF                  BF                  TC                  
MC                  RC                  SNL                 GR                  MS                  
WE                  TU                  DE                  W                   ICE                 
U                   BARE                NDLF_EVGN_TMPT_TREE NDLF_EVGN_BORL_TREE NDLF_DECD_BORL_TREE 
BDLF_EVGN_TROP_TREE BDLF_EVGN_TMPT_TREE BDLF_DECD_TROP_TREE BDLF_DECD_TMPT_TREE BDLF_DECD_BORL_TREE 
BDLF_EVGN_SHRB      BDLF_DECD_TMPT_SHRB BDLF_DECD_BORL_SHRB C3_ARCT_GRSS        C3_NARC_GRSS        
C4_GRSS             CROP                
       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75
 lat/lon:       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75   15.000000000000000        82.750000000000000     
 STOP-ALL ERROR:       RdLanduseCDF: SumFrac Error    0   1  17   1  17      0.0000  75  46   1   1  75  46  15.00  82.75
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Any idea what is causing this?

@avaldebe
Collaborator

Is this rv4_15? What is the size of your domain?

@avaldebe
Collaborator

avaldebe commented Oct 24, 2017

@gitpeterwind

The error message comes from Landuse_ml.f90:L700

          if (  sumfrac < 0.99 .or. sumfrac > 1.01 ) then
               write(unit=errmsg,fmt="(a34,5i4,f12.4,6i4,2f7.2)") & !nb len(dtxt)=13
                 dtxt//" SumFrac Error ", me,i,j,  &
                    i_fdom(i),j_fdom(j), sumfrac, limax,  ljmax, &
                       i_fdom(1), j_fdom(1), i_fdom(limax), j_fdom(ljmax), &
                         glat(i,j), glon(i,j)
               print *, trim(errmsg)

               if(abs(sumfrac-1.0)<0.2.and.abs(glat(i,j))>89.0)then
                  write(*,*)'WARNING: ',trim(errmsg),sumfrac,glat(i,j)
               else
                   write(*,*)'lat/lon: ',trim(errmsg),glat(i,j), glon(i,j)
                 call CheckStop(errmsg)
               end if
          end if

sumfrac is derived from landuse_in on the lines previous to the error message

           do lu = 1, NLand_codes
              if ( landuse_in(i,j,lu) > 0.0 ) then

                 call GridAllocate("LANDUSE",i,j,lu,NLUMAX, &
                         index_lu, maxlufound, landuse_codes, landuse_ncodes)
   
                     landuse_data(i,j,index_lu) = &
                       landuse_data(i,j,index_lu) + 0.01 * landuse_in(i,j,lu)
               end if
               if ( DEBUG%LANDUSE>0 .and. dbgij )  &
                       write(*,"(a15,i3,f8.4,a10,i3,f8.4)") "DEBUG Landuse ",&
                          lu, landuse_in(i,j,lu), &
                           "index_lu ", index_lu, landuse_data(i,j,index_lu)
           end do ! lu
          LandCover(i,j)%ncodes  = landuse_ncodes(i,j)
          LandCover(i,j)%codes(:) = landuse_codes(i,j,:)
          LandCover(i,j)%fraction(:)  = landuse_data(i,j,:)
          sumfrac = sum( LandCover(i,j)%fraction(:) )

landuse_in is calculated from landuse_glob on Landuse_ml.f90:L600

            if(landuse_tot(i,j)< 0.99999 ) then
              landuse_in(i,j,:)= 0.0  ! Will overwrite all PS stuff
              dbgsum = 0.0

              do ilu = 1, NLand_codes
                landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )
                dbgsum = dbgsum + landuse_in(i,j,ilu)
                if ( dbgij ) then
                   write(*, "(a,i3,3es15.6,1x,a)") "F4 ", ilu, &
                      landuse_in(debug_li,debug_lj,ilu), &
                      landuse_tot(debug_li,debug_lj), dbgsum,&
                      trim(Land_Codes(ilu))
                end if
              end do

            end if ! land_tot<0.9999

This looks to me like an error on the interpolation routine that reads landuse_glob
on Landuse_ml.f90:L558.

          call ReadField_CDF(trim(fName),varname,& 
               landuse_tmp,1,interpol='conservative', &
               needed=.true.,debug_flag=.false.,UnDef=-9.9E19) 

          if ( ifile == 1 ) then
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
          else
               landuse_glob(:,:,lu) = landuse_tmp ! will merge below
          end if

What do you think?

@gitpeterwind
Member

The interpolation routines that interpolate from a lon-lat grid to a lon-lat grid are very complex, so there is always the possibility that some corner case is not handled correctly. However, we have had several cases of "landuse sumfrac errors" which all turned out to be due to wrong inputs. In any case, we would need to reproduce the error before being able to trace it back.
We would need one day of metdata and the config file used.

@mifads
Contributor

mifads commented Oct 24, 2017

Hi Folks,
just back from the dentist, but working from home ...
I can probably help here. John, can you point me to the directory being used? If I can't see what's wrong, I can upload your settings/files to the Norwegian systems for a closer look. Thanks.

@JohnJohanssonChalmers
Author

You can download my configfile and one day of metdata from here:
https://chalmersuniversity.box.com/s/eesp6kx6t5wmfar6g9zjohnz9iv67wwk

@JohnJohanssonChalmers
Author

Hi Dave! So you were also at the dentist this morning?
I'll send you an email with the directory paths on jacinth.

@mifads
Contributor

mifads commented Oct 24, 2017

A general issue is that we shouldn't need the European data at all for an Asia run, and I can try to check that out (the current code was hacked together before the summer, but should be cleaned and improved anyway). Why the code works with fewer processors but starts to fail with 4 or more sounds like a ReadField corner case, as suggested above.

@gitpeterwind
Member

gitpeterwind commented Oct 24, 2017

I tried with the metdata and the settings from John, and ran on 4 processors without problems.
I noticed that in your output

LandDefs DONE           33                       0.99989318987354636       -9.9000000000000000E+019

the last number shows that no attempt has been made to overwrite it with global data (I get zero there).

Somehow one of the two "if" tests failed:

 if  ( EuroFileFound .and. GlobFileFound ) then ! we need to merge
 if(landuse_tot(i,j)< 0.99999 ) then

I do not think it is an interpolation issue. Dave, if you can reproduce the error on Stallo, I can look into it. But maybe it would be better to try on "jacinth".
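
For illustration, here is a minimal standalone sketch (not EMEP code) of the failure mode being discussed: if the merge branch is skipped for a cell, the UnDef fill value (-9.9E19) survives in landuse_in, nothing positive gets accumulated, and sumfrac comes out as 0.0, which triggers the "SumFrac Error". The array size and values below are made up.

    program fillvalue_sumfrac
      implicit none
      real, parameter :: UnDef = -9.9e19
      real :: landuse_in(3), landuse_glob(3), landuse_tot, sumfrac

      landuse_in   = UnDef               ! Euro file had no data for this cell
      landuse_glob = [0.2, 0.3, 0.5]     ! global file does have fractions

      landuse_tot = 1.0                  ! suppose the tested value is wrong or unset
      if ( landuse_tot < 0.99999 ) then
         landuse_in = min(1.0, landuse_glob)  ! the overwrite is skipped in this case
      end if

      ! only positive fractions are accumulated, as in the GridAllocate loop
      sumfrac = sum( landuse_in, mask = landuse_in > 0.0 )
      print *, 'sumfrac =', sumfrac      ! prints 0.0 here
    end program fillvalue_sumfrac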

@JohnJohanssonChalmers
Author

Is this rv4_15? What is the size of your domain?

Yes, it's rv4_15. The domain size in this case was 150x93.

@mifads
Contributor

mifads commented Oct 24, 2017

Hi @gitpeterwind ,
I'll check on stallo or vilje first since I haven't actually used the Chalmers computers for runs yet. I am puzzled as to why John can run with 1-3 processors but not 4 or more, but I'll start by re-checking the logic of that EuroFileFound stuff.
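
As a rough, purely illustrative sketch of why the number of processes could matter at all (this is a generic block split over a 150x93 domain, not the actual EMEP decomposition): which i,j range each rank owns changes with the rank count, so whether a given rank's sub-domain happens to be fully covered by the European landuse map can differ between 3 and 4 processes.

    program split_domain
      ! generic 2-D block split of a 150x93 grid; purely illustrative
      implicit none
      integer, parameter :: gimax = 150, gjmax = 93
      integer :: nprocx, nprocy, px, py, i0, i1, j0, j1

      nprocx = 2; nprocy = 2            ! e.g. 4 MPI ranks as a 2x2 grid
      do py = 0, nprocy-1
         do px = 0, nprocx-1
            i0 = px*gimax/nprocx + 1;  i1 = (px+1)*gimax/nprocx
            j0 = py*gjmax/nprocy + 1;  j1 = (py+1)*gjmax/nprocy
            print '(a,2i2,a,4i5)', ' rank(', px, py, ') i,j range:', i0, i1, j0, j1
         end do
      end do
    end program split_domain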

@mifads
Contributor

mifads commented Oct 24, 2017

Hi again @gitpeterwind
Where is that stallo test? I just realised that my usual run.pl settings won't work for John's China domain, so I assume you have used some modrun.sh type setup?

@gitpeterwind
Member

The piece of code above with "EuroFileFound" was actually taken from an older version. The problem is either

 if(landuse_tot(i,j)< 0.99999 ) then

or landuse_glob which is wrong in

              do ilu = 1, NLand_codes
                landuse_in(i,j,ilu) = min(1.0, landuse_glob(i,j,ilu) )

@gitpeterwind
Member

~mifapw/emep/emep-mscw/run.pl

The important part is in config:

  meteo     = '/global/work/mifapw/isue29/wrfout_d01_2016-03-25_00_00_00',

@gitpeterwind
Member

gitpeterwind commented Oct 24, 2017

(and "isue29" is not a typo, but if you write issue with two "s", it will be replaced by 00... another issue!)

@JohnJohanssonChalmers
Author

As per Dave's suggestion, I tried to exclude the European data by changing:

  LandCoverInputs%MapFile   = 'DataDir/Landuse_PS_5km_LC.nc',
                              'DataDir/glc2000mCLM.nc',

to:

  LandCoverInputs%MapFile   = 'DataDir/glc2000mCLM.nc',

Now I get a totally different error that doesn't depend on the number of processes:

 InitLanduse: nFluxVegs=            3
 Inputs.Landuse not found
 InitLanduse: Into CDF 
 RdLanduseCDF: Starting           2           1
 MapFile /misc/orsbackup/backup/Photosmog_China/modeling/simulations/EMEP/input/EMEP_MSC-W_model.rv4.15.OpenSource/input/glc2000mCLM.nc
 MapFile NOTSET
RdLanduseCDF:LANDUSE: found  1 .../glc2000mCLM.nc
 STOP-ALL ERROR: RdLanduseCDF:LANDUSE: NOT found NOTSET
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

Does this mean that EMEP can't find glc2000mCLM.nc? That's strange.

I also figured that maybe there always have to be two landuse files, so I tried this:

  LandCoverInputs%MapFile   = 'DataDir/glc2000mCLM.nc',
                              'DataDir/glc2000mCLM.nc',

Now, I got this error instead (also independent of the number of processes):

Deriv:MISC SURF_ppbC_VOCYMD   VOC
 Wet deposition output: WDEP_PREC ug/m3
 Wet deposition output: WDEP_SOX mgS/m2
 Wet deposition output: WDEP_OXN mgN/m2
 Wet deposition output: WDEP_RDN mgN/m2
 Wet deposition output: WDEP_SO2 mgS/m2
 Wet deposition output: WDEP_HNO3 mgN/m2
 Derived VOC setup returns           68 vocs
    indices 
  6 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
 30 31 32 34 35 37 38 39 40 45 46 47 48 50 51 52 53 54 55 56
 57 58 68 69 70 71 72 73 74 75 76 79 80 82 84 85 86 87 88 89
 95 96 97 98 99105106107
    carbons 
  2  2  2  3  5  4  1  2  2  4  2  3  8  5 10 10 10 10  1  2
  4  2  3  4  5  2  1  2  3  5  4  4  4  1  5  5  1  4  5  5
  5  4  1  1  1  1  1  1  1  1  1  1  1  1 14  1  1  1  1  1
  1  1  1  1  1  1  1  1
 SOILNOX ispec            2

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7FD46F72DE08
#1  0x7FD46F72CF90
#2  0x7FD46EC1F4AF
#3  0x586D70 in __netcdf_ml_MOD_readfield_cdf
#4  0x43D23D in __biogenics_ml_MOD_geteurobvoc
#5  0x43EC35 in __biogenics_ml_MOD_init_bvoc
#6  0x613B00 in MAIN__ at Unimod.f90:?

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

@gitpeterwind
Member

gitpeterwind commented Oct 24, 2017 via email

It seems it is not enough to define one file. But you could try to simply link twice to the glc file

@mifads
Contributor

mifads commented Oct 24, 2017

I have the wrf meteo running on vilje now (though I had to modify $GRID to GLOB, DEGREE_DAY_FACTORS to F, USE_WRF_MET_NAMES = T, emis etc.). I had problems with ForestFires (even after copying the 2016 data from stallo to vilje), but will come back to that. I tried three variations of cpu/processor settings:

##PBS -l select=4:ncpus=32:mpiprocs=32 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
##PBS -l select=1:ncpus=16:mpiprocs=16 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048
#PBS -l select=1:ncpus=4:mpiprocs=4 -v MPI_MSGS_MAX=2097152,MPI_BUFS_PER_PROC=2048

and all worked, so I can't reproduce John's error there. Tomorrow I'll be in Chalmers and we can take a closer look.

@JohnJohanssonChalmers
Author

It seems it is not enough to define one file. But you could try to simply link twice to the glc file

Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.

@mifads
Contributor

mifads commented Oct 24, 2017

Hi @JohnJohanssonChalmers

Change Landuse_ml :

  1. Around line 483, next to

    landuse_in  = 0.0              !***  initialise  ***
    landuse_glob  = 0.0              !***  initialise  ***

  also add landuse_tot = 0.0

  2. Around line 562, change:

        if ( ifile == 1 ) then
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp

  to:

        if ( ifile == 1 ) then
           where (landuse_tmp>0.0)    !Oct2017
               landuse_in(:,:,lu) = landuse_tmp
               landuse_tot(:,:) = landuse_tot(:,:) + landuse_tmp
           end where  !Oct2017

The first change is a simple initialisation that should always be done. The second is needed since the file-1 data might be undefined for the modelling area (or individual cells), as for example when running in Asia but file-1 is European. With the where statement we simply ensure that no attempt is made to use the data, and the file-2 global data will be used instead.

The code still needs improvement, but try the above.
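
To see what the where mask does, here is a tiny self-contained sketch (made-up values, not the model code): undefined (fill-value) cells from file 1 are left at 0.0, so they can later be filled from the global (file-2) map.

    program where_mask_demo
      implicit none
      real, parameter :: UnDef = -9.9e19
      real :: landuse_tmp(4), landuse_in(4), landuse_tot(4)

      landuse_in  = 0.0                          ! the added initialisations
      landuse_tot = 0.0
      landuse_tmp = [40.0, UnDef, 60.0, UnDef]   ! % cover; UnDef = no Euro data

      where ( landuse_tmp > 0.0 )                ! as in the patched code
         landuse_in  = landuse_tmp
         landuse_tot = landuse_tot + landuse_tmp
      end where

      print *, 'landuse_in  =', landuse_in       ! 40.0  0.0  60.0  0.0
      print *, 'landuse_tot =', landuse_tot      ! 40.0  0.0  60.0  0.0
    end program where_mask_demo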

@mifads
Contributor

mifads commented Oct 24, 2017

Hi @gitpeterwind @avaldebe
I just git pushed the above Landuse_ml changes to dev.

@JohnJohanssonChalmers
Author

Great Dave! This seems to have solved it. I can now run on as many processors as I like.

But I still need to include the European landuse file to make it work. Specifying only glc2000mCLM.nc, or giving the same file twice, still gives the errors described above. Maybe this is something you want to look into too.

Also, just to check: the simulations that I did before this fix (using fewer than 4 processors) should still be OK, right? There's no reason to suspect that this bug caused any silent errors in the results?

@mifads
Contributor

mifads commented Oct 24, 2017

Hi John, good that we solved one problem anyway! Yeah, the code expects both files; that was just part of the hack done months ago. I need to re-write that one day, but probably not this week. About the earlier simulations, I am not sure. There is a danger that things will change (it is always hard to know with initialisation issues).

@mifads mifads added Solved and removed bug labels Oct 24, 2017
@mifads mifads changed the title EMEP crashing when running four processes or more Landuse_ml problems for non-European domains (EMEP crashing when running four processes or more) Oct 24, 2017
@mifads
Contributor

mifads commented Oct 24, 2017

I just changed the heading so people don't think that the model has general problems with many processors. This particular case was an Asian domain running off WRF meteorology. /Dave

@JohnJohanssonChalmers
Author

It's ok that you changed the heading, but I don't think this bug was THAT specific. I had similar problems earlier when running the code for a European domain stretching just slightly outside the grid of the European landuse file, and that problem was not limited to running on multiple processes. My solution then was to use RUNDOMAIN to trim the domain to fit inside the European landuse grid.

Missing zero-initialisation can cause really tricky bugs, because a lot of the time the memory will be all zeros anyway. You never know what might cause the bug to suddenly appear.
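
A small sketch of that point (nothing model-specific): the contents of a freshly allocated array are undefined, and fresh memory often happens to be zero-filled, so forgetting the explicit initialisation can look harmless until the memory layout changes, for example with a different number of processes.

    program init_pitfall
      implicit none
      real, allocatable :: landuse_tot(:,:)

      allocate( landuse_tot(75,46) )   ! contents are undefined at this point
      landuse_tot = 0.0                ! the explicit initialisation the fix adds

      print *, 'max after init:', maxval(landuse_tot)
    end program init_pitfall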

@mifads mifads changed the title Landuse_ml problems for non-European domains (EMEP crashing when running four processes or more) Landuse_ml problems Oct 25, 2017
@mifads
Contributor

mifads commented Oct 25, 2017

OK, point taken. The new title is now very general ;-)
So far we haven't seen any sign of problems when running with the EECCA domain, likely because the 'PS' landcover map covers that domain completely. Still, I am not 100% sure that the code was safe, and the initialisation should have been done.

@gitpeterwind
Member

gitpeterwind commented Nov 16, 2017

It seems it is not enough to define one file. But you could try to simply link twice to the glc file

Yes, that is what I did, but then I ran into some other problems. Look at the end of my last post.

The GetEuroBVOC routine in Biogenics_ml.f90 needs CF, DF, NF and BF to be defined, and those are defined in Landuse_PS_5km_LC.nc.
To avoid the error, you can write (at line 289):

ibvoc = find_index( VegName(iveg), LandDefs(:)%code )
if( ibvoc<0 ) cycle
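
As an illustration of that guard (find_index below is a stand-in written for this sketch, not the model's utility of that name), vegetation codes missing from the land-cover definitions are simply skipped instead of being used as a negative index:

    program skip_missing_codes
      implicit none
      character(len=4) :: VegName(4)   = [ 'CF  ', 'DF  ', 'NF  ', 'BF  ' ]
      character(len=4) :: LandCodes(3) = [ 'CF  ', 'W   ', 'ICE ' ]  ! no DF/NF/BF here
      integer :: iveg, ibvoc

      do iveg = 1, size(VegName)
         ibvoc = find_index( VegName(iveg), LandCodes )
         if ( ibvoc < 0 ) cycle                  ! the added guard: skip missing codes
         print *, trim(VegName(iveg)), ' found at position ', ibvoc
      end do

    contains

      integer function find_index( name, list )  ! stand-in lookup for this sketch
        character(len=*), intent(in) :: name, list(:)
        integer :: n
        find_index = -1
        do n = 1, size(list)
           if ( trim(list(n)) == trim(name) ) then
              find_index = n
              return
           end if
        end do
      end function find_index
    end program skip_missing_codes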

@avaldebe
Collaborator

This issue is labelled as solved. Why is it still open?

@gitpeterwind
Member

There are still small problems: you cannot specify only one landuse file, for example.
Also, glc2000 should be updated to glc2015 (Dave is working on it).

@gitpeterwind
Member

And while commenting on landuse improvements: Landuse_PS_5km_LC.nc takes a long time to read at fine resolutions. It is also slow when the rundomain covers a region outside Europe. A simple test could speed this up. (A temporary fix is to specify glc2000 twice in the config file.)
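
A minimal sketch of the kind of simple test meant here (assumed bounds and a hypothetical check, not the model's actual API): compare the run-domain lat/lon box against the file's bounding box and skip the expensive read when they do not overlap.

    program skip_nonoverlapping_map
      implicit none
      real :: dom_lon(2), dom_lat(2)     ! run-domain bounds (made-up values)
      real :: map_lon(2), map_lat(2)     ! file bounds, e.g. from the netCDF header

      dom_lon = [ 70.0, 140.0 ];  dom_lat = [ 15.0, 55.0 ]   ! an Asian rundomain
      map_lon = [ -30.0, 45.0 ];  map_lat = [ 30.0, 75.0 ]   ! a European-only map

      if ( dom_lon(1) > map_lon(2) .or. dom_lon(2) < map_lon(1) .or. &
           dom_lat(1) > map_lat(2) .or. dom_lat(2) < map_lat(1) ) then
         print *, 'no overlap: skip reading this landuse file'
      else
         print *, 'overlap: read and interpolate the map'
      end if
    end program skip_nonoverlapping_map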

@avaldebe avaldebe added this to the 2019 Release milestone Feb 14, 2019
@avaldebe
Collaborator

avaldebe commented Jul 2, 2019

As far as I can tell, this was addressed in rv4_32.
Please reopen if necessary.

@avaldebe avaldebe closed this as completed Jul 2, 2019