
Maximum Runtime? #64

Open
douglowe opened this issue Sep 4, 2019 · 18 comments

@douglowe

douglowe commented Sep 4, 2019

I'm trying to run EMEP for a whole year; however, I'm finding that my simulations silently fail after exactly 150 days (3600 hours), regardless of what my start date is. I can't find any obvious namelist option to control this. Could you tell me if there is any way to change this behaviour, or should I run the model individually for each month instead, to avoid this problem?

@gitpeterwind
Member

We do not have any parameter that directly controls how much simulated time has passed. Could it be that some other limit on your system is being reached (CPU time or disk space)?

@douglowe
Author

douglowe commented Sep 4, 2019

Okay - I'm not missing an obvious namelist option then.

Disk space we're fine for, and the (relative) point at which the simulation stops is exactly the same each time I've tested this (and I've used more CPU time for other simulations). I don't think it is anything external to EMEP, especially as there are no error messages thrown when it stops. This is all I get in the log file:
current date and time: 2017-10-28 17:20:00
current date and time: 2017-10-28 17:40:00
Nest:write data domain1_output/EMEP_OUT_20171028.nc
current date and time: 2017-10-28 18:00:00
Warning: not reading all levels 4 1 SMOIS
Warning: not reading all levels 4 3 SMOIS
current date and time: 2017-10-28 18:20:00
current date and time: 2017-10-28 18:40:00
current date and time: 2017-10-28 19:00:00
current date and time: 2017-10-28 19:20:00
current date and time: 2017-10-28 19:40:00
current date and time: 2017-10-28 20:00:00
current date and time: 2017-10-28 20:20:00
current date and time: 2017-10-28 20:40:00
Nest:write data domain1_output/EMEP_OUT_20171028.nc
current date and time: 2017-10-28 21:00:00
MetRd:reading ../wrf_meteo/EMEP_grid/wrfout_d01_2017-10-29_00
Warning: not reading all levels 4 1 SMOIS
Warning: not reading all levels 4 3 SMOIS

Are there any namelist debug flags you would recommend I switch on, to see if I can get more diagnostics for this?

@gitpeterwind
Member

It is not easy to test when this happens so far into the run. Maybe @avaldebe has some ideas?
It may be useful to compile with traceback options, so that we get the position in the code where it fails. For Intel we use these flags:
-check all -check noarg_temp_created -debug-parameters all -traceback -ftrapuv -g -fpe0 -fp-stack-check
(PS: I will anyway not have time to spend on this before October)
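As a sketch of how one might pass those flags into a build, the snippet below collects them into a variable to be appended to the compiler flags. The variable name `DEBUG_FLAGS` and the commented `make` invocation are illustrative assumptions; the actual flag variable in the EMEP Makefile (e.g. `F90FLAGS`) may differ, so check the Makefile shipped with your release.

```shell
# Hypothetical way to feed the Intel traceback/debug flags into a build.
DEBUG_FLAGS='-check all -check noarg_temp_created -debug-parameters all -traceback -ftrapuv -g -fpe0 -fp-stack-check'

# Site-specific build step (commented out; variable name is an assumption):
# make clean && make F90FLAGS="$F90FLAGS $DEBUG_FLAGS"

echo "$DEBUG_FLAGS"
```

Note the deliberate absence of `-O0`: leaving optimisation on keeps the run fast enough to reach day 150, while `-traceback` still reports the failing source location.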

@douglowe
Author

douglowe commented Sep 4, 2019

Sure thing - I'll try compiling with traceback flags, and will let you know if this throws up anything (and will work around the limitation for the moment).

@avaldebe
Collaborator

avaldebe commented Sep 4, 2019

Hi @douglowe

Please take into account that the debugging flags will make the code run considerably slower. If your problem is CPU time, you should crash earlier in the simulation.

@gitpeterwind
Member

(That is why I did not include the -O0 flag! The traceback should still work.)

@douglowe
Author

douglowe commented Sep 4, 2019

Sure thing - I'll accept the increase in compute time while I'm checking the results.

@gitpeterwind
Member

Did you find out what went wrong?

@douglowe
Author

Not yet, sorry - I've been busy with other projects since raising this issue. I should have time in October to investigate, and will let you know if I find anything.

@avaldebe
Collaborator

avaldebe commented Sep 4, 2020

@douglowe

Did you find the problem?

@douglowe
Author

douglowe commented Sep 4, 2020

I didn't, sorry. In the end I decided to run EMEP in 2-month chunks (which makes more sense operationally anyway, as we can then parallelise the work more).

@avaldebe
Collaborator

avaldebe commented Sep 4, 2020

I used to do something similar on an older version of the model. The data assimilation modules were not as well tuned as they are in our current development versions, and a whole-year run did not fit within the HPC queue limits.

At the time, I had to take care to run the chunks sequentially and create a restart file for the next chunk; otherwise PM could be too low at the beginning of each chunk.
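The chunked-run bookkeeping described above can be sketched as follows. Only the chunk start dates are computed here; the actual model invocation, namelist editing, and restart-file handoff are site-specific and omitted, and the function name `chunk_starts` is purely illustrative.

```shell
# Split a year into sequential two-month chunks; each chunk would start
# from the restart file written by the previous one.
chunk_starts() {
    year=$1
    for month in 01 03 05 07 09 11; do
        printf '%s-%s-01\n' "$year" "$month"
    done
}

# For each start date: edit the namelist, point it at the previous chunk's
# restart file, and submit the job (all omitted here).
chunk_starts 2017
```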

@mifads
Contributor

mifads commented Nov 18, 2020

I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

@mvieno

mvieno commented Nov 18, 2020

> I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

Yes, I routinely run EMEP-WRF for a year, for any domain, from global to regional. But I use a single EMEP_OUT for the full year. The only time I had a similar issue, it was related to the NetCDF library being compiled without the large-file feature switched on.

@douglowe
Author

Ahh - I had not checked the NetCDF library compilation settings. I'll have a look at these and see whether that might have been the issue.

@douglowe
Author

douglowe commented Jul 5, 2022

Revisiting this question - I'm now finding that my EMEP simulations for one domain fail after ~50 days. Previously I would run this domain for 2 months (+ 7 days spin-up). However, I changed the WRF output files I use to drive EMEP to hourly rather than 3-hourly, and now I always hit the memory limit (~192 GB) of the HPC node I'm running on. Looking back at my previous runs driven by 3-hourly data, I can see that memory usage only reached ~90 GB by the end of the 2 months. Polling the memory usage during a simulation shows that it increases steadily as the simulation progresses.

I'm running EMEP release 3.44, with some local fixes for reading emission sectors (https://github.com/UoMResearchIT/emep-ctm/tree/source_UoM_CSF3_gfortran). We are reading a lot of emission data (using the UK NAEI dataset), so perhaps this is a contributing factor to the memory usage (and maybe why you've not encountered this problem before)? Do you have any suggestions on how I might solve this? Is the best answer to use restart files and just run 1 month at a time?
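One way to poll a running job's memory, as mentioned above, is to sample the resident set size (VmRSS) from `/proc` on Linux at intervals. In the sketch below the shell samples itself (`$$`) as a stand-in for the model's PID, which is an assumption for demonstration; on the HPC node you would substitute the actual EMEP process ID (e.g. found with `pgrep`) and wrap the call in a loop with a `sleep` between samples.

```shell
# Sample the resident set size (in kB) of a given PID from /proc.
sample_rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Demo: sample this shell's own memory; for a model run, use its PID instead,
# e.g.  while sleep 60; do sample_rss_kb "$(pgrep -n emepctm)"; done
sample_rss_kb "$$"
```

Logging these samples against simulation time makes it easy to see whether memory grows linearly with days simulated, which would point at an accumulating allocation rather than a fixed working-set size.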

@gitpeterwind
Member

Yes, this is a problem we also noticed. In early releases we opened and closed the NetCDF file for each variable; that was very slow. So we kept the file open until all variables were read. That was fast, but on some systems it caused a huge amount of memory to be used when many variables (~1000) were read.
In the latest release we therefore close and reopen the file after every 200 reads(!). Primitive, but efficient (see subroutine Emis_GetCdf in EmisGet_mod.f90).

@douglowe
Author

douglowe commented Jul 6, 2022

Thanks for the suggestion - that sounds like a good solution to me. I'll have a look at the code in the latest release, and will backport this solution to my working copy.
