
Maximum Runtime? #64

Open
douglowe opened this issue Sep 4, 2019 · 18 comments

@douglowe

douglowe commented Sep 4, 2019

I'm trying to run EMEP for a whole year; however, I'm finding that my simulations silently fail after exactly 150 days (3600 hours), regardless of what my start date is. I can't find any obvious namelist option to control this. Could you tell me if there is any way to change this behaviour, or should I run the model individually for each month instead, to avoid this problem?

@gitpeterwind
Member

We do not have any parameter that directly controls how much simulated time has passed. Could it be that some other limit on your system is being reached (CPU time or disk space)?

@douglowe
Author

douglowe commented Sep 4, 2019

Okay - I'm not missing an obvious namelist option then.

Disk space we're fine for, and the (relative) point at which the simulation stops is exactly the same each time I've tested this (and I've used more CPU time for other simulations). I don't think it is anything external to EMEP, especially as there are no error messages thrown when it stops. This is all I get in the log file:
current date and time: 2017-10-28 17:20:00
current date and time: 2017-10-28 17:40:00
Nest:write data domain1_output/EMEP_OUT_20171028.nc
current date and time: 2017-10-28 18:00:00
Warning: not reading all levels 4 1 SMOIS
Warning: not reading all levels 4 3 SMOIS
current date and time: 2017-10-28 18:20:00
current date and time: 2017-10-28 18:40:00
current date and time: 2017-10-28 19:00:00
current date and time: 2017-10-28 19:20:00
current date and time: 2017-10-28 19:40:00
current date and time: 2017-10-28 20:00:00
current date and time: 2017-10-28 20:20:00
current date and time: 2017-10-28 20:40:00
Nest:write data domain1_output/EMEP_OUT_20171028.nc
current date and time: 2017-10-28 21:00:00
MetRd:reading ../wrf_meteo/EMEP_grid/wrfout_d01_2017-10-29_00
Warning: not reading all levels 4 1 SMOIS
Warning: not reading all levels 4 3 SMOIS

Are there any namelist debug flags you would recommend I switch on, to see if I can get more diagnostics for this?

@gitpeterwind
Member

It is not easy to test when this happens so far into the run. Maybe @avaldebe has some ideas?
It may be useful to compile with traceback options, so that we get the position in the code where it fails. For Intel we use these flags:
-check all -check noarg_temp_created -debug-parameters all -traceback -ftrapuv -g -fpe0 -fp-stack-check
(PS: I will anyway not have time to spend on this before October)
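As a sketch of how one might pass those flags into a build, the snippet below collects them into a variable to be appended to the compiler flags. The variable name `DEBUG_FLAGS` and the commented `make` invocation are illustrative assumptions; the actual flag variable in the EMEP Makefile (e.g. `F90FLAGS`) may differ, so check the Makefile shipped with your release.

```shell
# Hypothetical way to feed the Intel traceback/debug flags into a build.
DEBUG_FLAGS='-check all -check noarg_temp_created -debug-parameters all -traceback -ftrapuv -g -fpe0 -fp-stack-check'

# Site-specific build step (commented out; variable name is an assumption):
# make clean && make F90FLAGS="$F90FLAGS $DEBUG_FLAGS"

echo "$DEBUG_FLAGS"
```

Note the deliberate absence of `-O0`: leaving optimisation on keeps the run fast enough to reach day 150, while `-traceback` still reports the failing source location.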

@douglowe
Author

douglowe commented Sep 4, 2019

Sure thing - I'll try compiling with traceback flags, and will let you know if this throws up anything (and will work around the limitation for the moment).

@avaldebe
Collaborator

avaldebe commented Sep 4, 2019

Hi @douglowe

Please take into account that the debugging flags will make the code run considerably slower. If your problem is CPU time, you should crash earlier in the simulation.

@gitpeterwind
Member

(That is why I did not include the -O0 flag! The traceback should still work.)

@douglowe
Author

douglowe commented Sep 4, 2019

Sure thing - I'll accept the increase in compute time while I'm checking the results.

@gitpeterwind
Member

Did you find out what went wrong?

@douglowe
Author

Not yet, sorry - I've been busy with other projects since raising this issue. I should have time in October to investigate, and will let you know if I find anything.

@avaldebe
Collaborator

avaldebe commented Sep 4, 2020

@douglowe

Did you find the problem?

@douglowe
Author

douglowe commented Sep 4, 2020

I didn't, sorry. In the end I decided to run EMEP in 2-month chunks (which makes more sense operationally anyway, as we can then parallelise the work more).

@avaldebe
Collaborator

avaldebe commented Sep 4, 2020

I used to do something similar on an older version of the model. The data assimilation modules were not as well tuned as they are in our current development versions, and a whole-year run did not fit within the HPC queue limits.

At the time, I had to take care to run the chunks sequentially and create a restart file for the next chunk; otherwise PM could be too low at the beginning of each chunk.
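The chunked-run bookkeeping described above can be sketched as follows. Only the chunk start dates are computed here; the actual model invocation, namelist editing, and restart-file handoff are site-specific and omitted, and the function name `chunk_starts` is purely illustrative.

```shell
# Split a year into sequential two-month chunks; each chunk would start
# from the restart file written by the previous one.
chunk_starts() {
    year=$1
    for month in 01 03 05 07 09 11; do
        printf '%s-%s-01\n' "$year" "$month"
    done
}

# For each start date: edit the namelist, point it at the previous chunk's
# restart file, and submit the job (all omitted here).
chunk_starts 2017
```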

@mifads
Contributor

mifads commented Nov 18, 2020

I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

@mvieno

mvieno commented Nov 18, 2020

> I just saw this. Very strange problem that I have never seen or heard of before. @mvieno often runs WRF+EMEP, I think for a year at a time?

Yes, I routinely run EMEP-WRF for a year, for any domain, from global to regional. But I use a single EMEP_OUT for the full year. The only time I had a similar issue, it was related to the NetCDF library being compiled without the large-file feature switched on.

@douglowe
Author

Ahh - I had not checked the NetCDF library compilation settings. I'll have a look at these and see whether that might have been the issue.

@douglowe
Author

douglowe commented Jul 5, 2022

Revisiting this question - I'm now finding that my EMEP simulations for one domain fail after ~50 days. Previously I would run this domain for 2 months (+ 7 days spin-up). However, I changed the WRF output files I use to drive EMEP to hourly rather than 3-hourly, and now I always hit the memory limit (~192 GB) of the HPC node I'm running on. Looking back at my previous runs driven by 3-hourly data, I can see that memory usage only reached ~90 GB by the end of the 2 months. Polling the memory usage during a simulation shows that it increases steadily as the simulation progresses.

I'm running EMEP release 3.44, with some local fixes for reading emission sectors (https://github.com/UoMResearchIT/emep-ctm/tree/source_UoM_CSF3_gfortran). We are reading a lot of emission data (using the UK NAEI dataset), so perhaps this is a contributing factor to the memory usage (and maybe why you've not encountered this problem before)? Do you have any suggestions on how I might solve this? Is the best answer to use restart files and just run 1 month at a time?
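One way to poll a running job's memory, as mentioned above, is to sample the resident set size (VmRSS) from `/proc` on Linux at intervals. In the sketch below the shell samples itself (`$$`) as a stand-in for the model's PID, which is an assumption for demonstration; on the HPC node you would substitute the actual EMEP process ID (e.g. found with `pgrep`) and wrap the call in a loop with a `sleep` between samples.

```shell
# Sample the resident set size (in kB) of a given PID from /proc.
sample_rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Demo: sample this shell's own memory; for a model run, use its PID instead,
# e.g.  while sleep 60; do sample_rss_kb "$(pgrep -n emepctm)"; done
sample_rss_kb "$$"
```

Logging these samples against simulation time makes it easy to see whether memory grows linearly with days simulated, which would point at an accumulating allocation rather than a fixed working-set size.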

@gitpeterwind
Member

Yes, this is a problem we also noticed. In early releases we opened and closed the NetCDF file for each variable; that was very slow. So we kept the file open until all variables were read. That was fast, but on some systems it caused a huge amount of memory to be used when many variables (~1000) were read.
In the latest release we therefore close and reopen the file after every 200 reads(!). Primitive, but efficient (see subroutine Emis_GetCdf in EmisGet_mod.f90).

@douglowe
Author

douglowe commented Jul 6, 2022

Thanks for the suggestion - that sounds like a good solution to me. I'll have a look at the code in the latest release, and will backport this solution to my working copy.
