
Reading in a large number of reader files: memory limit #1245

Open
calquigs opened this issue Mar 6, 2024 · 4 comments

@calquigs (Contributor) commented Mar 6, 2024

I am working with SCHISM model files that contain a single time step each. At the moment I am reading in two months' worth of files using:

from opendrift.readers import reader_schism_native

data_path0 = '/<PATH>/schout_*.nc'
reader0 = reader_schism_native.Reader(data_path0, proj4='+proj=utm +zone=4 +ellps=WGS84 +datum=WGS84 +units=m +no_defs')

However, that kills the run due to exceeding the memory limit. Each timestep/model file is 270 MB, so is creating the reader attempting to allocate roughly 388 GB of memory? Is there a better way to create the readers so that only one timestep is accessed at a time?

@knutfrode (Collaborator)

In this case, the dataset is opened with xarray's open_mfdataset:
https://github.com/OpenDrift/opendrift/blob/master/opendrift/readers/reader_schism_native.py#L113
Maybe there is a memory leak there?

In the generic reader, some more options are provided to open_mfdataset:
https://github.com/OpenDrift/opendrift/blob/master/opendrift/readers/reader_netCDF_CF_generic.py#L100
Could you try whether any of these options solve the problem?
I do not have any SCHISM files available for testing.
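
For reference, a minimal sketch of the kind of lazy, chunked open_mfdataset call one could experiment with for the SCHISM files. The keyword choices below are standard xarray options and an assumption on my part, not necessarily the exact ones used in reader_netCDF_CF_generic.py:

import xarray as xr

# Hypothetical sketch: open the many single-timestep files lazily, keeping
# one dask chunk per time step so data is only loaded when it is requested.
ds = xr.open_mfdataset(
    '/<PATH>/schout_*.nc',
    chunks={'time': 1},    # one lazy chunk per timestep
    data_vars='minimal',   # only concatenate variables that have a time dimension
    coords='minimal',
    compat='override',     # take shared coordinates from the first file
    parallel=True,         # open the files in parallel via dask
)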

@calquigs (Contributor, Author)

I've tried adding those arguments and am still getting the same issue. To confirm: is the intended behavior to read the files in as needed, or does the simulation need to hold all the reader files in memory at once?

@calquigs (Contributor, Author)

Update: reading in 2000 hourly timesteps with 'schout_*.nc' still gets killed by the memory limit, but if I split the files across multiple readers covering smaller chunks of between 100 and 1000 files each (e.g. 'schout_??.nc', 'schout_???.nc', 'schout_1???.nc'), the memory limit is not reached and I'm able to successfully complete a simulation (see the sketch below)! It takes 20+ minutes to read in, though; does that seem reasonable for this amount of data?
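
A minimal sketch of this workaround, assuming an already-configured OpenDrift simulation object (here called o) and the glob patterns above; adapt the patterns to your own file numbering:

from opendrift.readers import reader_schism_native

proj4 = '+proj=utm +zone=4 +ellps=WGS84 +datum=WGS84 +units=m +no_defs'

# Glob patterns from the comment above, each matching roughly 100-1000 files,
# so that no single reader has to open all ~2000 files at once.
patterns = ['/<PATH>/schout_??.nc',
            '/<PATH>/schout_???.nc',
            '/<PATH>/schout_1???.nc']

readers = [reader_schism_native.Reader(p, proj4=proj4) for p in patterns]

# 'o' is assumed to be an existing simulation object (e.g. OceanDrift);
# add_reader also accepts a list of readers.
o.add_reader(readers)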

@knutfrode (Collaborator)

See this related issue: #1241 (comment)

So you could also try to install h5netcdf with conda install h5netcdf
and add engine="h5netcdf" to open_mfdataset in the SCHISM reader.
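
A minimal sketch of what that call could look like, reusing the glob pattern from the original report; the chunks argument is my own assumption to keep the read lazy and is not part of the suggestion:

import xarray as xr

# engine='h5netcdf' requires the h5netcdf package (conda install h5netcdf);
# chunks={'time': 1} (an assumption) keeps one lazy chunk per timestep.
ds = xr.open_mfdataset('/<PATH>/schout_*.nc',
                       engine='h5netcdf',
                       chunks={'time': 1})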
