
Workflow for downloading input forcing files without GPU node internet access #344

Open
taimoorsohail opened this issue Feb 20, 2025 · 17 comments
Labels
data wrangling (We must feed the models so they don't get cranky) · user interface (When humans and machines miscommunicate)

Comments

@taimoorsohail
Contributor

Hi all,

I am trying to run ClimaOcean on the Gadi supercomputer in Australia, and only the login CPU node has internet access on the HPC (for security reasons).

This means that I can't run examples that require downloaded files without first manually downloading the input files, placing them in the necessary folders, and then submitting a job to the GPU or CPU nodes on the HPC. This is an OK workaround, but I realised that as others run this model, they may also be using HPC environments that don't have internet access outside of the login node.

I just wanted to flag this as a potential issue, and to discuss whether it is worth developing a workflow that avoids the need to manually download input files prior to running the model. This may be unavoidable, but I figured I would raise it! Thanks

@glwagner
Member

The files go to JULIA_DEPOT_PATH. Can you set your JULIA_DEPOT_PATH to point to a place that is accessible from the GPU nodes?

As you say, I think the way to run these cases is to initiate the simulation on CPU, but at a very coarse resolution and only running for a short amount of time.

Do you think that is an acceptable workflow? Perhaps, rather than using a simulation, we could develop a utility along the lines of download_data(model)? That would avoid the step of having to call run!(simulation).
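A minimal sketch of the depot-path suggestion, assuming a shared filesystem path that both the login node and the compute nodes can read (the path below is a placeholder, not Gadi-specific advice):

# Option 1: set the depot before Julia starts, e.g. in ~/.bashrc or the PBS script:
#   export JULIA_DEPOT_PATH=/shared/filesystem/julia_depot
# Option 2: prepend it from inside Julia, before loading any packages:
pushfirst!(DEPOT_PATH, "/shared/filesystem/julia_depot")  # placeholder path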

@taimoorsohail
Contributor Author

The issue isn't file access - the GPU nodes are able to access the depot path. The issue is that any attempt to download data (using wget, curl etc.) doesn't work because the GPU nodes don't have internet access.

Yes, the ideal workflow would be to create a standalone function that can be run on the login node (for example, I have written a bash script with wget that downloads the JRA, ECCO and Bathymetry files into the necessary folders directly from the login node). I may be overblowing the issue, but I think CPU/GPU nodes in many HPCs don't have internet access. So integrating a simple download script that can be run from the login node prior to run!(simulation) would help those running on such HPC environments.

Hope that makes sense!

@glwagner
Member

Why do you have to change the folders that the data is downloaded into?

@taimoorsohail
Contributor Author

I don't change the folders the data is downloaded into. I just manually download the data into the folders that the code would ordinarily download it into (if it had internet access).

@navidcy
Collaborator

navidcy commented Feb 20, 2025

If I understand correctly, @taimoorsohail wrote a bash script to do what the proposed download_data(model) method would do, right?

@navidcy
Collaborator

navidcy commented Feb 20, 2025

Actually, minor note: the CPU compute nodes on HPCs often don't have internet access either, so the issue is not GPU-specific.

@navidcy added the data wrangling and user interface labels on Feb 20, 2025
@glwagner
Member

glwagner commented Feb 21, 2025

If you run the same script on the login node, using the CPU architecture (with coarse resolution and, say, stop_iteration=1), does it achieve the desired effect?

@glwagner
Member

glwagner commented Feb 21, 2025

Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.

It may also be possible to hand a function all of the metadata / other objects that may be associated with data.

The challenge I think is that the data is not explicitly tied to the model. For example, we provide functionality for users to force their model with restoring to ECCO. But they need not set it up the same way every time. They could use a callback, or a forcing function. It is not rigid. So it may be hard to provide a function that is guaranteed to work. We can provide a function that makes many assumptions about a typical setup, looks for data in the typical place, etc. But at that point I am not sure we have made much progress.

A more robust strategy is to run the script that we want to run on the login node, perhaps at low resolution and for a short time. I think that should trigger all the downloads that would be needed for the simulation. It's robust because we directly use the same script that would be used for the simulation itself. It may not require much more manual intervention from the user, since using a utility function is about as difficult as changing the architecture / problem size?

Curious to hear thoughts.
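For concreteness, a hedged sketch of that strategy, with illustrative grid sizes and setup names that mirror the ClimaOcean examples (adapt to whatever the actual run script builds):

using ClimaOcean
using Oceananigans

arch = CPU()  # login node: CPU only

# Very coarse grid: the point is only to trigger the downloads, not to do science.
grid = LatitudeLongitudeGrid(arch;
                             size = (60, 30, 3),
                             longitude = (0, 360),
                             latitude = (-75, 75),
                             z = (-1000, 0))

bottom_height = regrid_bathymetry(grid)  # downloads the ETOPO file if it is missing
grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom_height))

ocean = ocean_simulation(grid)
set!(ocean.model, T=ECCOMetadata(:temperature), S=ECCOMetadata(:salinity))  # downloads ECCO

atmosphere = JRA55PrescribedAtmosphere(arch)  # downloads the JRA55 repeat-year files
radiation  = Radiation(arch)
model = OceanSeaIceModel(ocean; atmosphere, radiation)

simulation = Simulation(model; Δt=1, stop_iteration=1)
run!(simulation)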

@navidcy
Collaborator

navidcy commented Feb 21, 2025

Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.

Nobody claimed that the bash script is better than Julia (yet, right?).
But sure, the bash script could be translated to Julia easily (I think so) and this should be a good starting point for the download_data(model) method ;)

@navidcy
Collaborator

navidcy commented Feb 21, 2025

Hm... I see @glwagner's point. For the bathymetry, it should be straightforward to download the raw data before any regridding etc. happens to it. But yes, it's not until the users construct a coupled model with an atmosphere that the simulation has all the available information, right?

Perhaps providing a keyword to the appropriate methods to use data from /local/directory/this/and/that/ is more robust. But could we also have specific methods like download_raw_bathymetry_ETOPO(), download_raw_JRA55_RYF(), download_raw_ECCO()? Does this make sense?
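A hedged sketch of the common pattern such per-dataset helpers could wrap, using only the Julia standard library (the function name and arguments are hypothetical, not existing ClimaOcean API):

using Downloads

# Hypothetical shared helper for the proposed download_raw_*() methods.
function download_raw_files(urls_and_names, dir)
    mkpath(dir)
    for (filename, url) in urls_and_names
        filepath = joinpath(dir, filename)
        isfile(filepath) || Downloads.download(url, filepath)
    end
    return nothing
end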

@taimoorsohail
Contributor Author

taimoorsohail commented Feb 21, 2025

If you run the same script on the login node, using the CPU architecture (with coarse resolution and, say, stop_iteration=1), does it achieve the desired effect?

I agree with @glwagner that it makes sense to just run the same script with a coarse grid to download the necessary data. The issue, however, is that the login node has additional storage and walltime constraints. So, if I run run!(simulation) on the login node to download the necessary files, the job is killed after 15 minutes (the default walltime limit on this HPC); as a result, I get the ETOPO and ECCO data but not the raw JRA55 data. The bash script doesn't seem to hit this constraint when using wget. Not sure what the reason is...

GitHub doesn't allow uploading bash scripts, so I'll just paste it below.


#!/bin/bash
# Requires bash 4+ for associative arrays.

# Array of file URLs and their corresponding names
declare -A files=(
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.friver.1990_1991.nc"]="https://www.dropbox.com/scl/fi/21ggl4p74k4zvbf04nb67/RYF.friver.1990_1991.nc?rlkey=ny2qcjkk1cfijmwyqxsfm68fz&dl=1"
    ["RYF.prra.1990_1991.nc"]="https://www.dropbox.com/scl/fi/5icl1gbd7f5hvyn656kjq/RYF.prra.1990_1991.nc?rlkey=iifyjm4ppwyd8ztcek4dtx0k8&dl=1"
    ["RYF.prsn.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1r4ajjzb3643z93ads4x4/RYF.prsn.1990_1991.nc?rlkey=auyqpwn060cvy4w01a2yskfah&dl=1"
    ["RYF.licalvf.1990_1991.nc"]="https://www.dropbox.com/scl/fi/44nc5y27ohvif7lkvpyv0/RYF.licalvf.1990_1991.nc?rlkey=w7rqu48y2baw1efmgrnmym0jk&dl=1"
    ["RYF.huss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/66z6ymfr4ghkynizydc29/RYF.huss.1990_1991.nc?rlkey=107yq04aew8lrmfyorj68v4td&dl=1"
    ["RYF.psl.1990_1991.nc"]="https://www.dropbox.com/scl/fi/0fk332027oru1iiseykgp/RYF.psl.1990_1991.nc?rlkey=4xpr9uah741483aukok6d7ctt&dl=1"
    ["RYF.rhuss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1agwsp0lzvntuyf8bm9la/RYF.rhuss.1990_1991.nc?rlkey=8cd0vs7iy1rw58b9pc9t68gtz&dl=1"
    ["RYF.rlds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/y6r62szkirrivua5nqq61/RYF.rlds.1990_1991.nc?rlkey=wt9yq3cyrvs2rbowoirf4nkum&dl=1"
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.tas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1"
    ["RYF.uas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1"
    ["RYF.vas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/d38sflo9ddljstd5jwgml/RYF.vas.1990_1991.nc?rlkey=f9y3e57kx8xrb40gbstarf0x6&dl=1"
)

for file in "${!files[@]}"; do
    echo "Downloading $file..."
    wget -O "$file" "${files[$file]}"
done

echo "All files downloaded!"

@glwagner
Member

Always prefer pasting scripts rather than links

@glwagner
Member

I think ETOPO and ECCO are downloaded by regrid_bathymetry and set!, respectively?

@glwagner
Member

glwagner commented Feb 21, 2025

PS: I think this issue should be brought up in parallel with the admins of the HPC center. Internet access on compute nodes is likely not changeable, but login-node constraints probably are, right? 15 minutes is too short to download large files. From the info provided, it sounds like the HPC effectively requires one to use bash; I don't think using different Julia functions will solve the 15-minute issue.

@glwagner
Member

Ok, so it looks to me like one does not have to get to run! to download the JRA55 data. We only have to build the prescribed atmosphere: each field is downloaded via

isfile(filepath) || download(url, filepath; progress=download_progress)

and since

ua = JRA55_field_time_series(:eastward_velocity; kw...)
va = JRA55_field_time_series(:northward_velocity; kw...)
Ta = JRA55_field_time_series(:temperature; kw...)
qa = JRA55_field_time_series(:specific_humidity; kw...)
pa = JRA55_field_time_series(:sea_level_pressure; kw...)
Fra = JRA55_field_time_series(:rain_freshwater_flux; kw...)
Fsn = JRA55_field_time_series(:snow_freshwater_flux; kw...)
Ql = JRA55_field_time_series(:downwelling_longwave_radiation; kw...)
Qs = JRA55_field_time_series(:downwelling_shortwave_radiation; kw...)

are all constructed within JRA55PrescribedAtmosphere, I think we just need

JRA55PrescribedAtmosphere(arch, time_indices; kw...)

somewhere.

As far as I can tell, it does not matter what time_indices are. The dataset cannot be divided prior to downloading, so the entire year from 1993-1994 is downloaded no matter what.
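So, on the login node, something like the following should be enough to prefetch the JRA55 files (a sketch; the default keyword arguments are assumed to be acceptable):

using ClimaOcean

# Building the prescribed atmosphere on the CPU triggers the JRA55 downloads,
# without constructing a coupled model or calling run!.
atmosphere = JRA55PrescribedAtmosphere(CPU())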

@glwagner
Member

As for ETOPO1 data, we currently call this within regrid_bathymetry:

if !isfile(filepath)
    Downloads.download(fileurl, filepath; progress=download_progress)
end

so we just need to isolate this in a function like

function download_bathymetry_data(;
    dir = ".", # placeholder: wherever ClimaOcean expects the bathymetry file
    url = "https://www.ngdc.noaa.gov/thredds/fileServer/global/ETOPO2022/60s/60s_surface_elev_netcdf",
    filename = "ETOPO_2022_v1_60s_N90W180_surface.nc",
    progress = download_progress)

    filepath = joinpath(dir, filename)
    fileurl  = url * "/" * filename # joinpath on windows creates the wrong url
    isfile(filepath) || Downloads.download(fileurl, filepath; progress)
    return nothing
end

then users can call

using ClimaOcean.Bathymetry: download_bathymetry_data
download_bathymetry_data()

to download it.

@glwagner
Member

Finally for ECCOMetadata we have this function:

function download_dataset(metadata::ECCOMetadata; url = urls(metadata))
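A hedged usage sketch for the login node (the import path and the ECCOMetadata arguments are assumptions based on the ClimaOcean source layout; adjust them to the actual setup script):

using ClimaOcean
using ClimaOcean.DataWrangling.ECCO: ECCOMetadata, download_dataset

# Prefetch ECCO fields on the login node (illustrative choice of fields and defaults).
for name in (:temperature, :salinity)
    metadata = ECCOMetadata(name)  # assumed default dates/version
    download_dataset(metadata)
end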
