
Workflow for downloading input forcing files without GPU node internet access #344

Open
taimoorsohail opened this issue Feb 20, 2025 · 17 comments
Labels
data wrangling (We must feed the models so they don't get cranky) · user interface (When humans and machines miscommunicate)

Comments

@taimoorsohail
Contributor

Hi all,

I am trying to run ClimaOcean on the Gadi supercomputer in Australia, and only the login CPU node has internet access on the HPC (for security reasons).

This means that I can't run examples that require downloaded files without first manually downloading the input files, placing them in the necessary folders, and then submitting a job to the GPU or CPU nodes on the HPC. This is an OK workaround, but I realised that as others run this model, they may also be using HPC environments that don't have internet access outside of the login node.

I just wanted to flag this as a potential issue, and to discuss whether it is worth developing a workflow that avoids the need to manually download input files prior to running the model. This may be unavoidable, but I figured I would raise it! Thanks

@glwagner
Member

The files go to JULIA_DEPOT_PATH. Can you set your JULIA_DEPOT_PATH to point to a place that is accessible from the GPU nodes?

As you say, I think the way to run these cases is to initiate the simulation on CPU, but at a very coarse resolution and only running for a short amount of time.

Do you think that is an acceptable workflow? Perhaps, rather than using a simulation, we could develop a utility along the lines of download_data(model)? That would avoid the step of having to call run!(simulation).
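A minimal sketch of the depot-path suggestion, assuming a shared filesystem path that both the login node and the compute nodes can read (the path below is a placeholder, not Gadi-specific advice):

# Option 1: set the depot before Julia starts, e.g. in ~/.bashrc or the PBS script:
#   export JULIA_DEPOT_PATH=/shared/filesystem/julia_depot
# Option 2: prepend it from inside Julia, before loading any packages:
pushfirst!(DEPOT_PATH, "/shared/filesystem/julia_depot")  # placeholder path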

@taimoorsohail
Contributor Author

The issue isn't file access - the GPU nodes are able to access the depot path. The issue is that any attempt to download data (using wget, curl etc.) doesn't work because the GPU nodes don't have internet access.

Yes, the ideal workflow would be to create a standalone function that can be run on the login node (for example, I have written a bash script with wget that downloads the JRA, ECCO and Bathymetry files into the necessary folders directly from the login node). I may be overblowing the issue, but I think CPU/GPU nodes in many HPCs don't have internet access. So integrating a simple download script that can be run from the login node prior to run!(simulation) would help those running on such HPC environments.

Hope that makes sense!

@glwagner
Member

Why do you have to change the folders that the data is downloaded into?

@taimoorsohail
Contributor Author

I don't change the folders the data is downloaded into. I just manually download the data into the folders that the code would ordinarily download it into (if it had internet access).

@navidcy
Collaborator

navidcy commented Feb 20, 2025

If I understand correctly, @taimoorsohail wrote a bash script to do what the proposed download_data(model) method would do, right?

@navidcy
Collaborator

navidcy commented Feb 20, 2025

Actually, minor note: the CPU compute nodes on HPCs often don't have internet access either, so the issue is not GPU-specific.

@navidcy added the data wrangling and user interface labels on Feb 20, 2025
@glwagner
Member

glwagner commented Feb 21, 2025

If you run the same script on the login node, using the CPU architecture (with coarse resolution and, say, stop_iteration=1), does it achieve the desired effect?

@glwagner
Member

glwagner commented Feb 21, 2025

Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.

It may also be possible to hand a function all of the metadata / other objects that may be associated with data.

The challenge I think is that the data is not explicitly tied to the model. For example, we provide functionality for users to force their model with restoring to ECCO. But they need not set it up the same way every time. They could use a callback, or a forcing function. It is not rigid. So it may be hard to provide a function that is guaranteed to work. We can provide a function that makes many assumptions about a typical setup, looks for data in the typical place, etc. But at that point I am not sure we have made much progress.

A more robust strategy is to run the script that we want to run on the login node, perhaps at low resolution and for a short time. I think that should trigger all the downloads that would be needed for the simulation. It's robust because we directly use the same script that would be used for the simulation itself. It may not require much more manual intervention from the user, since using a utility function is about as difficult as changing the architecture / problem size?

Curious to hear thoughts.
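For concreteness, a hedged sketch of that strategy, with illustrative grid sizes and setup names that mirror the ClimaOcean examples (adapt to whatever the actual run script builds):

using ClimaOcean
using Oceananigans

arch = CPU()  # login node: CPU only

# Very coarse grid: the point is only to trigger the downloads, not to do science.
grid = LatitudeLongitudeGrid(arch;
                             size = (60, 30, 3),
                             longitude = (0, 360),
                             latitude = (-75, 75),
                             z = (-1000, 0))

bottom_height = regrid_bathymetry(grid)  # downloads the ETOPO file if it is missing
grid = ImmersedBoundaryGrid(grid, GridFittedBottom(bottom_height))

ocean = ocean_simulation(grid)
set!(ocean.model, T=ECCOMetadata(:temperature), S=ECCOMetadata(:salinity))  # downloads ECCO

atmosphere = JRA55PrescribedAtmosphere(arch)  # downloads the JRA55 repeat-year files
radiation  = Radiation(arch)
model = OceanSeaIceModel(ocean; atmosphere, radiation)

simulation = Simulation(model; Δt=1, stop_iteration=1)
run!(simulation)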

@navidcy
Collaborator

navidcy commented Feb 21, 2025

Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.

Nobody claimed that the bash script is better than Julia (yet, right?).
But sure, the bash script could be translated to Julia easily (I think so) and this should be a good starting point for the download_data(model) method ;)

@navidcy
Collaborator

navidcy commented Feb 21, 2025

Hm... I see @glwagner's point. For the bathymetry, it should be straightforward to download the raw data before any regridding etc. happens to it. But yes, it's not until the users construct a coupled model with an atmosphere that the simulation has all the available information, right?

Perhaps providing a keyword to the appropriate methods to use data from /local/directory/this/and/that/ is more robust. But could we also have specific methods like download_raw_bathymetry_ETOPO(), download_raw_JRA55_RYF(), download_raw_ECCO()? Does this make sense?
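A hedged sketch of the common pattern such per-dataset helpers could wrap, using only the Julia standard library (the function name and arguments are hypothetical, not existing ClimaOcean API):

using Downloads

# Hypothetical shared helper for the proposed download_raw_*() methods.
function download_raw_files(urls_and_names, dir)
    mkpath(dir)
    for (filename, url) in urls_and_names
        filepath = joinpath(dir, filename)
        isfile(filepath) || Downloads.download(url, filepath)
    end
    return nothing
end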

@taimoorsohail
Contributor Author

taimoorsohail commented Feb 21, 2025

If you run the same script on the login node, using the CPU architecture (with coarse resolution and, say, stop_iteration=1), does it achieve the desired effect?

I agree with @glwagner that it makes sense to just run the same script with a coarse grid to download the necessary data. The issue, however, is that the login node has additional storage and walltime constraints. So, if I run run!(simulation) on the login node to download the necessary files, the job is killed after 15 minutes (the default walltime limit on this HPC); as a result, I get the ETOPO and ECCO data but not the raw JRA55 data. The bash script doesn't seem to hit this constraint when using wget. Not sure what the reason is...

GitHub doesn't allow uploading bash scripts, so I'll just paste it below.


#!/bin/bash
# Requires bash 4+ for associative arrays.

# Array of file URLs and their corresponding names
declare -A files=(
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.friver.1990_1991.nc"]="https://www.dropbox.com/scl/fi/21ggl4p74k4zvbf04nb67/RYF.friver.1990_1991.nc?rlkey=ny2qcjkk1cfijmwyqxsfm68fz&dl=1"
    ["RYF.prra.1990_1991.nc"]="https://www.dropbox.com/scl/fi/5icl1gbd7f5hvyn656kjq/RYF.prra.1990_1991.nc?rlkey=iifyjm4ppwyd8ztcek4dtx0k8&dl=1"
    ["RYF.prsn.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1r4ajjzb3643z93ads4x4/RYF.prsn.1990_1991.nc?rlkey=auyqpwn060cvy4w01a2yskfah&dl=1"
    ["RYF.licalvf.1990_1991.nc"]="https://www.dropbox.com/scl/fi/44nc5y27ohvif7lkvpyv0/RYF.licalvf.1990_1991.nc?rlkey=w7rqu48y2baw1efmgrnmym0jk&dl=1"
    ["RYF.huss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/66z6ymfr4ghkynizydc29/RYF.huss.1990_1991.nc?rlkey=107yq04aew8lrmfyorj68v4td&dl=1"
    ["RYF.psl.1990_1991.nc"]="https://www.dropbox.com/scl/fi/0fk332027oru1iiseykgp/RYF.psl.1990_1991.nc?rlkey=4xpr9uah741483aukok6d7ctt&dl=1"
    ["RYF.rhuss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1agwsp0lzvntuyf8bm9la/RYF.rhuss.1990_1991.nc?rlkey=8cd0vs7iy1rw58b9pc9t68gtz&dl=1"
    ["RYF.rlds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/y6r62szkirrivua5nqq61/RYF.rlds.1990_1991.nc?rlkey=wt9yq3cyrvs2rbowoirf4nkum&dl=1"
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.tas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1"
    ["RYF.uas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1"
    ["RYF.vas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/d38sflo9ddljstd5jwgml/RYF.vas.1990_1991.nc?rlkey=f9y3e57kx8xrb40gbstarf0x6&dl=1"
)

for file in "${!files[@]}"; do
    echo "Downloading $file..."
    wget -O "$file" "${files[$file]}"
done

echo "All files downloaded!"

@glwagner
Member

Always prefer pasting scripts rather than links

@glwagner
Member

I think ETOPO and ECCO are downloaded by regrid_bathymetry and set!, respectively?

@glwagner
Member

glwagner commented Feb 21, 2025

PS: I think this issue should be brought up in parallel with the admins of the HPC center. Internet access on compute nodes is likely not changeable, but login-node constraints probably are, right? 15 minutes is too short to download large files. From the info provided, it sounds like the HPC effectively requires one to use bash; I don't think using different Julia functions will solve the 15-minute issue.

@glwagner
Member

Ok, so it looks to me like one does not have to get to run! to download the JRA55 data. We only have to build the prescribed atmosphere: each field is downloaded via

isfile(filepath) || download(url, filepath; progress=download_progress)

and since

ua = JRA55_field_time_series(:eastward_velocity; kw...)
va = JRA55_field_time_series(:northward_velocity; kw...)
Ta = JRA55_field_time_series(:temperature; kw...)
qa = JRA55_field_time_series(:specific_humidity; kw...)
pa = JRA55_field_time_series(:sea_level_pressure; kw...)
Fra = JRA55_field_time_series(:rain_freshwater_flux; kw...)
Fsn = JRA55_field_time_series(:snow_freshwater_flux; kw...)
Ql = JRA55_field_time_series(:downwelling_longwave_radiation; kw...)
Qs = JRA55_field_time_series(:downwelling_shortwave_radiation; kw...)

are all constructed within JRA55PrescribedAtmosphere, I think we just need

JRA55PrescribedAtmosphere(arch, time_indices; kw...)

somewhere.

As far as I can tell, it does not matter what time_indices are. The dataset cannot be divided prior to downloading, so the entire year from 1993-1994 is downloaded no matter what.
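So, on the login node, something like the following should be enough to prefetch the JRA55 files (a sketch; the default keyword arguments are assumed to be acceptable):

using ClimaOcean

# Building the prescribed atmosphere on the CPU triggers the JRA55 downloads,
# without constructing a coupled model or calling run!.
atmosphere = JRA55PrescribedAtmosphere(CPU())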

@glwagner
Member

As for ETOPO1 data, we currently call this within regrid_bathymetry:

if !isfile(filepath)
    Downloads.download(fileurl, filepath; progress=download_progress)
end

so we just need to isolate this in a function like

function download_bathymetry_data(;
    dir = ".", # placeholder: wherever ClimaOcean expects the bathymetry file
    url = "https://www.ngdc.noaa.gov/thredds/fileServer/global/ETOPO2022/60s/60s_surface_elev_netcdf",
    filename = "ETOPO_2022_v1_60s_N90W180_surface.nc",
    progress = download_progress)

    filepath = joinpath(dir, filename)
    fileurl  = url * "/" * filename # joinpath on windows creates the wrong url
    isfile(filepath) || Downloads.download(fileurl, filepath; progress)
    return nothing
end

then users can call

using ClimaOcean.Bathymetry: download_bathymetry_data
download_bathymetry_data()

to download it.

@glwagner
Member

Finally for ECCOMetadata we have this function:

function download_dataset(metadata::ECCOMetadata; url = urls(metadata))
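A hedged usage sketch for the login node (the import path and the ECCOMetadata arguments are assumptions based on the ClimaOcean source layout; adjust them to the actual setup script):

using ClimaOcean
using ClimaOcean.DataWrangling.ECCO: ECCOMetadata, download_dataset

# Prefetch ECCO fields on the login node (illustrative choice of fields and defaults).
for name in (:temperature, :salinity)
    metadata = ECCOMetadata(name)  # assumed default dates/version
    download_dataset(metadata)
end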
