Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download ECCO files using Downloads and .netrc files #281

Merged
merged 33 commits into from
Dec 10, 2024

Conversation

simone-silvestri
Copy link
Collaborator

@simone-silvestri simone-silvestri commented Nov 28, 2024

This should solve #272

I have added some windows tests, let's see if they pass.

supersedes #183

@simone-silvestri
Copy link
Collaborator Author

It seems to pass on windows architectures

@simone-silvestri
Copy link
Collaborator Author

simone-silvestri commented Nov 28, 2024

@waywardpidgeon this should work on windows if you want to test it there

@simone-silvestri simone-silvestri changed the base branch from main to revert-to-cba1991176d2cd23b99987e5fd2ce63b194aaa31 November 28, 2024 15:39
@simone-silvestri simone-silvestri changed the base branch from revert-to-cba1991176d2cd23b99987e5fd2ce63b194aaa31 to main November 28, 2024 15:40
@simone-silvestri
Copy link
Collaborator Author

should close #179

Copy link

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 41 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (b73c67c) to head (5b65403).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/DataWrangling/ECCO/ECCO_metadata.jl 0.00% 19 Missing ⚠️
src/DataWrangling/DataWrangling.jl 0.00% 13 Missing ⚠️
src/distributed_utils.jl 0.00% 5 Missing ⚠️
src/Bathymetry.jl 0.00% 3 Missing ⚠️
src/DataWrangling/JRA55.jl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #281   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         34      34           
  Lines       1910    1921   +11     
=====================================
- Misses      1910    1921   +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

return nothing
end

# ECCO downloader
function ECCO_downloader(username, password, dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is specific about ECCO? This function looks general

Copy link
Collaborator Author

@simone-silvestri simone-silvestri Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the netrc file writes down ECCO specific information (the machine). We could generalize it easily if we think we might need this somewhere else

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have generalized the function to accept a "machine" argument

src/DataWrangling/ECCO/ECCO_metadata.jl Outdated Show resolved Hide resolved
src/DataWrangling/ECCO/ECCO_metadata.jl Outdated Show resolved Hide resolved
src/DataWrangling/ECCO/ECCO_metadata.jl Outdated Show resolved Hide resolved
test/test_downloading.jl Outdated Show resolved Hide resolved
# Write down the username and password in a .netrc file
downloader = netrc_downloader(username, password, "ecco.jpl.nasa.gov", tmp)

@distribute for metadatum in metadata # Distribute the download among ranks if MPI is initialized
Copy link
Member

@Sbozzolo Sbozzolo Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having multiple processes download the same file to the same location can lead to race conditions. It is also not the best of the etiquettes: if you run with 1000 MPI processes, this will send 1000 requests to the ECCO servers at the very same time, which would lead to high load on their end.

Copy link
Collaborator Author

@simone-silvestri simone-silvestri Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the @distribute macro distributes the download over the ranks, so in theory we should never have different ranks downloading the same file. The second comment is right though, if we have 1000 ranks and 1000 files to download we will have 1000 requests at the same time to the ECCO servers (different files though). Will this create problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the @distribute macro distributes the download over the ranks, so in theory we should never have different ranks downloading the same file.

Got it. Thanks

The second comment is right though, if we have 1000 ranks and 1000 files to download we will have 1000 requests at the same time to the ECCO servers (different files though). Will this create problems?

I don't think it's necessarily a problem, but it's something to be mindful about (mostly as a pattern to not abuse). If one is running on 1000s processes, it is probably best to pre-download the data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also a hack, why would we distributed among MPI processes? It might be better to instead distribute with tasks, but only from rank 0. That way the download is distributed and fast, but we limit the number of requests to something reasonable.


Create a downloader that uses a netrc file to authenticate with the given machine.
This downlader writes down a netrc file with the username and password for the given machine in the directory `dir`.
To avoid storing passwords in plain text, it is recommended to initialize the downloader in a temporary directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this avoid storing passwords in plain text?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, well it does but then the temporary directory with the netrc file is deleted when the do tmp; ...; end block is closed. So in theory for a finite amount of time the password is indeed stored in the filesystem. I can probably rewrite this comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory? I don't understand. The password is definitely stored in plain text right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not merely "theoretical", the design explicitly stores it in text.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, it is definitely stored in plain text

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is it a problem that .netrc is written? Isn't this the recommended way to use .netrc anyways?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the environment variables route is much cleaner, but it looks like with .netrc we are forced to write down a file. I probably wouldn't like to have a stored file with a password automatically written so I think it is nice to raise this point in the docstring.

@simone-silvestri simone-silvestri merged commit 6d3d822 into main Dec 10, 2024
20 checks passed
@simone-silvestri simone-silvestri deleted the ss/download-everywhere branch December 10, 2024 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants