
Time series processing on CHS Pangeo using Python

Note: CHS requires access to the DOI Network, so you will need to be on the TIC (or be approved by CHS to use Amazon WorkSpaces once it becomes available for general use).

  1. Sign up for access to the Pangeo JupyterHub: Fill out the Pangeo Service Request Form. For more info on the process, see the CHS Pangeo Support Page.

  2. Sign up for access to the Pangeo S3 Bucket: Fill out the Pangeo S3 Access form.

  3. After you are approved (it may take a day or two), log in at http://pangeo.chs.usgs.gov using your Active Directory credentials. A menu will ask which environment you want to run. Choose the "default environment", which already contains stglib, xarray, pandas, hvplot, and many other useful libraries. It will take a few minutes to spin up.
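
    Once your server is running, a quick sanity check in a notebook cell confirms the environment loaded correctly (a minimal sketch; all four libraries named above ship with the default environment):

    # confirm the default environment provides the expected libraries
    import stglib, xarray, pandas, hvplot
    print(xarray.__version__, pandas.__version__)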

  4. On login, you get the standard Jupyter notebook interface by default. If you want to switch to the JupyterLab interface, edit the URL, replacing the trailing "tree" with "lab".

    Example Jupyter Notebook interface: http://pangeo.chs.usgs.gov/user/username@usgs.gov/tree

    Example JupyterLab interface: http://pangeo.chs.usgs.gov/user/username@usgs.gov/lab

  5. Explore your JupyterHub environment. Open a terminal in either Jupyter interface and use bash commands like df -h and ls -lR to examine your environment. Anything in /home/jovyan persists between sessions, and you can use git or JupyterHub's drag-and-drop upload feature to bring code and small datasets into the environment. Some repos that work with the CHS Pangeo default environment, including the Pangeo Tutorial, are at https://code.chs.usgs.gov/earthmap/notebooks; feel free to clone them to your directory.

  6. Working with cloud object storage buckets (S3). The Pangeo data bucket is at s3://chs-pangeo-data-bucket, and you have read/write access under your username. Before you can use the bucket, open a terminal, install the AWS CLI, and run aws configure, supplying the AWS access key ID and secret access key given to you by CHS. After this is done, you should be able to write data with code like this:

    import fsspec
    import pandas as pd

    # read a public CSV from S3 anonymously
    infile = fsspec.open("s3://anaconda-public-datasets/iris/iris.csv", mode='rt', anon=True)
    with infile as f:
        df = pd.read_csv(f)

    # write it to the Pangeo bucket using your configured AWS profile
    outfile = fsspec.open("s3://chs-pangeo-data-bucket/rsignell/testing/iris.csv", mode='wt', profile='default')
    with outfile as f:
        df.to_csv(f)
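
    To confirm the write succeeded, you can read the file back from the bucket. A minimal sketch, using the same path as above:

    import fsspec
    import pandas as pd

    # read the CSV back from the Pangeo bucket to verify the write
    with fsspec.open("s3://chs-pangeo-data-bucket/rsignell/testing/iris.csv", mode='rt', profile='default') as f:
        df_check = pd.read_csv(f)

    print(df_check.head())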

    You can also move data with the AWS command line interface (CLI), either from the terminal on the Pangeo JupyterHub or from your local workstation (as long as you are on the DOI Network).

    For example, this should work:

    aws s3 cp 2561-A.nc s3://chs-pangeo-data-bucket/rsignell/ncfiles/2561-A.nc --profile default
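
    Once a NetCDF file like this is in the bucket, you can open it with xarray directly from S3. A minimal sketch, assuming the file is netCDF4 and the h5netcdf engine is available in the environment:

    import fsspec
    import xarray as xr

    # open the uploaded NetCDF file directly from the Pangeo bucket
    with fsspec.open("s3://chs-pangeo-data-bucket/rsignell/ncfiles/2561-A.nc", mode='rb', profile='default') as f:
        ds = xr.open_dataset(f, engine='h5netcdf').load()

    print(ds)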