handle files with different scale_factor and add_offset #212

Open
rsignell-usgs opened this issue Aug 18, 2022 · 15 comments
Comments

@rsignell-usgs (Collaborator) commented Aug 18, 2022

It would be nice to handle netcdf files with different scale_factor and add_offset.

I recently used the ECMWF API to generate a bunch of netcdf3 files (in parallel, using dask!) but unfortunately the generated files all have different scale_factor and add_offset.

Here are four files that we would like to virtually aggregate with kerchunk:

14:30 $ aws s3 ls s3://rsignellbucket1/era5_land/ --endpoint https://mghp.osn.xsede.org --no-sign-request
2022-08-18 14:28:56  192215564 conus_2019-12-01.nc
2022-08-18 14:28:58  192215560 conus_2019-12-15.nc
2022-08-18 14:28:58  192215560 conus_2019-12-29.nc
2022-08-18 14:28:59  192215560 conus_2020-01-12.nc

@rsignell-usgs (Collaborator, Author)

Thinking about this some more, I'm not sure this is possible. There is no way in the NetCDF data model (or the Zarr data model) to store metadata like add_offset and scale_factor that changes on a per-chunk basis like this, right @martindurant?

@martindurant (Member)

I believe it would require a change in zarr/numcodecs to pass context= (information about where we are in the overall array) to the codec. I've wanted this kind of thing for a while (including start/stop to be passed to the storage layer), so I suppose I should just do it, rather than go around asking people if they agree it's a good idea.
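
To make the idea concrete, here is a rough, hypothetical sketch of what such a context-aware codec might look like. Note that numcodecs does not currently pass a context argument to encode/decode, and the chunk_key field and PerChunkFixedScaleOffset class are invented for illustration:

# Hypothetical sketch: numcodecs does not currently pass `context` to codecs,
# and `chunk_key` / `PerChunkFixedScaleOffset` are invented for illustration.
import numpy as np
from numcodecs.abc import Codec

class PerChunkFixedScaleOffset(Codec):
    """Unpack short integers with a different scale/offset per chunk.

    `params` maps a chunk key like "0.1.0" to (scale_factor, add_offset),
    recorded from each source netCDF file when the references were built.
    """

    codec_id = "per_chunk_fixedscaleoffset"

    def __init__(self, params, dtype="<f4", astype="<i2"):
        self.params = params
        self.dtype = dtype
        self.astype = astype

    def encode(self, buf, context=None):
        scale, offset = self.params[context["chunk_key"]]
        arr = np.asarray(buf, dtype=self.dtype)
        return np.round((arr - offset) / scale).astype(self.astype)

    def decode(self, buf, out=None, context=None):
        # `context` carrying the chunk key is exactly the piece that would
        # need to be threaded through zarr/numcodecs for this to work.
        scale, offset = self.params[context["chunk_key"]]
        packed = np.frombuffer(buf, dtype=self.astype)
        return packed.astype(self.dtype) * scale + offset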

@rsignell-usgs (Collaborator, Author)

Go Martin Go! It certainly would be handy in this ERA5 API workflow!

@rsignell-usgs (Collaborator, Author) commented Aug 22, 2022

Just a note that I implemented this workaround to deal with a collection of packed NetCDF3 files with different scale_factor and add_offset variable attributes in each file:

After the dask worker downloads the packed NetCDF3 file, it loads that file into Xarray and computes the bitinformation using xbitinfo. The keepbits for each variable is determined based on 99% retention of real information. The dataset is then bitrounded using that information and stored as 32-bit floats in NetCDF4 before pushing to S3.
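
A rough sketch of that per-file step (file names are placeholders, and the exact xbitinfo calls may differ between versions):

# Rough sketch of the per-file workaround described above; paths are
# placeholders and the xbitinfo API may differ slightly between versions.
import xarray as xr
import xbitinfo as xb

ds = xr.open_dataset("conus_2019-12-01.nc")        # packed netCDF3 input

info = xb.get_bitinformation(ds, dim="longitude")  # bitwise information content
keepbits = xb.get_keepbits(info, inflevel=0.99)    # keep 99% of real information
ds_rounded = xb.xr_bitround(ds, keepbits)          # bitround the unpacked floats

# store as compressed 32-bit NetCDF4 before pushing to S3
encoding = {v: {"zlib": True, "complevel": 5, "dtype": "float32"}
            for v in ds_rounded.data_vars}
ds_rounded.to_netcdf("conus_2019-12-01_f32.nc", engine="netcdf4",
                     encoding=encoding)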

The resulting bitrounded 32-bit compressed NetCDF4 files are 3 times smaller than the packed 8-bit integer NetCDF3 files!

And of course MultiZarrToZarr then works nicely.
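
For reference, the aggregation step looks roughly like this (the bucket path, endpoint, and concat dimension are assumptions based on the listing above, not the exact production settings):

# Sketch of the aggregation; bucket path, endpoint and concat dim are
# illustrative, not the exact settings used here.
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

so = dict(anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org"})
fs = fsspec.filesystem("s3", **so)
urls = ["s3://" + p for p in fs.glob("s3://rsignellbucket1/era5_land/*.nc")]

refs = []
for u in urls:
    with fs.open(u) as f:
        refs.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

mzz = MultiZarrToZarr(refs, concat_dims=["time"],
                      remote_protocol="s3", remote_options=so)
combined = mzz.translate()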

@martindurant (Member)

Well, but you could have saved them as zarr in that case too, right? No one ever said netCDF3 was space efficient...

@rsignell-usgs (Collaborator, Author) commented Aug 22, 2022

I could have, but since there is no difference in cloud-based access between NetCDF4 and Zarr when using Python/ReferenceFileSystem, and NetCDF4 files are friendlier for R users, NetCDF4 seems the better choice here!
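
For anyone following along, the Python/ReferenceFileSystem access referred to looks roughly like this ("combined.json" and the storage options are placeholders for the kerchunk output above):

# Minimal sketch of the ReferenceFileSystem access mentioned above;
# "combined.json" and the storage options are placeholders.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)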

@martindurant (Member)

That is an interesting argument that maybe should be said more loudly somewhere.

  • How is the netCDF4 experience in R?
    • Can you load from remote?
    • Can you go parallel/distributed?
  • Has no one tried to do Zarr in R? zaRRR? It is not listed in the "known implementations" over at zarr, but there are Julia, JS, Java...

@rsignell-usgs (Collaborator, Author) commented Aug 22, 2022

  • NetCDF: I spoke to a few R users, and it seems they mostly download NetCDF files and access them with RNetCDF, which they say pretty much mimics the C-library API. If the data is NetCDF that can be understood by GDAL, then there are some remote subsetting options for S3, since there are R wrappers for GDAL.

  • Zarr: A few attempts; see a recent response to a Zarr-in-R query.

@mikejohnson51 commented Aug 23, 2022

Hi all, @dblodgett-usgs asked me to consider this from an R perspective. For this type of task (lazy read, access, manipulation) I rely on terra and its GDAL bindings.

The TL;DR is that R doesn't have good (any?) Zarr support, but it does tackle the above challenges pretty well for NetCDF/COG/TIFF/STAC/THREDDS-type data. I believe GDAL handles the unpacking of the data, so the aggregation examples below would apply to that as well.

Below are some brief examples using the above datasets and the public NWM forcings. The workflow relies on prepending the /vsis3/ (or /vsicurl/) prefix to data URLs.

ERA5-land data

# load libraries
library(terra);library(dplyr)
# Here is the link to an s3 file appended with the GDAL vsis3 prefix
url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'

# Lazy Read
(d = rast(url))
#> Warning: [rast] unknown extent

At this stage we don't have data, just structure.

With it we can look at layer names, dimensions and spatial attributes:

#> class       : SpatRaster 
#> dimensions  : 721, 1440, 744  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 1, 0, 1  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : air_pressure_at_mean_sea_level.nc 
#> names       : air_p~vel_1, air_p~vel_2, air_p~vel_3, air_p~vel_4, air_p~vel_5, air_p~vel_6, ...

# Look at layer count and names
nlyr(d)
#> [1] 744

# Look at "third" diminsion
(head(names(d)))
#> [1] "air_pressure_at_mean_sea_level_1" "air_pressure_at_mean_sea_level_2"
#> [3] "air_pressure_at_mean_sea_level_3" "air_pressure_at_mean_sea_level_4"
#> [5] "air_pressure_at_mean_sea_level_5" "air_pressure_at_mean_sea_level_6"

# Look at spatial metadata
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Slicing through "time" by layer number or name

Data is only fetched on the call to plot

plot(d[[3]])

With no spatial information we can only subset on the unit grid

Here data is only fetched for the lower-left quadrant on the call to crop

cr = crop(d[[50]], ext(0,.5,0,.5)) 

#> class       : SpatRaster 
#> dimensions  : 361, 720, 1  (nrow, ncol, nlyr)
#> resolution  : 0.0006944444, 0.001386963  (x, y)
#> extent      : 0, 0.5, 0, 0.5006935  (xmin, xmax, ymin, ymax)
#> coord. ref. :  
#> source      : memory 
#> name        : air_pressure_at_mean_sea_level_50 
#> min value   :                          96877.31 
#> max value   :                         103217.8

plot(cr)

We can add the spatial metadata if known:

If we do so, then we can subset across multiple dimensions, say for the state of California at interval "40":

ext(d) = c(-180,180,-90,90)
crs(d) = "EPSG:4326"

AOI = AOI::aoi_get(state = "CA") %>% 
  vect() %>% 
  project(crs(d))

plot(crop(d[[40]], AOI))
plot(AOI, add = TRUE)

Aggregating Files

Let's say we have two files, for January and February of 2020, that we want to read as a collection. We can define those URLs and treat them in aggregate.

Remembering that the first of those had 744 layers above, we will read them together and plot layers 744-745:

url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'
url2 = '/vsis3/era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc'

xx = rast(c(url, url2))
#> Warning: [rast] unknown extent
#> unknown extent

nlyr(xx)
#> [1] 1440

plot(xx[[744:745]])

Created on 2022-08-22 by the reprex package (v2.0.1)

NWM forcing example:

Start with a URL endpoint and open a connection:

## URL
u = '/vsis3/noaa-nwm-retrospective-2-1-pds/forcing/1979/197902010000.LDASIN_DOMAIN1'

## Lazy Read
d = rast(u)
#> Warning: [rast] unknown extent

Explore layer slices and spatial properties

nlyr(d)
#> [1] 9
(names(d))
#> [1] "LQFRAC"   "LWDOWN"   "PSFC"     "Q2D"      "RAINRATE" "SWDOWN"   "T2D"     
#> [8] "U2D"      "V2D"
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""

Assign missing spatial metadata:

ext(d) = ext(
  -2303999.62876143,
  2304000.37123857,
  -1920000.70008381,
  1919999.29991619
)
crs(d) = 'PROJCS["Sphere_Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DATUM["D_Sphere",SPHEROID["Sphere",6370000.0,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["false_easting",0.0],PARAMETER["false_northing",0.0],PARAMETER["central_meridian",-97.0],PARAMETER["standard_parallel_1",30.0],PARAMETER["standard_parallel_2",60.0],PARAMETER["latitude_of_origin",40.000008],UNIT["Meter",1.0]];-35691800 -29075200 126180232.640845;-100000 10000;-100000 10000;0.001;0.001;0.001;IsHighPrecision'


# Plot (fetch data)
plot(d$LWDOWN)

That's not right!

The data is stored in reverse (top-down in GDAL), so we can flip it and plot:

plot(flip(d$LWDOWN))

plot(flip(crop(d$RAINRATE, project(AOI, crs(d)))))

[Screenshot of the resulting plots]

So far, we have wrapped a lot of this logic in the package opendap.catalog (name changing soon) to facilitate the discovery of data files, the automation of metadata generation, and the subsetting/aggregating of files.

We have approached this by developing a catalog (https://mikejohnson51.github.io/climateR-catalogs/catalog.json) that stores the spatiotemporal and attribute metadata of discoverable assets, and then helps overcome the oddities of URL definition, applying the right spatial metadata, implementing transformations (e.g. flip), and slicing based on space and time. So far it's proven to be a very useful pattern for our team.

If anyone would like to talk about this kind of stuff I would love it!

Thanks,

Mike

Created on 2022-08-22 by the reprex package (v2.0.1)

@rsignell-usgs (Collaborator, Author)

@mikejohnson51 this is awesome! Definitely should be a blog post somewhere!

@rsignell-usgs (Collaborator, Author)

I showed this to a colleague and he mentioned https://r-spatial.github.io/stars/ as the "xarray" of R. @mikejohnson51 are you familiar with that one?

@dblodgett-usgs

@rsignell-usgs -- I've been contributing some stuff to stars to help automate reading of CF-compliant sources (see https://r-spatial.github.io/stars/reference/read_ncdf.html and https://r-spatial.github.io/stars/articles/stars8.html). Meanwhile, the GDAL multidimensional API has come along, and my contributions are probably not that needed anymore, since stars is primarily meant to be a wrapper over GDAL.

@ivirshup (Contributor) commented Sep 1, 2022

Has no one tried to do Zarr in R? zaRRR? It is not listed in the "known implementations" over at zarr, but there are Julia, JS, Java...

I believe Bioconductor is doing some work on Zarr support in R for OME-NGFF microscopy data. @MSanKeys963 may know more about how that's going.

@martindurant (Member)

An alternative way to achieve this occurred to me. If we allow preffs's model of multiple references per key, read and concatenated on load, then we can add any absolute value into the reference set:

{
   "variable/0.1.0": [b'\x00\x00\x80?\x00\x00\x00\x00', [url, offset, size]],
   ...
}

for scale 1, offset 0 as float32. The tricky part is that the concatenated buffer would normally be passed to a decompressor before the scale/offset codecs, but not in the case of netCDF3 (which is uncompressed internally). So all we have to do in this case is make a modified scale/offset codec that takes its params from the buffer, something like what the JSON codec does.
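
A minimal sketch of such a buffer-header codec for the uncompressed netCDF3 case might look like the following; the 8-byte little-endian float32 header matches the example reference above, while the int16 packing and the class itself are assumptions:

# Sketch only: reads (scale_factor, add_offset) from an 8-byte float32 header
# prepended via the references, then unpacks the rest; int16 packing and byte
# order are assumptions.
import numpy as np
from numcodecs.abc import Codec

class HeaderScaleOffset(Codec):
    codec_id = "header_scale_offset"

    def __init__(self, astype="<i2", dtype="<f4"):
        self.astype = astype  # packed on-disk dtype
        self.dtype = dtype    # unpacked in-memory dtype

    def encode(self, buf):
        raise NotImplementedError("read-only; the references supply the header")

    def decode(self, buf, out=None):
        buf = memoryview(buf)
        scale, offset = np.frombuffer(buf[:8], dtype="<f4")
        packed = np.frombuffer(buf[8:], dtype=self.astype)
        # `out` handling omitted in this sketch
        return packed.astype(self.dtype) * scale + offset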

For the general case where there is compression, we would need to make a compression wrapper codec that extracts the values, decompresses the rest, and passes the values back as a tuple (or remakes a combined buffer).

@rabernat, this looks a lot like your idea of including extra per-chunk information in the storage-layer metadata.

@alex-s-gardner

+1 for supporting changing scale/offset.
