handle files with different scale_factor and add_offset #212
Thinking about this some more, I'm not sure this is possible. There is no way in the NetCDF data model (or Zarr data model) to store metadata like scale_factor and add_offset on a per-chunk basis.
I believe it would require a change in zarr/numcodecs to pass per-chunk parameters to the scale/offset codec.
Go Martin Go! It certainly would be handy in this ERA5 API workflow!
Just a note that I implemented this workaround to deal with a collection of packed NetCDF3 files with different scale_factor and add_offset values. After the dask worker downloads the packed NetCDF3 file, it loads that file into Xarray and computes the bitinformation using xbitinfo. The resulting bitrounded 32-bit compressed NetCDF4 files are 3 times smaller than the packed 8-bit integer NetCDF3 files! And of course then kerchunk can aggregate the converted files without running into the scale_factor/add_offset problem.
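For anyone wanting to adapt this, here is a rough Python sketch of that per-file conversion, assuming xbitinfo's documented get_bitinformation / get_keepbits / xr_bitround functions; the file paths, the bitinformation dimension, and the 99% information level are placeholders, not the exact settings used above.

```python
import xarray as xr
import xbitinfo as xb

def repack_file(src="packed_netcdf3.nc", dst="bitrounded_float32.nc"):
    # Decoding applies this file's own scale_factor/add_offset, yielding floats
    ds = xr.open_dataset(src, mask_and_scale=True)

    # Estimate how many mantissa bits carry real information, then bitround
    info = xb.get_bitinformation(ds, dim="lon")        # "lon" is a placeholder dim
    keepbits = xb.get_keepbits(info, inflevel=0.99)    # placeholder information level
    rounded = xb.xr_bitround(ds, keepbits)

    # Write float32 NetCDF4 with zlib compression; no packing metadata remains,
    # so the outputs can be kerchunked and aggregated uniformly
    encoding = {v: {"zlib": True, "complevel": 4, "dtype": "float32"}
                for v in rounded.data_vars}
    rounded.to_netcdf(dst, engine="netcdf4", encoding=encoding)
```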
Well, but you could have saved them as zarr in that case too, right? No one ever said netCDF3 was space efficient...
I could have, but since there is no difference in cloud-based access between NetCDF4 and Zarr when using Python/referenceFileSystem, and NetCDF4 files are more friendly for R users, NetCDF4 seems the better choice here!
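For context, a minimal sketch of why the access pattern is identical: once a kerchunk reference file exists, xarray opens it through the zarr engine regardless of whether the referenced chunks live in NetCDF4 files or Zarr objects. The reference file name and the remote settings below are placeholders.

```python
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",        # placeholder kerchunk reference set
            "remote_protocol": "s3",      # where the referenced bytes live
            "remote_options": {"anon": True},
        },
    },
)
```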
That is an interesting argument that maybe should be said more loudly somewhere.
Hi all - @dblodgett-usgs asked me to consider this from an R perspective. For this type of task (lazy read, access, manipulation) I rely on terra. The TL/DR is that R doesn't have good (any?) zarr support but does tackle the above challenges pretty well for NetCDF/COG/TIF/STAC/THREDDS type data. I believe that GDAL handles the unpacking of the data, so the example below of aggregations would apply to that. Below are some brief examples using the above datasets and the public NWM forcings. The workflow relies on appending the vsis3 (or vsicurl) prefixes to data URLs.

ERA5-land data

# load libraries
library(terra);library(dplyr)
# Here is the link to an s3 file appended with the GDAL vsis3 prefix
url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'
# Lazy Read
(d = rast(url))
#> Warning: [rast] unknown extent

At this stage we don't have data, just structure. With it we can look at layer names, dimensions and spatial attributes:

#> class       : SpatRaster
#> dimensions : 721, 1440, 744 (nrow, ncol, nlyr)
#> resolution : 0.0006944444, 0.001386963 (x, y)
#> extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source : air_pressure_at_mean_sea_level.nc
#> names : air_p~vel_1, air_p~vel_2, air_p~vel_3, air_p~vel_4, air_p~vel_5, air_p~vel_6, ...
# Look at layer count and names
nlyr(d)
#> [1] 744
# Look at "third" diminsion
(head(names(d)))
#> [1] "air_pressure_at_mean_sea_level_1" "air_pressure_at_mean_sea_level_2"
#> [3] "air_pressure_at_mean_sea_level_3" "air_pressure_at_mean_sea_level_4"
#> [5] "air_pressure_at_mean_sea_level_5" "air_pressure_at_mean_sea_level_6"
# Look at spatial metadata
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] ""
Slicing through "time" by layer number or name

Data is only fetched on the call to plot:

plot(d[[3]])

With no spatial information we can only subset on the unit grid

Here data is only fetched for the lower left quadrant on the call to crop:

cr = crop(d[[50]], ext(0,.5,0,.5))
#> class : SpatRaster
#> dimensions : 361, 720, 1 (nrow, ncol, nlyr)
#> resolution : 0.0006944444, 0.001386963 (x, y)
#> extent : 0, 0.5, 0, 0.5006935 (xmin, xmax, ymin, ymax)
#> coord. ref. :
#> source : memory
#> name : air_pressure_at_mean_sea_level_50
#> min value : 96877.31
#> max value : 103217.8
plot(cr)

We can add the spatial metadata if known:

If we do so, then we can subset across multiple dimensions - say for the state of California at interval "40":

ext(d) = c(-180,180,-90,90)
crs(d) = "EPSG:4326"
AOI = AOI::aoi_get(state = "CA") %>%
vect() %>%
project(crs(d))
plot(crop(d[[40]], AOI))
plot(AOI, add = TRUE)

Aggregating Files

Let's say we have 2 files for January and February of 2020 that we want to read as a collection. We can define those urls and treat them in aggregate. Remembering that the first of those had 744 layers above, we will read them together and plot layers 744-745:

url = '/vsis3/era5-pds/2020/01/data/air_pressure_at_mean_sea_level.nc'
url2 = '/vsis3/era5-pds/2020/02/data/air_pressure_at_mean_sea_level.nc'
xx = rast(c(url, url2))
#> Warning: [rast] unknown extent
#> unknown extent
nlyr(xx)
#> [1] 1440
plot(xx[[744:745]])

Created on 2022-08-22 by the reprex package (v2.0.1)

NWM forcing example:

Start with a url endpoint and open a connection:

## URL
u = '/vsis3/noaa-nwm-retrospective-2-1-pds/forcing/1979/197902010000.LDASIN_DOMAIN1'
## Lazy Read
d = rast(u)
#> Warning: [rast] unknown extent

Explore layer slices and spatial properties

nlyr(d)
#> [1] 9
(names(d))
#> [1] "LQFRAC" "LWDOWN" "PSFC" "Q2D" "RAINRATE" "SWDOWN" "T2D"
#> [8] "U2D" "V2D"
ext(d)
#> SpatExtent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
crs(d)
#> [1] "" Assign missing spatial metadata:ext(d) = ext(
-2303999.62876143,
2304000.37123857,
-1920000.70008381,
1919999.29991619
)
crs(d) = 'PROJCS["Sphere_Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DATUM["D_Sphere",SPHEROID["Sphere",6370000.0,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Lambert_Conformal_Conic"],PARAMETER["false_easting",0.0],PARAMETER["false_northing",0.0],PARAMETER["central_meridian",-97.0],PARAMETER["standard_parallel_1",30.0],PARAMETER["standard_parallel_2",60.0],PARAMETER["latitude_of_origin",40.000008],UNIT["Meter",1.0]];-35691800 -29075200 126180232.640845;-100000 10000;-100000 10000;0.001;0.001;0.001;IsHighPrecision'
# Plot (fetch data)
plot(d$LWDOWN)

That's not right!

The data is stored in reverse (topdown in GDAL), so we can flip it and plot:

plot(flip(d$LWDOWN))

plot(flip(crop(d$RAINRATE, project(AOI, crs(d)))))

So far, we have wrapped a lot of this logic in the climateR package. We have approached this by developing a catalog (https://mikejohnson51.github.io/climateR-catalogs/catalog.json) that stores the spatiotemporal and attribute metadata of discoverable assets and then helps overcome the oddities of url definition, applying the right spatial metadata, implementing transformations (e.g. flip), and slicing based on space and time. So far it's proven to be a very useful pattern for our team.

If anyone would like to talk about this kind of stuff I would love it!

Thanks, Mike

Created on 2022-08-22 by the reprex package (v2.0.1)
@mikejohnson51 this is awesome! Definitely should be a blog post somewhere!
I showed this to a colleague and he mentioned https://r-spatial.github.io/stars/ as the "xarray" of R. @mikejohnson51 are you familiar with that one?
@rsignell-usgs -- I've been contributing some stuff to stars.
I believe Bioconductor is doing some work on zarr support in R for OME-NGFF microscopy data. @MSanKeys963 may know more on how that's going.
An alternative way to achieve this occurred to me. If we allow for preffs's model of multiple references per key, read and concatenated on load, then we can add any absolute value into the reference set - for example, the bytes encoding scale 1, offset 0 as float32. The tricky part is that the concatenated buffer would normally be passed to a decompressor before the scale/offset codecs - but not in the case of netCDF3 (which is uncompressed internally). So all we have to do in this case is make a modified scale/offset codec that takes its params from the buffer, something like what the JSON codec does. For the general case where there is compression, we would need to make a compression wrapper codec that extracts the values, decompresses the rest, and passes the values back as a tuple (or remakes a combined buffer). @rabernat, this looks a lot like your idea of including extra per-chunk information in the storage layer metadata.
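To make the idea concrete, here is a hypothetical sketch of such a buffer-parameterised codec for the uncompressed netCDF3 case, assuming the reference set prepends two little-endian float32 values (scale, offset) to each chunk; the class name, codec_id, and dtypes are illustrative, not an existing numcodecs API.

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec
from numcodecs.compat import ndarray_copy


class InlineScaleOffset(Codec):
    """Hypothetical codec: per-chunk scale/offset live in the first 8 bytes."""

    codec_id = "inline_scale_offset"

    def __init__(self, dtype="<i2", astype="<f4"):
        self.dtype = dtype    # packed on-disk dtype of the netCDF3 variable
        self.astype = astype  # dtype to unpack to

    def decode(self, buf, out=None):
        raw = np.frombuffer(buf, dtype="uint8")
        # per-chunk parameters prepended via the (preffs-style) reference set
        scale, offset = raw[:8].view("<f4")
        packed = raw[8:].view(self.dtype)
        unpacked = packed.astype(self.astype) * scale + offset
        return ndarray_copy(unpacked, out)

    def encode(self, buf):
        # read-only use case (virtual aggregation), so encoding is left out
        raise NotImplementedError


register_codec(InlineScaleOffset)
```

For the compressed case, as noted above, the same prefix-splitting would instead have to happen inside a wrapper around the decompressor.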
+1 for supporting changing scale/offset.
It would be nice to handle netcdf files with different scale_factor and add_offset. I recently used the ECMWF API to generate a bunch of netcdf3 files (in parallel, using dask!) but unfortunately the generated files all have different scale_factor and add_offset.

Here are four files that we would like to virtually aggregate with kerchunk:
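For reference, a sketch of the intended aggregation, with placeholder file names standing in for the four files, using kerchunk's NetCDF3ToZarr and MultiZarrToZarr as currently documented. The combined reference set keeps only one copy of the scale_factor / add_offset attributes, which is exactly where the differing per-file values cause trouble.

```python
from kerchunk.netCDF3 import NetCDF3ToZarr
from kerchunk.combine import MultiZarrToZarr

# Placeholder URLs for the four packed netCDF3 files
urls = [f"s3://my-bucket/era5_part{i}.nc" for i in range(4)]

# One reference set per file; each still carries its own scale_factor/add_offset
refs = [NetCDF3ToZarr(u, storage_options={"anon": True}).translate() for u in urls]

# Combining along time collapses the attributes to a single set, so chunks from
# the other files would be decoded with the wrong scale/offset
combined = MultiZarrToZarr(
    refs,
    concat_dims=["time"],
    remote_protocol="s3",
    remote_options={"anon": True},
).translate()
```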