
scan_grib unable to process spectra data #540

Open
gpickering-WEL opened this issue Jan 29, 2025 · 3 comments

@gpickering-WEL

Hi there! I've been processing 2D wave spectra GRIB files using kerchunk and have found that it doesn't correctly identify the coordinates or the dimensions of the data. The variable d2fd has the dimensions (directionNumber, frequencyNumber, latitude, longitude), so a 2D array per lat/lon point. Xarray and cfgrib read the data correctly, but kerchunk does not. See below:

cfgrib

import xarray as xr

xr.load_dataset("ec_spectra/T1P12180000010106001", engine="cfgrib")

<xarray.Dataset> Size: 3MB
Dimensions:          (directionNumber: 36, frequencyNumber: 29, latitude: 21,
                      longitude: 31)
Coordinates:
    number           int64 8B 0
    time             datetime64[ns] 8B 2024-12-18
    step             timedelta64[ns] 8B 14 days 06:00:00
    meanSea          float64 8B 0.0
  * directionNumber  (directionNumber) int64 288B 1 2 3 4 5 6 ... 32 33 34 35 36
  * frequencyNumber  (frequencyNumber) int64 232B 1 2 3 4 5 6 ... 25 26 27 28 29
  * latitude         (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.9 -40.0
  * longitude        (longitude) float64 248B 141.0 141.1 141.2 ... 143.9 144.0
    valid_time       datetime64[ns] 8B 2025-01-01T06:00:00
Data variables:
    d2fd             (directionNumber, frequencyNumber, latitude, longitude) float32 3MB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2025-01-23T14:57 GRIB to CDM+CF via cfgrib-0.9.1...

kerchunk

import json

import fsspec
import xarray as xr
from kerchunk.grib2 import scan_grib

single_ref = scan_grib("./ec_spectra/T1P12180000010100001")
print(f"Length of single_ref: {len(single_ref)}")
with open("single_ref.json", "w") as f:
    json.dump(single_ref[0], f)

fs = fsspec.filesystem("reference", fo="single_ref.json")
xr.open_dataset(fs.get_mapper(""), engine="zarr")

Length of single_ref: 1
/home/gpickering/projects/welvops-Py/.venv/lib/python3.10/site-packages/xarray/backends/api.py:571: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  backend_ds = backend.open_dataset(
<xarray.Dataset> Size: 6kB
Dimensions:     (latitude: 21, longitude: 31)
Coordinates:
  * latitude    (latitude) float64 168B -38.0 -38.1 -38.2 ... -39.8 -39.9 -40.0
  * longitude   (longitude) float64 248B 141.0 141.1 141.2 ... 143.8 143.9 144.0
    meanSea     float64 8B ...
    number      int64 8B ...
    step        timedelta64[ns] 8B ...
    time        datetime64[ns] 8B ...
    valid_time  datetime64[ns] 8B ...
Data variables:
    d2fd        (latitude, longitude) float64 5kB ...
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            1
    GRIB_subCentre:          0
    institution:             European Centre for Medium-Range Weather Forecasts

I've been trying to understand the problem but unfortunately haven't reached a solution. Here are a few things I've noticed, though.

  • cfgrib has specific variables for spectra keys (directionNumber and frequencyNumber). Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates, and this doesn't contain either directionNumber or frequencyNumber.
  • _split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.
$ grib_ls ./notebooks/ec_spectra/T1P12180000010100001 | tail -n 10
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1            ecmf         meanSea      0            20241218     336          fc           2dfd         grid_simple  regular_ll  
1044 of 1044 messages in ./notebooks/ec_spectra/T1P12180000010100001

1044 of 1044 total messages in 1 files
import io

from eccodes import codes_get, codes_new_from_message, codes_release


def _split_file2(f: io.FileIO, skip=0):
    """Yield (offset, length, bytes) for each GRIB message, using
    eccodes to decode each message header and read its total length."""
    data = f.read()
    offset = 0
    part = 0

    while offset < len(data):
        # Decode the message starting at `offset` to read its length
        h = codes_new_from_message(data[offset:])
        total_length = codes_get(h, "totalLength")
        codes_release(h)
        yield offset, total_length, data[offset : offset + total_length]
        offset += total_length
        part += 1
        if skip and part >= skip:
            break
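
For reference, it should also be possible to split without the eccodes dependency by parsing the GRIB indicator section directly: edition 1 stores a 24-bit total length in octets 5-7, while edition 2 stores a 64-bit length in octets 9-16. A minimal sketch, assuming well-formed framing (this is a hypothetical helper, not kerchunk's current implementation):

```python
import struct

def split_grib_messages(data: bytes):
    """Yield (offset, length) for each GRIB message in `data` by
    parsing the indicator section directly; no eccodes needed.

    Assumes well-formed GRIB1/GRIB2 framing. Hypothetical helper,
    not part of kerchunk.
    """
    offset = 0
    while True:
        start = data.find(b"GRIB", offset)
        if start == -1:
            return
        edition = data[start + 7]  # octet 8 holds the GRIB edition
        if edition == 1:
            # GRIB1: octets 5-7 are a 24-bit big-endian total length
            length = int.from_bytes(data[start + 4 : start + 7], "big")
        else:
            # GRIB2: octets 9-16 are a 64-bit big-endian total length
            (length,) = struct.unpack(">Q", data[start + 8 : start + 16])
        yield start, length
        offset = start + length
```

Because this jumps straight to the next message rather than decoding each one, it avoids the per-message handle creation in the eccodes version.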

Any suggestions would be greatly appreciated!

@martindurant
Member

_split_file doesn't break the grib file into its component messages (I think it's meant to?). The example file I've been testing has 1044 messages. I copied the ECMWF example on reading messages from a byte stream to create a new _split_file but that wasn't a complete fix.

This seems to be the most critical thing and we should try to understand what's going wrong. In what way was your alternative incomplete? It does mean streaming all the bytes rather than jumping to the next message, but that makes no practical difference.

@dcherian
Contributor

Kerchunk uses cfgrib.dataset.COORD_ATTRS in a few places to loop over possible coordinates. This doesn't contain either directionNumber, nor frequencyNumber.

This could be updated. AFAICT there's no obvious way to figure out what the related "coordinate" vars are, so we'll need some sort of hardcoding here.
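
For what it's worth, that hardcoding could look something like the sketch below. The names come from the spectral GRIB keys mentioned above; the attribute values are illustrative, not cfgrib's actual metadata:

```python
# Hypothetical additions to the coordinate set kerchunk scans for.
# The attrs here are illustrative placeholders.
EXTRA_SPECTRA_COORDS = {
    "directionNumber": {"units": "1", "long_name": "wave direction number"},
    "frequencyNumber": {"units": "1", "long_name": "wave frequency number"},
}

def coord_names(base_coord_attrs):
    """Coordinate names to probe for, extending a COORD_ATTRS-style
    mapping (name -> attrs dict) with the spectral index keys."""
    return list({**base_coord_attrs, **EXTRA_SPECTRA_COORDS})
```

In practice `base_coord_attrs` would be `cfgrib.dataset.COORD_ATTRS`, so the existing loops could iterate over the extended mapping unchanged.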

@gpickering-WEL
Author

Thanks both for the quick response. @martindurant, swapping out _split_file in this way generated a kerchunk reference for each GRIB message, but the coordinates, dimensions and data were still incorrect.

@dcherian yes it's not obvious how the coordinate variables are constructed. I've spent some time today understanding the cfgrib implementation but haven't reached a point where I could do something similar within kerchunk. Below is my current understanding of how cfgrib generates coordinates and data dimensions.

  1. index_keys are established: a list of 22 hardcoded attributes and dimensions (INDEX_KEYS). Extra coordinates can also be supplied and are added to the set.
  2. messages.FileIndex.from_indexpath_or_filestream is called with the cfgrib.messages.FileStream and the index keys.
  3. If the index file doesn't exist (first file open) or doesn't work (for some reason the index files generated never work; they fail on this getattr), then from_fieldset is called.
  4. This just creates an iterator representing each message in the filestream, then calls from_fieldset_and_iteritems.
  5. I think this is where the important stuff happens. The outer loop is over the messages, while the inner loop is over the index_keys (those 22 possible attributes and dimensions). Each key is checked for presence within the message and stored if it exists. This results in a field_ids_index whose keys are the attributes/dimensions and whose values are the message offsets (byte position of the start of each message).
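
The indexing in step 5 can be sketched roughly like this. It is a simplification: plain dicts stand in for real GRIB message handles, and the function name is mine, not cfgrib's:

```python
def build_field_index(messages, index_keys):
    """Rough sketch of cfgrib's field-index construction: for each
    message, collect the values of whichever index keys are present
    and map that header tuple to the message's byte offset.

    `messages` is an iterable of (offset, mapping) pairs; plain dicts
    stand in for real GRIB message handles here.
    """
    field_ids_index = {}
    for offset, message in messages:
        header = tuple(
            (key, message[key]) for key in index_keys if key in message
        )
        field_ids_index.setdefault(header, []).append(offset)
    return field_ids_index
```

Messages sharing the same header values end up grouped under one key, which is what later lets cfgrib infer which keys vary (and so become dimensions).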

This is about as far as I've gotten; I haven't yet looked in detail at how the dataset is opened from this index, though there's the combination of header_dimensions and geo_dimensions. Geo dimensions are the usual lat/lon, while header dimensions are our missing frequencyNumber and directionNumber. See this section, which starts with the header dimensions, then does the geo dimensions, then combines them. I think kerchunk currently only checks for lat and lon with m["Ny"] and m["Nx"], plus level after that.
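
Based on that reading, the dimension assembly might be approximated as follows. This is a sketch with made-up helper names; cfgrib's actual implementation differs in detail:

```python
def build_dimensions(header_values, geo_dims):
    """Combine header dimensions with geo dimensions, mirroring the
    cfgrib pattern described above: a header key becomes a dimension
    only when it takes more than one value across the messages.
    Hypothetical helper, not cfgrib's API.
    """
    header_dims = [
        key for key, values in header_values.items()
        if len(set(values)) > 1
    ]
    return header_dims + list(geo_dims)
```

With 36 directionNumber values and 29 frequencyNumber values across the 1044 messages, those two keys would become dimensions ahead of latitude/longitude, matching the shape cfgrib reports for d2fd.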
