Variables missing from 'scan_grib', but findable with xarray and cfgrib #358
Comments
If it's helpful, I uploaded the grib2 file to a google drive. It's just under 10 MB in size. |
Sorry to ping again - I know I posted the issue on a Friday, so my apologies there. Just wanting to know if there are any ideas about where and how kerchunk is missing a variable in the dataset. |
I see
but kerchunk is only matching one grib message. I'm not certain what the codes in those two lines mean, but I wonder whether they are components of the SAME message - there are indeed 294 messages in the file. |
I fear I may need a GRIB expert to figure this one out. The values returned in variable u10 are indeed the same values as found by xarray/cfgrib. The offset and message size are correct, but I don't know how the other component of the vector in the same grib message is found. |
Thanks for taking a look at this, I really appreciate it! I'm by no means a grib expert, but I'm trying to track down what makes this encoding different. So far I've stumbled across some grib documentation that mentions velocity fields encoded in sub-messages... I haven't made much progress beyond this, but I'm trying to poke around more documentation with grib2 and cfgrib to see how this is handled. I'll report back if I find anything else useful. |
That does seem like it's talking about the same thing, but I don't see how this is handled in cfgrib/eccodes (cf. noritada/grib-rs#13). |
I may have at least narrowed down a rough idea of where in cfgrib this happens, but as far as how to activate/enable it, I'm still trying to untangle. In eccodes/cfgrib, this is referred to as a multi-field grib file. Within cfgrib/messages.py, there are some functions and conditionals that interact with |
Alright, I made some progress here. I have no idea if this is the right way to go about this, but I basically used the logic of the FileStream class in cfgrib.messages, and I can successfully print out all variables.
I'll spare dumping the whole output, but now I see v/v10 variables in the output!
So, it seems like the weird quirk of
However, if there are no objections to changing the iteration logic of scan_grib, I can modify it to follow this procedure and make a pull request. |
I am not opposed to that, except that I had hoped to only need eccodes and not also cfgrib during the access phase. Maybe that doesn't matter, so long as we can still interpret a block of bytes as a message. The GRIB codec would need to be updated to do something similar. The codec reads whole messages at a time, because we don't know how to decode the interior buffers of a GRIB (sub)message unfortunately. That would mean that we need to tell the codec which submessage we want - assuming this is consistent across all input files/messages - and would end up temporarily loading the bytes for both variables when trying to access either one. |
is interesting, because kerchunk essentially has its own index and always does random access. As above, if only the "where is the buffer" and "decode this bytes buffer" were accessible calls in the eccodes API, life would be much easier for everyone. |
If the desire is to keep things purely in terms of eccodes, then I'll double check and see how doable that is before moving forward with the cfgrib parser. Most of the calls under the hood are to eccodes, so it may be possible to rig it to work directly. This was kind of my brute-force "make it work" approach. Let me see what I can do. |
Has there been any progress on this issue? I'm running into the same problem with data in the NAM s3 bucket. I'm unable to see the v10 data using kerchunk, but the u10 message seems okay. Eli |
@emfdavid , I don't suppose you've come across this kind of thing in your grib travels? |
Hi @imcslatte - apologies for the slow response. Unfortunately due to time constraints and other responsibilities, I wasn't able to revisit this problem in order to fix it using the eccodes API directly. I have a temporary fix listed above that uses the cfgrib API, which is built on top of eccodes. I might be able to find some time in the next week or two to investigate fixing this directly from eccodes, now that I've been reminded of this... as I also have some upcoming projects with RAP data where having this issue fixed will be helpful :). If you want to implement the stopgap fix locally in your environment, I'd be happy to help. |
I have not seen sub messages in the HRRR or GFS/GEFS grib2 files. |
@keltonhalbert @martindurant I was wondering if part of the problem might be that the submessage IDs are decimals and not integers? If they are typed as integer the ids 210.1 and 210.2 would be the same. Just a thought. |
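To make that hypothesis concrete, here is a minimal sketch of what happens when sub-message labels such as 210.1 and 210.2 are cast to integers. The labels and the `parse_record_id` helper are made up for illustration; this is not kerchunk's actual bookkeeping.

```python
from typing import Tuple


def parse_record_id(label: str) -> Tuple[int, int]:
    """Split a label such as '210.2' into (message, submessage); 0 means no sub-message."""
    message, _, sub = label.partition(".")
    return int(message), int(sub) if sub else 0


labels = ["209", "210.1", "210.2"]  # hypothetical inventory labels

# Casting through float/int drops the sub-message part, so two entries collide.
as_ints = {int(float(label)) for label in labels}

# Keeping both parts keeps every field distinct.
as_pairs = {parse_record_id(label) for label in labels}
```

If an index were keyed on the integer form, the second field of message 210 would silently overwrite the first.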
@imcslatte , ooh, so
hoping this is the simple problem @martindurant ! |
I am not sure that we make any use of the ID value. I think the problem is, that the two arrays are bundled in the same grib message, and I don't know how to tell the cfgrib API "load the second sub-message". If I had a spare year, perhaps I could dig into the internals of grib to understand this, but for now I'll have to rely on people like @mpiannucci ! |
Yeah, the main problem is how GRIB sub-messages get handled. Per the eccodes documentation, sub-messages are not a recommended feature, but NCEP has been using them for RAP and NAM products (not HRRR or GFS though) for a while. The reason it works in cfgrib but not kerchunk is that cfgrib incorporates special eccodes logic to account for grib sub-messages, whereas kerchunk does not implement that logic. As far as I can tell, the bulk of the cfgrib calls to eccodes that kerchunk needs to duplicate are contained within the messages file.
The
Of particular note is checking if Now, as far as addressing this issue, my understanding is the following:
I feel pretty confident in being able to come up with a solution, but not confident that it'll be the correct solution. I guess the only way to know for sure is to try something in a fork of the repository. So, my question becomes, do the developers/maintainers have any input before diving head first into a naive solution? |
As another potentially helpful datapoint: I tried using kerchunk on ERA5 single level grib files and encountered the same issue (downloaded from ECMWF (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels) to our network share, but still painfully slow to load, so I thought I would give kerchunk a try). Here, the grib file contains 4 wind variables, Furthermore, the file contains a I would love to help, but unfortunately, I am also not a grib expert. Also happy to open a separate issue if that helps. |
Is anyone? :) |
@TAdeJong I am currently facing the same issue where Kerchunk is not able to "detect" the 24 different hours within the file. |
@mpiannucci - have you had any chance to work with sub-messages? |
I am pretty sure that the Copernicus climate store exports single level data in a grib1 file. gribberish can't even read it, because it doesn't contain all the grib2 sections when scanning through. That was with a 24 hour file I downloaded containing u100, v100, u10, v10. I may be wrong though, I haven't worked with euro data much. |
I am also struggling with ECMWF data from the new CDS beta API due to a missing
import fsspec
import xarray as xr
from kerchunk.grib2 import GribToZarr

DOWNLOADED_GRIB = "/tmp/downloaded.grib"


def download_grib():
    """You don't actually need to run this. Instead download the grib to the
    DOWNLOADED_GRIB location. This is included for completeness.
    """
    import os
    import cdsapi

    # os.environ["CDSAPI_KEY"] = "xxxxxxxx"
    dataset = "reanalysis-era5-single-levels"
    request = {
        "product_type": ["reanalysis"],
        "variable": ["2m_temperature"],
        "year": ["2024"],
        "month": ["01"],
        "day": ["01", "02"],
        "time": ["00:00"],
        "data_format": "grib",
        "area": [90, 0, -90, 360],
    }
    c = cdsapi.Client(url="https://cds-beta.climate.copernicus.eu/api")
    c.retrieve(dataset, request, DOWNLOADED_GRIB)


# download_grib()

# Reference result: open the file directly with xarray/cfgrib.
ds0 = xr.open_dataset(DOWNLOADED_GRIB, engine="cfgrib")
print("Loaded with xarray:")
print(ds0)
print()

# Same file through kerchunk references.
refs_for_messages = GribToZarr(DOWNLOADED_GRIB).translate()
assert len(refs_for_messages) == 1
ref = refs_for_messages[0]
fs = fsspec.filesystem("reference", fo=ref)
m = fs.get_mapper("")
ds = xr.open_zarr(m, consolidated=False).load()
print("Loaded with kerchunk:")
print(ds)
Note how for the same file when loaded with xarray, |
@mpiannucci , interestingly, gribberish doesn't parse the given file at all, but gives
This might be an interesting test case for you. I'd say there's a better chance, in the long run, of getting the sub-messages out of gribberish than eccodes (which requires the global "multi" state for reading this). |
Well, I found some time to try and give this a good, long wrestle and I have come to the conclusion that the problem exists within In short,

Some progress that I made is that scan_grib can now at least recognize the presence of the v10 arrays within a multi-field message, and can be tested in my personal fork. This was only achievable by using the fsspec file handler and passing it to

It appears to me that changing GRIBCodec to take a fsspec file is undesirable for multiple reasons, especially since byte streaming is kind of the whole point... but I'm at a loss for what else to do if the problem can't be remedied from eccodes side. I'm certainly out of my depth and grasping at straws to make sense of things, so if anyone with more knowledge and insight into the awfulness of the grib2 world can provide a means of encoding the appropriate array offset for the multi-field arrays, please please PLEASE speak up!

Moving forward, I see a few options...
If we get stuck with option 3, we need to figure out how to preserve what is effectively byte streaming, while tricking eccodes into thinking it's getting a file. Unfortunately, you cannot just pass eccodes a BytesIO object to achieve the desired behavior. I don't know the core of fsspec very well, but I know it was intended to handle remote file streaming, so perhaps the GRIBCodec class needs to be refactored to use fsspec rather than bytes? Input from the maintainers is appreciated... Edit: One more idea, is that if we know how to brute-force the byte offsets to read the appropriate array, that could work. Unfortunately, there are no grib2 fields/IDs that broadcast the presence of multi-field messages, at least in the eccodes API and certainly not encoded in the file metadata that I can tell. |
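On the brute-force idea: the GRIB2 container layout can be walked without eccodes. Section 0 is 16 bytes and ends with the total message length, each of Sections 1 to 7 starts with a 4-byte length and a 1-byte section number, and the message closes with "7777". A message in which Section 7 (data) appears more than once carries more than one field. The sketch below only walks that layout; the function names are my own, and it says nothing about how to decode the packed data inside each section.

```python
import struct
from typing import Iterator, Tuple


def iter_sections(msg: bytes) -> Iterator[Tuple[int, int, int]]:
    """Yield (section_number, start, length) for Sections 1-7 of one GRIB2 message."""
    if msg[:4] != b"GRIB" or msg[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    # Octets 9-16 of Section 0 hold the total message length (big-endian).
    total = struct.unpack(">Q", msg[8:16])[0]
    pos = 16  # Section 0 is always 16 bytes long
    while pos < total - 4:  # the last 4 bytes are the "7777" end marker
        # Every section 1-7 starts with a 4-byte length and a 1-byte number.
        length, number = struct.unpack(">IB", msg[pos:pos + 5])
        if length < 5:
            raise ValueError("corrupt section length")
        yield number, pos, length
        pos += length


def count_fields(msg: bytes) -> int:
    """Number of data sections (Section 7), i.e. fields, in the message."""
    return sum(1 for number, _, _ in iter_sections(msg) if number == 7)
```

Calling `count_fields` on the bytes of a message that scan_grib already located (offset plus size) would flag the multi-field case whenever the result is greater than 1, and `iter_sections` gives the byte offset of each repeated section.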
From what you have linked, the C code in eccodes wants to use a file descriptor, i.e., real local open file. That means, fsspec can't do it (except to copy the bytes to a local temporary location, RAMdisk or something). How strange that they should have a completely different way to handle a bytes buffer versus a file!
You said you had code to detect this? I don't see from the commit. But if knowing beforehand is enough, we can store the fact in the init parameters for the grib codec. |
@martindurant Sorry for the confusion regarding the commit - that code doesn't explicitly detect the presence of multi-field messages, but by relying on In short, if you keep track of the That said, clearly |
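For readers following along, the bookkeeping being described can be sketched on plain integers. My understanding (an assumption from reading cfgrib's messages module, not something confirmed in this thread) is that every field of a multi-field message reports the same message offset, so a repeated offset marks an extra field. The offsets below are invented.

```python
from typing import Iterable, Iterator, Tuple, Union

FieldKey = Union[int, Tuple[int, int]]


def field_keys(offsets: Iterable[int]) -> Iterator[FieldKey]:
    """Turn per-field message offsets into keys that are unique per field.

    The first field at an offset keeps the bare offset; each further field
    at the same offset becomes (offset, n) with n = 1, 2, ...
    """
    previous = None
    count = 0
    for offset in offsets:
        if offset == previous:
            count += 1
            yield (offset, count)
        else:
            previous = offset
            count = 0
            yield offset


# Hypothetical offsets: the message at byte 1200 holds two fields.
keys = list(field_keys([0, 1200, 1200, 5000]))
```

Any tuple key in the result identifies a field that a one-reference-per-message index would miss.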
It sounds like, we can find multi cases, then, but only if we copy messages to disk first. That's totally doable and what scan_grib did do once upon a time. For decoding at read time, well it would work but be pretty annoying! We could advise users to ensure that their temporary storage is memory-based.
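A minimal sketch of that copy-to-disk step, preferring a memory-backed directory when one is available. The helper name and the `/dev/shm` default are assumptions (that path is Linux-specific), and the fallback is the normal system temp directory.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def as_local_file(buf: bytes, ram_dir: str = "/dev/shm") -> Iterator[str]:
    """Write buf to a temporary file, yield its path, and delete it afterwards."""
    # Use the RAM-backed directory only if it exists and is writable.
    usable = os.path.isdir(ram_dir) and os.access(ram_dir, os.W_OK)
    fd, path = tempfile.mkstemp(suffix=".grib2", dir=ram_dir if usable else None)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(buf)
        yield path
    finally:
        os.remove(path)
```

Inside the `with` block the path can be opened in binary mode and handed to eccodes, for example via `codes_grib_multi_support_on()` followed by repeated `codes_grib_new_from_file(f)` calls. I have not benchmarked what this costs per chunk at read time.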
Interesting! Would you mind showing this for an example multi file? I wonder if with the two offsets we have enough information to make two non-multi messages for eccodes at runtime. |
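One way that idea might look, purely at the container level: GRIB2 allows Sections 2 to 7, 3 to 7, or 4 to 7 to repeat inside a message, and unrepeated sections stay in effect until redefined. So each field can in principle be rebuilt as its own message from the latest copy of every section plus a new Section 0 length. This is a sketch under those assumptions; `split_fields` is a hypothetical helper, and I have not verified that eccodes accepts the rebuilt messages.

```python
import struct
from typing import Dict, List, Optional


def split_fields(msg: bytes) -> List[bytes]:
    """Rebuild every field of a GRIB2 message as a standalone single-field message."""
    if msg[:4] != b"GRIB" or msg[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    total = struct.unpack(">Q", msg[8:16])[0]
    current: Dict[int, bytes] = {}        # latest raw bytes of each section number
    last_bitmap: Optional[bytes] = None   # latest Section 6 that carries a real bitmap
    out: List[bytes] = []
    pos = 16  # Section 0 is always 16 bytes long
    while pos < total - 4:  # stop before the "7777" end marker
        length, number = struct.unpack(">IB", msg[pos:pos + 5])
        if length < 5:
            raise ValueError("corrupt section length")
        raw = msg[pos:pos + length]
        current[number] = raw
        if number == 6 and len(raw) > 5 and raw[5] not in (254, 255):
            last_bitmap = raw
        pos += length
        if number == 7:  # a data section closes one field
            sections = dict(current)
            bitmap = sections.get(6)
            # Indicator 254 means "reuse the bitmap defined earlier in this message".
            if bitmap is not None and len(bitmap) > 5 and bitmap[5] == 254:
                if last_bitmap is None:
                    raise ValueError("bitmap reuse without an earlier bitmap")
                sections[6] = last_bitmap
            body = b"".join(sections[n] for n in range(1, 8) if n in sections)
            # Keep the first 8 octets of Section 0 and rewrite the total length.
            header = msg[:8] + struct.pack(">Q", 16 + len(body) + 4)
            out.append(header + body + b"7777")
    return out
```

The cost is that both fields' bytes are still fetched to build either message, which matches the earlier point about temporarily loading both variables.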
I'm encountering an interesting issue where the results of scan_grib differ from interacting with a file via xarray/cfgrib. Particularly, it is not detecting certain variables. Installation information:
In this particular case, the missing variable is the 10 meter V wind component. Using scan_grib:
As you can see, there is no v10 variable output. Here's the printed output directly from scan_grib, showing that the data is missing here too:
However, when I use xarray/cfgrib, the v10 variable is present and accounted for:
Output from wgrib2:
Any suggestions on what I may be doing wrong, or where the issue might lie? Happy to make a PR if there's a bug, just not sure if 1) there is one and 2) where to start addressing it.