Added a notebook for faster GRIB aggregations. #64
base: main
Conversation
👋 Thanks for opening this PR! The Cookbook will be automatically built with GitHub Actions. To see the status of your deployment, click below.
Happy to merge and/or review this whenever it's ready!
@norlandrhagen The notebook is ready but it needs review. Can you please review and suggest changes?
Looks great @Anu-Ra-g! A few small comments and copy edits. Thanks for contributing. It is probably worth asking @martindurant for a review of the implementation since he is the GRIB + kerchunk guru. For context, the notebook's walkthrough (with the suggested copy edits applied) reads:

Overview: In this tutorial we demonstrate how to build kerchunk aggregations of NODD GRIB2 weather forecasts quickly. The workflow primarily involves xarray-datatree, pandas, and the grib_tree function released in kerchunk v0.2.3. We will be looking at GRIB2 files generated by NOAA's Global Ensemble Forecast System (GEFS), a weather forecast model made up of 21 separate forecasts, or ensemble members. GEFS has global coverage and is produced four times a day, every 6 hours starting at midnight, with forecasts going out to 16 days.

To build the aggregation, we first construct a hierarchical data model of the whole dataset from a set of scanned GRIB messages with the help of the grib_tree function. This data model can be opened directly using either zarr or xarray-datatree; here we use xarray-datatree to open and view it. Building the aggregation this way, by scanning every file, is very slow, which is what motivates the rest of the workflow.

Every NODD cloud platform stores each GRIB file along with a .idx (index) file in plain text. The idx files are useful for the aggregation because the k(erchunk) index data looks a lot like the idx files that already exist for every GRIB file in NODD's GCS and AWS archives. Note: this way of building the aggregation only works for a particular horizon file, irrespective of the run time of the model.

Next we need a mapping from our GRIB/Zarr metadata (stored in the grib_tree output) to the attributes in the idx files. These mappings are unique for each time horizon, e.g. we need one mapping for the 1-hour forecast, another for the 2-hour forecast, and so on. In this step we create the mapping for a single GRIB file and its corresponding idx file, which will be used in later steps to build the aggregation. We start by examining the GRIB data extracted from the datatree; the metadata we extract is static in nature, and we access a single node of the datatree to get it.

Once we parse the run time from the idx file, we can build a fully compatible k_index (kerchunk index) for that particular file. Before creating the index, we need to clean some of the data in the mapping and index dataframes, since some variables tend to contain duplicate values.
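For readers following along outside the notebook, here is a minimal sketch of the first part of that workflow: scanning a single file, merging the messages into a hierarchical store with grib_tree, opening it with xarray-datatree, and peeking at the companion .idx file. The GEFS object path is a hypothetical example and the anonymous-access storage options are assumptions; adjust them for your environment.

```python
import datatree
import fsspec
from kerchunk.grib2 import scan_grib, grib_tree

# Hypothetical GEFS 6-hour-horizon file on NOAA's NODD S3 archive
url = "s3://noaa-gefs-pds/gefs.20170101/00/gec00.t00z.pgrb2af006"

# scan_grib returns one kerchunk reference set per GRIB message;
# grib_tree merges them into a single hierarchical (zarr-like) reference store.
groups = scan_grib(url, storage_options={"anon": True})
tree_store = grib_tree(groups)

# Open the hierarchical store with xarray-datatree via a reference filesystem.
fs = fsspec.filesystem(
    "reference", fo=tree_store, remote_protocol="s3", remote_options={"anon": True}
)
dt = datatree.open_datatree(fs.get_mapper(""), engine="zarr", consolidated=False)
print(dt)

# Every GRIB file in the NODD archives has a plain-text .idx companion file.
with fsspec.open(url + ".idx", "rt", anon=True) as f:
    print("\n".join(f.read().splitlines()[:5]))
```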
For the final step of the aggregation, we create an index for each GRIB file over a two-month period and combine them into one index, which we can store for later use. We will use the 6-hour horizon files, from 2017-01-01 to 2017-02-28, since, as noted above, this approach only works for a particular horizon file. Built this way, the index can cover a whole archive of forecasts.

The difference between the idx files and the k_index (kerchunk index) built in the step above is that the former indexes the GRIB messages while the latter indexes the variables within those messages. Finally, we need a tree model from the grib_tree function to reinflate part or all of the index, i.e. the variables in the messages, as needed. The important point to note is that the tree model must be built from GRIB file(s) from the same repository that we are indexing.
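As a rough illustration of that final combining step (not the notebook's exact code), the per-file k_index dataframes can be concatenated with plain pandas and written out for reuse. build_kindex_for_file below is a hypothetical stand-in for the idx-parsing and mapping steps described above (the kerchunk helpers added in the PRs referenced later in this thread provide that functionality), and the URL layout is again a hypothetical example.

```python
import pandas as pd

def build_kindex_for_file(url: str, run_time: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical stand-in: parse the file's .idx, apply the static mapping
    built earlier, and return the k_index DataFrame for this one file."""
    raise NotImplementedError  # see the notebook / kerchunk helpers for the real logic

# 6-hour-horizon files for every run from 2017-01-01 to 2017-02-28 (4 runs per day)
run_times = pd.date_range("2017-01-01", "2017-02-28 18:00", freq="6h")
urls = [
    # hypothetical NODD GEFS object layout
    f"s3://noaa-gefs-pds/gefs.{t:%Y%m%d}/{t:%H}/gec00.t{t:%H}z.pgrb2af006"
    for t in run_times
]

# Build one k_index per file, then combine into a single index for the period.
k_index = pd.concat(
    [build_kindex_for_file(u, t) for u, t in zip(urls, run_times)],
    ignore_index=True,
)
k_index.to_parquet("gefs_6h_kindex.parquet")  # store the combined index for later use
```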
@norlandrhagen I've made the suggested changes and updated some parts of the suggestions.
Nice. It looks like the book build is failing with:
Maybe the kerchunk version needs to be bumped?
@Anu-Ra-g, do you need a release of kerchunk? I'll merge your waiting PRs now, and we can handle any cleanup later.
@martindurant I've opened PRs #497, #498, and #499 to support this notebook.
This notebook was developed as part of Google Summer of Code 2024. It describes how to build larger aggregations of GRIB files hosted under the NODD program in a short amount of time. The functions and operations used in this notebook will be part of the next version of kerchunk. This notebook is still in a draft phase because it depends on PR #63 being merged first: the old pre-commit configuration is failing on these commits, and some other updates are needed.