
Exceeding NWM API budget for subsetting NWM medium-range forecasts to CAMELS and HEFS basin COMIDs #257

Open
joshsturtevant opened this issue Jan 15, 2025 · 14 comments

Comments

@joshsturtevant

1. Requester Information:
Josh Sturtevant ([email protected]; PhD student)
Andy Wood ([email protected]; Project PI)

2. Link to Existing Infrastructure Ticket:
n/a

3. Justification for increased budget:
A core tenet of the CIROH Hydrologic Prediction Testbed (CHPT) project is the development of benchmarks against which CIROH modeling experiments can be compared, for the purposes of understanding relative performance and measuring incremental progress. One key operational reference capability (or benchmark) in the CHPT is the NWM. Nearly all published hydrologic model benchmarking studies to date evaluate retrospective NWM performance, not the operational medium-range streamflow forecasts (as archived on Google Cloud). In support of improved R2O, this project proposes to evaluate the NWM medium-range forecasts across a subset of important basins in the US (the CAMELS and HEFS testbed basins). This request for an increased NWM API budget supports the development of these standards and benchmarks as part of community protocols within the CHPT.

4. Resource Requirements:
Increased quota for NWM API.

Options:

  1. Cloud Provider: AWS/Azure/GCP
    Google BigQuery (NWM API)

  2. Required Services in the Cloud:
    n/a

5. Timeline:
January 2025

6. Security and Compliance Requirements:
n/a

7. Cost Estimation:
We estimate that the medium-range forecasts on Google BigQuery total ~200 TB (across all ensemble members, forecast lead times, initializations, and forecast cycles). At a cost of $5/TB, this is about $1,000 to query the data. Since we are subsetting the 2.7 million NWM NHDPlusV2 reaches down to <1,000 reaches (i.e., COMIDs), the egress will be only in the tens of GB, so those costs should be quite minimal (<$100). No data storage is needed.

We anticipate needing to query and subset the data twice: once for the CAMELS basins, and a second time for the HEFS testbed basins. As a result, we estimate that the costs will be a maximum of ~$2,500 (including additional budget for testing/verifying). This work will be a one-time data request in January, with subsequent requests expected to be well below the $500/month budget.
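
For illustration, the subsetting pattern we have in mind looks roughly like the sketch below. This is only a sketch: the dataset/table and column names are placeholders for the actual NWM BigQuery schema, and the COMID list would come from the CAMELS/HEFS tables. The `maximum_bytes_billed` setting aborts a job that would scan more than the stated cap, which acts as a guard against runaway costs:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project

comids = [4587092, 1630500]  # hypothetical CAMELS/HEFS outlet COMIDs

# Placeholder table/column names -- the real NWM BigQuery schema may differ.
sql = """
SELECT feature_id, reference_time, ensemble, time, streamflow
FROM `your-project.nwm.medium_range_channel_rt`
WHERE feature_id IN UNNEST(@comids)
  AND DATE(reference_time) BETWEEN '2018-10-01' AND '2024-09-30'
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("comids", "INT64", comids)],
    maximum_bytes_billed=2 * 1024**4,  # abort if the query would bill more than 2 TiB
)

df = client.query(sql, job_config=job_config).to_dataframe()
print(df.head())
```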

8. Approval:
Once the form is submitted, we will email CIROH management to obtain approval.

**9. Project to charge to:** (For CIROH IT Admin to fill out based on the approval process)
Indicate the necessary approval processes or sign-offs required for the request.

@benlee0423

@joshsturtevant
Would it be possible to send a small subset of requests first and estimate the cost from that?
Google's documentation was not very user friendly; it is a little vague about what data counts toward their pricing.

Based on https://cloud.google.com/bigquery/pricing#on_demand_pricing,
$6.25 per TiB
The first 1 TiB per month is free
6.25 * 200 = $1250

@joshsturtevant
Author

@benlee0423, we can certainly optimize our workflows to query this dataset only once. Also, I believe my initial estimate of 200 TB for the total size of the 2018-2024 NWM medium-range forecast archive is an upper limit, as it extrapolates from the larger NWM3.0 file sizes (hourly, 7-member forecasts) to the earlier NWM2.0 forecasts (which are only 3-hourly, single-member runs).

@benlee0423

@joshsturtevant
I believe the requests you sent on the night of January 13 resulted in a 500 Server Error on January 14. Despite the server error, the requests were still charged on our side, and I see a $400 billing entry in BigQuery for January 14.

It seems that the pricing model described in the official documentation is somewhat misleading; the scenario I described above does not align with their stated model. The charges are based on the SELECT statements actually executed, i.e., the queries issued by the Python source code in this repository.

To better estimate costs and prevent high charges, I recommend testing with a smaller subset of requests, such as 1/50 or 1/100 of the original size. This will allow us to calculate the cost more accurately before incurring a significant expense.
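
For example, a dry run will report how many bytes a query would scan without actually executing or billing it (a rough sketch only; the table and column names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project

# Dry run: BigQuery validates the query and reports bytes scanned without billing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job_config.query_parameters = [
    bigquery.ArrayQueryParameter("comids", "INT64", [4587092, 1630500])  # example COMIDs
]

sql = """
SELECT feature_id, reference_time, time, streamflow
FROM `your-project.nwm.medium_range_channel_rt`  -- placeholder table name
WHERE feature_id IN UNNEST(@comids)
"""

job = client.query(sql, job_config=job_config)
tib = job.total_bytes_processed / 1024**4
print(f"Would scan {tib:.2f} TiB (~${tib * 6.25:,.2f} at on-demand pricing)")
```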

If you decide to proceed with this approach, please inform me in advance. Note that the associated costs are not calculated in real time; we only receive the billing data one day later.

I hope this explanation clarifies the situation. Please let me know if you have any questions or need further assistance.

@joshsturtevant
Author

@benlee0423, ah I think I now understand what you were originally asking. Thanks for clarifying.

Before I hit the 500 server error, we were able to download a total of 1,628 NWM medium-range forecasts. So, knowing that this cost about $400, we should now have enough info to estimate the total cost for this request:

The Google NWM operational medium-range forecast archive is available from 2018-09-17 to yesterday, with four forecast cycles per day (00, 06, 12, and 18z). If we want to download and process all cycles across 2018-10-01 to 2024-09-30 (six water years), this works out to 8,768 forecast initializations. Assuming the costs scale linearly from our initial $400 charge on Jan 13, we estimate that the total cost would be about $2,150 (since it cost $400 to process & download 18.6% of the dataset of interest). Note that this initial request only subset the CAMELS basin COMIDs -- to limit our queries, I think we would want to rerun this from the top to subset out both the CAMELS basins and the HEFS testbed basins, since we anticipate wanting NWM forecasts across both domains.
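
For reference, the scaling arithmetic is simply:

```python
# Linear cost scaling from the partial download on Jan 13
downloaded = 1_628      # forecasts retrieved before the 500 error
total = 8_768           # initializations, 2018-10-01 to 2024-09-30, 4 cycles/day
observed_cost = 400.0   # USD billed for the partial download

fraction = downloaded / total               # ~18.6% of the archive
estimated_total = observed_cost / fraction  # ~$2,150
print(f"{fraction:.1%} of archive -> estimated total ${estimated_total:,.0f}")
```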

Lastly, I will note that this calculation might underestimate the final cost, because the later part of the record (NWM3.0) has higher temporal resolution and more ensemble members than the earlier part of the record (NWM2.1). Our Jan 13th download was a parallel request starting at the beginning of each year (2018-2024), but the final ~1.6k files are definitely skewed more heavily towards the earlier part of the record.

@andywood, including you here since you are the project PI.

@benlee0423

@arpita0911patel
Including for the conversation.
Please let us know if the estimate of $2,150 is reasonable and whether we can proceed with the approval.

@andywood

andywood commented Jan 17, 2025 via email

@arpita0911patel
Member

Thank you for submitting this request. I'll get back to you today on this after checking with management.

Thank you,
Arpita

@arpita0911patel self-assigned this Jan 17, 2025
@arpita0911patel added the `R2OHC`, `Resource request`, `Infrastructure Request - Google, AWS, On-premises`, and `google` labels Jan 17, 2025
@jameshalgren
Member

ping @KMarkert

@joshsturtevant -- Confirm -- you are trying to create a dataset with the time series of flow outputs from the NWM forecasts for the CAMELS and HEFS (and presumably other HARBOR) datasets. Correct? Will you also need to collect the forcings from these forecasts? (these are not retained in the BigQuery database...)

@joshsturtevant
Author

Hi @jameshalgren, correct. We are really only interested in the streamflow variable of the NWM medium-range forecasts, and do not need the NWM forcings.

@andywood

andywood commented Jan 17, 2025 via email

@jameshalgren
Member

What if we took advantage of the parquet files that are an intermediate resource?
e.g., https://console.cloud.google.com/storage/browser/national-water-model-parq/channel_rt/medium_range;tab=objects?prefix=nwm.20250116
(i.e., mirror the compressed parquet data to a storage location adjacent to the HPC; I don't like downloading, usually, but the parquet files are much more compact and play with xarray/dask very nicely/speedily.)
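
Reading them directly could look something like the sketch below (the file layout, column names, and anonymous public access are all assumptions; gcsfs would be needed for the gs:// path):

```python
import dask.dataframe as dd

comids = [4587092, 1630500]  # hypothetical CAMELS/HEFS COMIDs

# Read one forecast cycle's medium-range channel_rt parquet files straight from GCS,
# pushing the COMID subset down into the parquet reader.
ddf = dd.read_parquet(
    "gs://national-water-model-parq/channel_rt/medium_range/nwm.20250116/*.parquet",
    filters=[("feature_id", "in", comids)],
    storage_options={"token": "anon"},  # assumes the bucket allows anonymous reads
)

flows = ddf.compute()  # small after filtering, so it fits comfortably in memory
print(flows.head())
```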

Alternatively, if we are looking at bulk download, it might be easier to query the tables with SQL directly: https://goo.gle/nwm-on-bq

@joshsturtevant
Author

Thanks for the brainstorming, @jameshalgren. From what I can tell, it seems the parquet files are only available through 2020-01-16. We actually started down the bulk download path initially, but quickly realized that downloading 100s of TBs of NWM forecast data to then subset out only 10s of GBs was highly inefficient, given the availability of the NWM API for exactly these sorts of tasks (albeit smaller requests, perhaps).

If @andywood is willing to have our project cover the associated BQ costs, is AWI willing to temporarily lift our API quota for this one-time request? We are happy to submit the job during off-peak hours (e.g., weekends and/or evenings) to avoid inconveniencing other users.

@andywood

andywood commented Jan 17, 2025 via email

@arpita0911patel
Member

@joshsturtevant - @benlee0423 has added alerts for the NWM BigQuery API. Let's coordinate on Slack when you are ready to use it.
