
Exceeding NWM API budget for subsetting NWM medium-range forecasts to CAMELS and HEFS basin COMIDs #257

Open
joshsturtevant opened this issue Jan 15, 2025 · 14 comments

Comments

@joshsturtevant

1. Requester Information:
Josh Sturtevant ([email protected]; PhD student)
Andy Wood ([email protected]; Project PI)

2. Link to Existing Infrastructure Ticket:
n/a

3. Justification for increased budget:
A core tenet of the CIROH Hydrologic Prediction Testbed (CHPT) project is the development of benchmarks against which CIROH modeling experiments can be compared, for the purposes of understanding relative performance and measuring incremental progress. One key operational reference capability (or benchmark) in the CHPT is the NWM. Nearly all published hydrologic model benchmarking studies to date evaluate retrospective NWM performance, not the operational medium-range streamflow forecasts (as archived on Google Cloud). In support of improved R2O, this project proposes to evaluate the NWM medium-range forecasts across a subset of important basins in the US (the CAMELS and HEFS testbed basins). This request for an increased NWM API budget supports the development of these standards and benchmarks as part of community protocols within the CHPT.

4. Resource Requirements:
Increased quota for NWM API.

Options:

  1. Cloud Provider: AWS/Azure/GCP
    Google BigQuery (NWM API)

  2. Required Services in the Cloud:
    n/a

5. Timeline:
January 2025

6. Security and Compliance Requirements:
n/a

7. Cost Estimation:
We estimate that the medium-range forecasts on Google BigQuery total ~200 TB (across all ensemble members, forecast lead times, initializations, and forecast cycles). At a cost of $5/TB, this is about $1,000 to query the data. Since we are subsetting the 2.7 million NWM NHDPlusV2 reaches down to <1,000 reaches (i.e., COMIDs), the egress will be only in the tens of GB, so those costs should be quite minimal (<$100). No data storage is needed.

We anticipate needing to query and subset the data twice: once for the CAMELS basins, and a second time for the HEFS testbed basins. As a result, we estimate that the costs will be a maximum of ~$2,500 (including additional budget for testing/verifying). This work will be a one-time data request in January, with subsequent requests expected to be well below the $500/month budget.
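
For illustration, the subsetting pattern we have in mind looks roughly like the sketch below. This is only a sketch: the dataset/table and column names are placeholders for the actual NWM BigQuery schema, and the COMID list would come from the CAMELS/HEFS tables. The `maximum_bytes_billed` setting aborts a job that would scan more than the stated cap, which acts as a guard against runaway costs:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project

comids = [4587092, 1630500]  # hypothetical CAMELS/HEFS outlet COMIDs

# Placeholder table/column names -- the real NWM BigQuery schema may differ.
sql = """
SELECT feature_id, reference_time, ensemble, time, streamflow
FROM `your-project.nwm.medium_range_channel_rt`
WHERE feature_id IN UNNEST(@comids)
  AND DATE(reference_time) BETWEEN '2018-10-01' AND '2024-09-30'
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("comids", "INT64", comids)],
    maximum_bytes_billed=2 * 1024**4,  # abort if the query would bill more than 2 TiB
)

df = client.query(sql, job_config=job_config).to_dataframe()
print(df.head())
```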

8. Approval:
Once the form is submitted, we will email CIROH management to obtain approval.

**9. Project to charge to:** (For CIROH IT Admin to fill out based on the approval process)
Indicate the necessary approval processes or sign-offs required for the request.

@benlee0423

@joshsturtevant
Would it be possible to send a small subset of requests first and estimate the cost from that?
Google's documentation was not very user friendly; it is a little vague about what data counts toward their pricing.

Based on https://cloud.google.com/bigquery/pricing#on_demand_pricing,
$6.25 per TiB
The first 1 TiB per month is free
6.25 * 200 = $1250

@joshsturtevant
Author

@benlee0423, we can certainly optimize our workflows to query this dataset only once. Also, I believe my initial estimate of 200 TB for the total size of the 2018-2024 NWM medium-range forecast archive is an upper limit, as it extrapolates from the larger NWM3.0 file sizes (hourly, 7-member forecasts) to the earlier NWM2.0 forecasts (which are only 3-hourly, single-member runs).

@benlee0423

@joshsturtevant
I believe the requests you sent on the night of January 13 resulted in a 500 Server Error on January 14. Despite the server error, the requests were still charged on our side, and I see a $400 billing entry in BigQuery for January 14.

It seems that the pricing model described in the official documentation is somewhat misleading; the scenario I described above does not align with their stated model. The charges are based on the SELECT statements actually executed, i.e., the queries issued by the Python source code in this repository.

To better estimate costs and prevent high charges, I recommend testing with a smaller subset of requests, such as 1/50 or 1/100 of the original size. This will allow us to calculate the cost more accurately before incurring a significant expense.
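
For example, a dry run will report how many bytes a query would scan without actually executing or billing it (a rough sketch only; the table and column names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project

# Dry run: BigQuery validates the query and reports bytes scanned without billing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job_config.query_parameters = [
    bigquery.ArrayQueryParameter("comids", "INT64", [4587092, 1630500])  # example COMIDs
]

sql = """
SELECT feature_id, reference_time, time, streamflow
FROM `your-project.nwm.medium_range_channel_rt`  -- placeholder table name
WHERE feature_id IN UNNEST(@comids)
"""

job = client.query(sql, job_config=job_config)
tib = job.total_bytes_processed / 1024**4
print(f"Would scan {tib:.2f} TiB (~${tib * 6.25:,.2f} at on-demand pricing)")
```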

If you decide to proceed with this approach, please inform me in advance. Note that the associated costs are not calculated in real time; we only receive the billing data one day later.

I hope this explanation clarifies the situation. Please let me know if you have any questions or need further assistance.

@joshsturtevant
Author

@benlee0423, ah I think I now understand what you were originally asking. Thanks for clarifying.

Before I hit the 500 server error, we were able to download a total of 1,628 NWM medium-range forecasts. So, knowing that this cost about $400, we should now have enough info to estimate the total cost for this request:

The Google NWM operational medium-range forecast archive is available from 2018-09-17 to yesterday, with four forecast cycles per day (00, 06, 12, and 18z). If we want to download and process all cycles across 2018-10-01 to 2024-09-30 (six water years), this works out to 8,768 forecast initializations. Assuming the costs scale linearly from our initial $400 charge on Jan 13, we estimate that the total cost would be about $2,150 (since it cost $400 to process & download 18.6% of the dataset of interest). Note that this initial request only subset the CAMELS basin COMIDs -- to limit our queries, I think we would want to rerun this from the top to subset out both the CAMELS basins and the HEFS testbed basins, since we anticipate wanting NWM forecasts across both domains.
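
For reference, the scaling arithmetic is simply:

```python
# Linear cost scaling from the partial download on Jan 13
downloaded = 1_628      # forecasts retrieved before the 500 error
total = 8_768           # initializations, 2018-10-01 to 2024-09-30, 4 cycles/day
observed_cost = 400.0   # USD billed for the partial download

fraction = downloaded / total               # ~18.6% of the archive
estimated_total = observed_cost / fraction  # ~$2,150
print(f"{fraction:.1%} of archive -> estimated total ${estimated_total:,.0f}")
```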

Lastly, I will note that this calculation might underestimate the final cost, because the later part of the record (NWM3.0) has higher temporal resolution and more ensemble members than the earlier part of the record (NWM2.1). Our Jan 13th download was a parallel request starting at the beginning of each year (2018-2024), but the final ~1.6k files are definitely skewed more heavily towards the earlier part of the record.

@andywood, including you here since you are the project PI.

@benlee0423

@arpita0911patel
Including for the conversation.
Please let us know if the estimate of $2,150 is reasonable and whether we can proceed with the approval.

@andywood

andywood commented Jan 17, 2025 via email

@arpita0911patel
Member

Thank you for submitting this request. I'll get back to you today on this after checking with management.

Thank you,
Arpita

@arpita0911patel self-assigned this Jan 17, 2025
@arpita0911patel added the `R2OHC`, `Resource request`, `Infrastructure Request - Google, AWS, On-premises`, and `google` labels Jan 17, 2025
@jameshalgren
Member

ping @KMarkert

@joshsturtevant -- Confirm -- you are trying to create a dataset with the time series of flow outputs from the NWM forecasts for the CAMELS and HEFS (and presumably other HARBOR) datasets. Correct? Will you also need to collect the forcings from these forecasts? (these are not retained in the BigQuery database...)

@joshsturtevant
Author

Hi @jameshalgren, correct. We are really only interested in the streamflow variable of the NWM medium-range forecasts, and do not need the NWM forcings.

@andywood

andywood commented Jan 17, 2025 via email

@jameshalgren
Member

What if we took advantage of the parquet files that are an intermediate resource?
e.g., https://console.cloud.google.com/storage/browser/national-water-model-parq/channel_rt/medium_range;tab=objects?prefix=nwm.20250116
(i.e., mirror the compressed parquet data to a storage location adjacent to the HPC; I don't like downloading, usually, but the parquet files are much more compact and play with xarray/dask very nicely/speedily.)
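
Reading them directly could look something like the sketch below (the file layout, column names, and anonymous public access are all assumptions; gcsfs would be needed for the gs:// path):

```python
import dask.dataframe as dd

comids = [4587092, 1630500]  # hypothetical CAMELS/HEFS COMIDs

# Read one forecast cycle's medium-range channel_rt parquet files straight from GCS,
# pushing the COMID subset down into the parquet reader.
ddf = dd.read_parquet(
    "gs://national-water-model-parq/channel_rt/medium_range/nwm.20250116/*.parquet",
    filters=[("feature_id", "in", comids)],
    storage_options={"token": "anon"},  # assumes the bucket allows anonymous reads
)

flows = ddf.compute()  # small after filtering, so it fits comfortably in memory
print(flows.head())
```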

Alternatively, if we are looking at bulk download, it might be easier to query the tables with SQL directly: https://goo.gle/nwm-on-bq

@joshsturtevant
Author

Thanks for the brainstorming, @jameshalgren. From what I can tell, it seems the parquet files are only available through 2020-01-16. We actually started down the bulk download path initially, but quickly realized that downloading 100s of TBs of NWM forecast data to then subset out only 10s of GBs was highly inefficient, given the availability of the NWM API for exactly these sorts of tasks (albeit smaller requests, perhaps).

If @andywood is willing to have our project cover the associated BQ costs, is AWI willing to temporarily lift our API quota for this one-time request? We are happy to submit the job during off-peak hours (e.g., weekends and/or evenings) to avoid inconveniencing other users.

@andywood

andywood commented Jan 17, 2025 via email

@arpita0911patel
Member

@joshsturtevant - @benlee0423 has added alerts for the NWM BigQuery API. Let's coordinate on Slack when you are ready to use it.
