Exceeding NWM API budget for subsetting NWM medium-range forecasts to CAMELS and HEFS basin COMIDs #257
@joshsturtevant Based on https://cloud.google.com/bigquery/pricing#on_demand_pricing,
@benlee0423, we can certainly optimize our workflows to only query this dataset once. Also, I believe my initial estimate of 200 TB for the total size of the 2018-2024 NWM medium-range forecast archive is an upper limit, as it extrapolates from the larger NWM3.0 file sizes (hourly, 7-member forecasts, in contrast to the earlier NWM2.0 forecasts, which are only 3-hourly, single-member runs).
@joshsturtevant It seems that the pricing model described in the official documentation is somewhat misleading. Specifically, the scenario I described above does not align with their stated model. The charges are based on the actual queries executed by the SELECT statements in the Python source code in this repository. To better estimate costs and avoid a large charge, I recommend testing with a smaller subset of requests, such as 1/50 or 1/100 of the original size. This will let us project the full cost more accurately before incurring a significant expense. If you decide to proceed with this approach, please let me know in advance. Note that the associated costs are not reported in real time; we only receive the billing data a day later. I hope this clarifies the situation. Please let me know if you have any questions or need further assistance.
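For illustration, here is a minimal sketch of one way to pre-estimate scan volume before paying for a query, using BigQuery's dry-run mode (a standard client feature, not necessarily how this repository's code is structured; the table name, column names, and COMIDs below are placeholders):

```python
from google.cloud import bigquery

# Placeholder table and COMIDs, for illustration only.
TABLE = "my-project.nwm.medium_range_streamflow"  # hypothetical table name
COMIDS = (101, 202, 303)                          # hypothetical COMIDs

sql = f"""
SELECT reference_time, time, feature_id, streamflow
FROM `{TABLE}`
WHERE feature_id IN {COMIDS}
"""

client = bigquery.Client()
# dry_run=True validates the query and reports bytes scanned without executing it (no charge).
job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False))

tb_scanned = job.total_bytes_processed / 1e12
print(f"Estimated scan: {tb_scanned:.2f} TB (~${tb_scanned * 5:.2f} at $5/TB on-demand)")
```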
@benlee0423, ah, I think I now understand what you were originally asking. Thanks for clarifying. Before I hit the 500 server error, we were able to download a total of 1,628 NWM medium-range forecasts. Knowing that this cost about $400, we should now have enough info to estimate the total cost for this request.

The Google NWM operational medium-range forecast archive is available from 2018-09-17 to yesterday, with four forecast cycles per day (00, 06, 12, and 18z). If we want to download and process all cycles from 2018-10-01 to 2024-09-30 (six water years), this works out to 8,768 forecast initializations. Assuming the costs scale linearly from our initial $400 charge on Jan 13, we estimate the total cost would be about $2,150 (since it cost $400 to process & download 18.6% of the dataset of interest).

Note that this initial request only subset the CAMELS basin COMIDs -- to limit our queries, I think we would want to rerun this from the top to subset out both the CAMELS basins and the HEFS testbed basins, since we anticipate wanting NWM forecasts across both domains.

Lastly, I will note that this calculation might underestimate the final cost. The later part of the record (NWM3.0) has higher temporal resolution and more ensemble members than the earlier part of the record (NWM2.1). Our Jan 13th download was a parallel request starting at the beginning of each year (2018-2024), but the final 1.6k files are definitely skewed towards the earlier part of the record.

@andywood, including you here since you are the project PI.
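To make the arithmetic explicit, a short worked version of this extrapolation using only the dates and figures quoted above (the day count and linear-scaling assumption are spelled out here for checking, not new data):

```python
# Linear cost extrapolation from the partial download described above.
downloaded = 1_628          # forecasts retrieved before the 500 error
cost_so_far = 400.0         # USD charged on Jan 13

days = 2_192                # 2018-10-01 through 2024-09-30 (six water years)
cycles_per_day = 4          # 00, 06, 12, 18z
total_inits = days * cycles_per_day          # 8,768 forecast initializations

fraction_done = downloaded / total_inits     # ~0.186 (18.6%)
estimated_total = cost_so_far / fraction_done

print(f"{total_inits} initializations, {fraction_done:.1%} downloaded")
print(f"Estimated total cost: ${estimated_total:,.0f}")   # ~$2,150
```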
@arpita0911patel Including for the conversation. Please let us know if the estimate of $2,150 is reasonable and whether we can proceed with the approval.
Thanks for the estimation, Josh. One comment I would add is that the resulting extracted datasets will benefit not only our projects, but all CIROH projects that are working on actual streamflow forecasts and wish to compare against a benchmark of the recent NWM operational forecasts. Of course, CAMELS and the HEFS Testbed (about 200 basins) are a small subset of the potential settings of different research projects, but they are a commonly used subset.

We will process and publish the resulting resources on HydroShare in some form so that they're available for CIROH.

Lastly, my projects can also help pay for it if the UA Cy-Inf budgets cannot (particularly the Testbed project, for which assembling operational benchmarks is a major task). Happy to chat about it.

Cheers,
Andy
Thank you for submitting this request. I'll get back to you on this today after checking with management. Thank you,
ping @KMarkert @joshsturtevant -- to confirm: you are trying to create a dataset with the time series of flow outputs from the NWM forecasts for the CAMELS and HEFS (and presumably other HARBOR) datasets, correct? Will you also need to collect the forcings from these forecasts? (These are not retained in the BigQuery database...)
Hi @jameshalgren, correct. We are really only interested in the streamflow variable of the NWM medium-range forecasts, and do not need the NWM forcings.
One other comment: good question about other sites, e.g., HARBOR. I expect that we may want additional sites at some point (not right away), but it would be nothing like the contents of HARBOR (19K+ gages).

@josh, I would include the short-range forecasts as well while we're costing it out. Those will be another good benchmark.

Cheers,
Andy
What if we took advantage of the parquet files that are an intermediate resource? Alternatively, if we are looking at a bulk download, it might be easier to query the tables with SQL directly: https://goo.gle/nwm-on-bq
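For concreteness, a sketch of what extracting all COMIDs in a single SQL pass could look like (the table name, column names, and COMIDs are illustrative assumptions, not the actual schema behind the link above):

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table/column names; the real schema is documented at https://goo.gle/nwm-on-bq
sql = """
SELECT reference_time, time, feature_id, streamflow
FROM `my-project.nwm.medium_range_streamflow`
WHERE feature_id IN UNNEST(@comids)
  AND reference_time BETWEEN @start AND @end
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("comids", "INT64", [101, 202, 303]),  # hypothetical COMIDs
        bigquery.ScalarQueryParameter("start", "TIMESTAMP", datetime(2018, 10, 1, tzinfo=timezone.utc)),
        bigquery.ScalarQueryParameter("end", "TIMESTAMP", datetime(2024, 9, 30, 18, tzinfo=timezone.utc)),
    ]
)

# One scan over the table instead of one query per forecast initialization.
df = client.query(sql, job_config=job_config).to_dataframe()
df.to_parquet("nwm_medium_range_subset.parquet")
```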
Thanks for the brainstorming, @jameshalgren. From what I can tell, it seems the parquet files are only available through 2020-01-16. We actually started down the bulk download path initially, but quickly realized that downloading hundreds of TB of NWM forecast data to then subset out only tens of GB was highly inefficient, given the availability of the NWM API for exactly these sorts of tasks (albeit smaller requests, perhaps). If @andywood is willing to have our project cover the associated BQ costs, is AWI willing to temporarily lift our API quota for this one-time request? We are happy to submit the job during off-peak hours (e.g., weekends and/or evenings) to avoid inconveniencing other users.
Perhaps we can move this chat to a Slack channel? Some things are unclear about the options. If an SQL query can extract site data efficiently (i.e., not one query per forecast, which would run into millions of queries), that could be an option. If the parquet files are set up right -- perhaps one file per station (since they are 2D), with dimensions of M initializations vs. N lead times -- that might be efficient, though going farther back than 2020 is preferable.

I can cover the download/extraction costs from the Testbed project if this isn't justified by the Cyb-Inf proj. scope.

Cheers,
Andy
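A small sketch of the per-station layout described above (initializations as rows, lead times as columns), using made-up values just to show the shape; the file and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical example: one 2D table per station, rows = forecast initializations,
# columns = lead times, values = forecast streamflow for that station's COMID.
inits = pd.date_range("2018-10-01", periods=4, freq="6h")   # M initializations
lead_hours = [3, 6, 9, 12]                                   # N lead times (subset)

rng = np.random.default_rng(0)
table = pd.DataFrame(
    rng.uniform(1, 100, size=(len(inits), len(lead_hours))),  # fake streamflow values
    index=pd.Index(inits, name="reference_time"),
    columns=pd.Index([f"lead_{h:03d}h" for h in lead_hours], name="lead_time"),
)

# One parquet file per station/COMID (requires pyarrow or fastparquet).
table.to_parquet("comid_101.parquet")
```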
@joshsturtevant - @benlee0423 has added alerts for the NWM BigQuery API. Let's coordinate on Slack when you are ready to use it.
1. Requester Information:
Josh Sturtevant ([email protected]; PhD student)
Andy Wood ([email protected]; Project PI)
2. Link to Existing Infrastructure Ticket:
n/a
3. Justification for Increased Budget:
A core tenet of the CIROH Hydrologic Prediction Testbed (CHPT) project is the development of benchmarks against which to compare CIROH modeling experiments, for the purposes of understanding relative performance and measuring incremental progress. One key operational reference capability (or benchmark) in the CHPT is the NWM. Nearly all published hydrologic model benchmarking studies to date evaluate retrospective NWM performance, not the operational medium-range streamflow forecasts (as archived on Google Cloud). In support of improved R2O, this project proposes to evaluate the NWM medium-range forecasts across a subset of important basins in the US (CAMELS and HEFS testbed basins). This request for an increased NWM API budget is in support of developing these standards and benchmarks as part of community protocols within the CHPT.
4. Resource Requirements:
Increased quota for NWM API.
Cloud Provider (AWS/Azure/GCP): GCP (Google BigQuery via the NWM API)
Required Services in the Cloud: n/a
5. Timeline:
January 2025
6. Security and Compliance Requirements:
n/a
7. Cost Estimation:
We estimate that the medium-range forecasts on Google BigQuery total ~200 TB (across all ensemble members, forecast lead times, initializations, and forecast cycles). At a cost of $5/TB, this is about $1,000 to query this data. Since we are subsetting the 2.7 million NWM NHDplusv2 reaches down to <1,000 reaches (i.e., COMIDs), the egress data will be only in the tens of GB, so those costs should be quite minimal (<$100). No data storage is needed.
We anticipate needing to query and subset the data twice: once for the CAMELS basins, and a second time for the HEFS testbed basins. As a result, we estimate that the costs will be a maximum of ~$2,500 (including additional budget for testing/verifying). This work will be a one-time data request in January, with subsequent requests expected to be well below the $500/month budget.
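A quick check of that arithmetic (a sketch; the ~200 TB archive size, $5/TB on-demand rate, two passes, and the buffer for egress and testing are the estimates stated above):

```python
# Rough cost check using the estimates above.
archive_tb = 200            # estimated size of the medium-range forecast archive
price_per_tb = 5.0          # assumed on-demand query price, USD/TB
passes = 2                  # one pass for CAMELS, one for HEFS testbed basins

query_cost = archive_tb * price_per_tb * passes      # $2,000
egress_and_testing_buffer = 500                      # padding for egress + test queries
print(f"Estimated maximum: ${query_cost + egress_and_testing_buffer:,.0f}")  # ~$2,500
```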
8. Approval:
Once the form is submitted, we will email CIROH management to get the approval.
9. Project to Charge To: (For CIROH IT Admin to fill out based on approval process)
Indicate the necessary approval processes or sign-offs required for the request.