Merge pull request 2i2c-org#4351 from sgibson91/budgetalerts
Add budget alerts based on forecasts
sgibson91 authored Jul 4, 2024
2 parents b97badf + 75cb922 commit 2d542b9
Showing 20 changed files with 220 additions and 33 deletions.
39 changes: 39 additions & 0 deletions docs/howto/budget-alerts.md
@@ -0,0 +1,39 @@
(howto:enable-budget-alerts)=
# Enable Budget Alerts

This document describes how to enable budget alerts for a cluster.

```{note}
This feature is currently only available on GCP!
```

## GCP

```{attention}
We can only enable budget alerting on GCP projects where we have enough permissions to enable APIs and view the billing account.
```

First, ensure the following APIs are enabled on the GCP project you'd like to enable budget alerting for:

- [Cloud Resource Manager API](https://console.cloud.google.com/apis/library/cloudresourcemanager.googleapis.com)
- [Cloud Billing Budget API](https://console.cloud.google.com/apis/library/billingbudgets.googleapis.com)

Then edit the following variables in the relevant `.tfvars` file for the cluster.

- **Set `budget_alert_enabled = true`**, or delete the variable altogether (it is set to `true` in the `variables.tf` file).
  This will ensure that the relevant resources will be created by Terraform.
- **Set `billing_account_id`.**
  This is the ID for the billing account linked to the project.
  - You can find the ID by visiting the [Billing console](https://console.cloud.google.com/billing/linkedaccount?project=two-eye-two-see), ensuring the correct project is selected in the dropdown at the top.
    In the dialogue box, click "Go to Linked Billing Account", and then click "Manage Billing Account" along the top.
    This will open a pane that gives you the Billing Account ID.
  - For accounts that we don't manage, the process is the same but _we may not have permission to view the Billing Account ID_.
    In this case, we cannot enable budget alerting for this project.
- **Set `budget_alert_amount`.**
  Current practice is to set this to the average expenditure of the last 3 months, plus 20%.
  You can find values to calculate that in the [Billing Reports console](https://console.cloud.google.com/billing/0157F7-E3EA8C-25AC3C/reports?organizationId=184174754493&project=two-eye-two-see).
  _Make sure you select only the project you are interested in from the Projects field in the Filters pane on the right side of the screen._
  - If you are setting this up for a new cluster, you obviously don't have this information yet!
    Instead, set the value to something like `500` and we can adjust as the community begins to use it.
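
As an illustration, the three variables might look like this in a cluster's `.tfvars` file. This is only a sketch: the billing account ID is a placeholder, and the monthly costs in the comments are made-up figures used to show the arithmetic.

```hcl
# Hypothetical example values; replace them with the real ones for your cluster.

# Defaults to true in variables.tf, so this line may be omitted.
budget_alert_enabled = true

# Placeholder, not a real billing account ID.
billing_account_id = "XXXXXX-XXXXXX-XXXXXX"

# Assumed costs for the last 3 months: 400, 450 and 500 USD.
#   Average:  (400 + 450 + 500) / 3 = 450
#   Plus 20%: 450 * 1.2             = 540
budget_alert_amount = "540"
```

The amount is written as a string because `budget_alert_amount` is declared with `type = string` in `variables.tf`.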

With these variables set, you are ready to open a PR and perform a terraform apply!
1 change: 1 addition & 0 deletions docs/index.md
@@ -72,6 +72,7 @@ howto/update-env.md
howto/upgrade-cluster/index.md
howto/troubleshoot/index.md
howto/regenerate-smce-creds.md
howto/budget-alerts
```

## Topic guides
35 changes: 35 additions & 0 deletions docs/topic/billing/budget-alerts.md
@@ -0,0 +1,35 @@
(topic:billing:budget-alerts)=
# Cloud Billing Budget Alerts

"I forgot to turn off my cloud resources, your honor" as a reason for declaring
bankruptcy is second only to "The US healthcare system sucks, your honor" in the
US court system. "How much is my cloud going to cost?" is a big anxiety for a lot
of our users, and hence us. We set up billing alerts to help deal with this anxiety.

See [](howto:enable-budget-alerts) for instructions on enabling this feature.

## When are the alerts triggered?

Budget alerts are sent under two conditions:

1. When *forecasted monthly spend* at the end of the month goes over our spending limit.
   This is an *early warning* system that helps us evaluate where spend is going
   and make sure this is expected.
2. When *current actual spend* goes over 100% of our spending limit.
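
To make the two thresholds concrete, here is a minimal Terraform sketch using a made-up budget of 500. The `locals` names are hypothetical and exist only for illustration. The multipliers mirror the `threshold_percent` values set in `terraform/gcp/budget-alerts.tf` (1.0 for current spend, 1.01 for forecasted spend).

```hcl
# Illustrative only: these locals are not part of the real configuration.
locals {
  # Assumed budget amount, in the budget's currency
  example_budget = 500

  # Level that month-to-date actual spend is compared against (100% of budget)
  current_spend_alert_at = local.example_budget * 1.0

  # Level that forecasted end-of-month spend is compared against (101% of budget)
  forecasted_spend_alert_at = local.example_budget * 1.01
}
```

With a budget of 500, the actual-spend alert corresponds to 500 and the forecast alert to 505. In other words, the forecast alert sits just above the budget itself.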

## What to do when we receive an alert?

The current goal is just to make sure we don't end up spending *wildly* more money
than budgeted. So if the forecasted spend busts through on day 5 of the month,
we might need to do something different than if it does on day 30. If it is expected
to overshoot by 500% vs by 10%, our actions might be different. One valid action is
to simply adjust the forecast. As an organization, we need more experience with costs
to figure out what the right thing to do is. So our current primary goal would
be to work with our stakeholders and gather that experience.

## Where are these alerts sent?

Budget alerts are "Cliff Alerts" - they don't indicate a current outage (unlike
uptime checks), but indicate that we are perhaps heading in a direction that will
cause problems soon if we do not course correct. Hence, we do not send them to
PagerDuty but to our `[email protected]` email address.
3 changes: 2 additions & 1 deletion docs/topic/billing/index.md
@@ -9,4 +9,5 @@ chargeable-resources
accounts
reports
tools
```
budget-alerts
```
4 changes: 2 additions & 2 deletions docs/topic/billing/tools.md
@@ -50,8 +50,8 @@ project is about to cost more than a specific amount, or is forecast to go over
This can be *extremely helpful* in assuaging communities' fears of cost overruns
but requires we have a prediction for *what numbers* to set these budgets at,
as well as what to do when the alerts fire. Usually, these alerts can be
set up manually in the UI or (preferably) via Terraform. We currently don't
utilize these, but we really should!
set up manually in the UI or (preferably) via Terraform. We currently only
utilise these on GCP, and more details can be found in [](topic:billing:budget-alerts).

More information: [GCP](https://cloud.google.com/billing/docs/how-to/budgets), [AWS](https://aws.amazon.com/aws-cost-management/aws-budgets/)
and [Azure](https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/cost-mgt-alerts-monitor-usage-spending).
2 changes: 1 addition & 1 deletion terraform/gcp/buckets.tf
@@ -15,7 +15,7 @@ resource "google_storage_bucket" "user_buckets" {
for_each = each.value.usage_logs ? [1] : []

content {
log_bucket = google_storage_bucket.usage_logs_bucket.name
log_bucket = google_storage_bucket.usage_logs_bucket[0].name
}
}

61 changes: 61 additions & 0 deletions terraform/gcp/budget-alerts.tf
@@ -0,0 +1,61 @@
# Alerts sent to [email protected] for things that *will go bad* in the future
# if left unattended. Should *not* be used for immediate outages

resource "google_monitoring_notification_channel" "support_email" {
  count        = var.budget_alert_enabled ? 1 : 0
  project      = var.project_id
  display_name = "[email protected] email"
  type         = "email"
  labels = {
    email_address = "[email protected]"
  }
}

data "google_project" "project" {
  project_id = var.project_id
}

# Need to explicitly enable https://console.cloud.google.com/apis/library/billingbudgets.googleapis.com?project=two-eye-two-see
resource "google_billing_budget" "budget" {
  count = var.budget_alert_enabled ? 1 : 0

  billing_account = var.billing_account_id
  display_name    = "Billing alert"

  budget_filter {
    # Use project number here, as project_name seems to be converted internally to number
    # If we don't do this, `terraform apply` is not clean
    # This is a bug in the google provider / budgets API https://github.com/hashicorp/terraform-provider-google/issues/8444
    projects               = ["projects/${data.google_project.project.number}"]
    credit_types_treatment = "INCLUDE_ALL_CREDITS"
  }

  amount {
    specified_amount {
      currency_code = var.budget_alert_currency
      units         = var.budget_alert_amount
    }
  }

  all_updates_rule {
    monitoring_notification_channels = [
      google_monitoring_notification_channel.support_email[0].id,
    ]
    disable_default_iam_recipients = true
  }
  # NOTE: These threshold_rules *MUST BE ORDERED BY threshold_percent* in ascending order!
  # If not, we'll run into https://github.com/hashicorp/terraform-provider-google/issues/8444
  # and terraform apply won't be clean.
  threshold_rules {
    # Alert when *actual* spend reaches 100% of budget
    threshold_percent = 1.0
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    # Alert when *forecasted* spend is about to blow over our budget
    # Adding the extra 1% to help terraform not redo this each time.
    threshold_percent = 1.01
    spend_basis       = "FORECASTED_SPEND"
  }

}
20 changes: 1 addition & 19 deletions terraform/gcp/main.tf
@@ -28,26 +28,8 @@ terraform {
}

provider "google" {
# This was configured without full understanding of the implications to
# resolve the following error:
#
# Error: Error when reading or editing BillingBudget "...": googleapi: Error 403: Your application has authenticated using end user credentials from the Google Cloud SDK or Google Cloud Shell which are not supported by the billingbudgets.googleapis.com. We recommend configuring the billing/quota_project setting in gcloud or using a service account through the auth/impersonate_service_account setting. For more information about service accounts and how to use them in your application, see https://cloud.google.com/docs/authentication/. If you are getting this error with curl or similar tools, you may need to specify 'X-Goog-User-Project' HTTP header for quota and billing purposes. For more information regarding 'X-Goog-User-Project' header, please check https://cloud.google.com/apis/docs/system-parameters.
#
# Configuration reference:
# https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference#user_project_override
#
# FIXME: Erik concluded that billing_project could be set to var.project_id at
# least for one cluster, but it required that the project where the
# cluster lived first enabled the GCP API: https://console.cloud.google.com/apis/library/cloudresourcemanager.googleapis.com
#
# So, we should probably not reference a new variable here, but enable
# the API for all our existing GCP projects and new GCP projects, and
# then reference var.project_id instead.
#
# But who knows, its hard to understand what's going on.
#
user_project_override = true
billing_project = var.billing_project_id
billing_project = var.project_id
}

data "google_client_config" "default" {}
4 changes: 4 additions & 0 deletions terraform/gcp/projects/2i2c-uk.tfvars
@@ -4,6 +4,10 @@ project_id = "two-eye-two-see-uk"
zone = "europe-west2-b"
region = "europe-west2"

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
budget_alert_amount = "830"
billing_account_id = "0157F7-E3EA8C-25AC3C"

k8s_versions = {
min_master_version : "1.29.1-gke.1589018",
core_nodes_version : "1.29.1-gke.1589018",
7 changes: 6 additions & 1 deletion terraform/gcp/projects/awi-ciroh-2.tfvars
@@ -5,9 +5,14 @@ region = "us-central1"
core_node_machine_type = "n2-highmem-4"
enable_network_policy = true
enable_filestore = true
filestore_capacity_gb = 2560
filestore_capacity_gb = 4915
enable_logging = false

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
# This value is calculated from the OLD cluster we used to manage
budget_alert_amount = "1986"
billing_account_id = "01C45D-1F6147-63E18E"

k8s_versions = {
min_master_version : "1.29.4-gke.1043002",
core_nodes_version : "1.29.4-gke.1043002",
3 changes: 3 additions & 0 deletions terraform/gcp/projects/awi-ciroh.tfvars
@@ -1,3 +1,6 @@
billing_account_id = "0157F7-E3EA8C-25AC3C"
budget_alert_amount = "800"

prefix = "awi-ciroh"
project_id = "awi-ciroh"
zone = "us-central1-b"
4 changes: 4 additions & 0 deletions terraform/gcp/projects/catalystproject-latam.tfvars
@@ -4,6 +4,10 @@ region = "southamerica-east1"
zone = "southamerica-east1-c"
enable_network_policy = true

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
budget_alert_amount = "1672"
billing_account_id = "0157F7-E3EA8C-25AC3C"

k8s_versions = {
min_master_version : "1.29.1-gke.1589018",
core_nodes_version : "1.29.1-gke.1589018",
5 changes: 5 additions & 0 deletions terraform/gcp/projects/cloudbank.tfvars
@@ -5,6 +5,11 @@ zone = "us-central1-b"
region = "us-central1"
regional_cluster = false

# We don't have enough access to enable this
budget_alert_enabled = false
billing_account_id = ""
budget_alert_amount = ""

k8s_versions = {
# NOTE: This isn't a regional cluster / highly available cluster, when
# upgrading the control plane, there will be ~5 minutes of k8s not being
5 changes: 5 additions & 0 deletions terraform/gcp/projects/cluster.tfvars.template
@@ -11,6 +11,11 @@ project_id = "{{ project_id }}"
zone = "{{ cluster_region }}"
region = "{{ cluster_region }}"

# Config required to enable automatic budget alerts to be sent to [email protected]
budget_alert_enabled = false
budget_alert_amount = ""
billing_account_id = ""

# TODO: Before applying this, identify a k8s version to specify. Pick the latest
# k8s version from GKE's regular release channel. Look at the output
# called `regular_channel_latest_k8s_versions` as seen when using
4 changes: 4 additions & 0 deletions terraform/gcp/projects/hhmi.tfvars
@@ -6,6 +6,10 @@ region = "us-west2"

core_node_machine_type = "n2-highmem-4"

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
budget_alert_amount = "797"
billing_account_id = "0157F7-E3EA8C-25AC3C"

k8s_versions = {
min_master_version : "1.29.1-gke.1589020",
core_nodes_version : "1.29.1-gke.1589020",
6 changes: 6 additions & 0 deletions terraform/gcp/projects/leap.tfvars
@@ -4,6 +4,12 @@ project_id = "leap-pangeo"
# prometheus requires more memory than a n2-highmem-2 can provide.
core_node_machine_type = "n2-highmem-4"

# The billing data for the community-LEAP-NSF account was zero for the last 3 months
# so choosing to disable this for this project
budget_alert_enabled = false
budget_alert_amount = ""
billing_account_id = "01A164-923D17-3199D9"

k8s_versions = {
min_master_version : "1.29.1-gke.1589018",
core_nodes_version : "1.29.1-gke.1589018",
4 changes: 4 additions & 0 deletions terraform/gcp/projects/linked-earth.tfvars
@@ -4,6 +4,10 @@ project_id = "linked-earth-hubs"
zone = "us-central1-c"
region = "us-central1"

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
budget_alert_amount = "540"
billing_account_id = "018C36-9A47B4-82AE21"

k8s_versions = {
min_master_version : "1.29.1-gke.1589018",
core_nodes_version : "1.29.1-gke.1589018",
7 changes: 6 additions & 1 deletion terraform/gcp/projects/pangeo-hubs.tfvars
@@ -20,12 +20,17 @@
#
prefix = "pangeo-hubs"
project_id = "pangeo-integration-te-3eea"
billing_project_id = "pangeo-integration-te-3eea"
zone = "us-central1-b"
region = "us-central1"
regional_cluster = false
core_node_machine_type = "n2-highmem-4"
enable_private_cluster = true
enable_logging = false

# We don't have enough rights to make billing alerts
budget_alert_enabled = false
budget_alert_amount = ""
billing_account_id = ""

k8s_versions = {
# NOTE: This isn't a regional cluster / highly available cluster, when
4 changes: 4 additions & 0 deletions terraform/gcp/projects/pilot-hubs.tfvars
@@ -5,6 +5,10 @@ zone = "us-central1-b"
region = "us-central1"
regional_cluster = false

# This is the average of total costs for Apr -> Jun 2024 +20% in USD
budget_alert_amount = "1880"
billing_account_id = "0157F7-E3EA8C-25AC3C"

k8s_versions = {
# NOTE: This isn't a regional cluster / highly available cluster, when
# upgrading the control plane, there will be ~5 minutes of k8s not being
35 changes: 27 additions & 8 deletions terraform/gcp/variables.tf
@@ -23,17 +23,36 @@ variable "project_id" {
EOT
}

variable "billing_project_id" {
variable "billing_account_id" {
type = string
default = "two-eye-two-see"
description = <<-EOT
This should be a GCP Project ID, not a GCP Billing Account ID as the name
indicates. It should be to a project that has a GCP API called Cloud Resource
Manager enabled. That can be enabled on a project via the link below:
https://console.cloud.google.com/apis/library/cloudresourcemanager.googleapis.com
ID of the billing account used for this project. Used to set up alerts
for budget forecasts.
EOT
}

variable "budget_alert_currency" {
type = string
default = "USD"
description = <<-EOT
Currency used for budget alerts.
EOT
}

What goes on here is confusing, see the comments about the confusion in main.tf
for more details.
variable "budget_alert_amount" {
type = string
description = <<-EOT
Amount of *forecasted spend* at which to send a billing alert. Current practice
is to set this to the average of the last 3 months' expenditure + 20%.
EOT
}

variable "budget_alert_enabled" {
type = bool
default = true
description = <<-EOT
Enable budget alerts. Disable in cases where we do not have enough permissions
on the billing account or cloud account to enable APIs.
EOT
}
