ITBench for Site Reliability Engineering (SRE) and Financial Operations (FinOps)

Paper | Incident Scenarios | Tools | Maintainers

Overview

ITBench uses open source technologies to create fully repeatable and reproducible scenarios on a Kubernetes platform. A scenario involves deploying a set of observability tools and a sample application, then triggering an incident (referred to as a task) in the environment.

[Figure: itbench_sre_task_scenario.png]

While this repository focuses on scenarios, an open-source Language Model (LM)-based SRE-Agent that aims to diagnose and remediate issues in these scenario environments can be found here.

Project Structure

This project uses Ansible to automate the deployment and undeployment of technologies to a Kubernetes cluster and the injection and removal of faults. The playbook run is configured using variables defined in group_vars.

Directory | Purpose
roles/observability_tools | Handles the deployment and removal of observability tools
roles/sample_applications | Handles the deployment and removal of sample applications
roles/fault_injection | Provides reusable fault injection mechanisms
roles/fault_removal | Provides mechanisms to remove (injected) faults from the environment
roles/incident_ | Contains scenarios that leverage the fault injection and removal mechanisms defined in the directories above

Recommended Software

macOS

Required Software

Installing Required Software via Homebrew (macOS)

  1. Install Homebrew

  2. Install required software

brew install helm
brew install kubectl
brew install python@3.12
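
To confirm the tools are installed and on your PATH, you can check their versions (exact output will vary):

helm version --short
kubectl version --client
python3.12 --version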

Getting Started – Deploying an Incident Scenario

Installing Dependencies

  1. Create a Python virtual environment
python3.12 -m venv venv
source venv/bin/activate
  2. Install Python dependencies
python -m pip install -r requirements.txt
  3. Install Ansible collections.
ansible-galaxy install -r requirements.yaml

Note: These steps only need to be done once, during initial setup.
Note: Depending on the kind of cluster setup needed, further dependencies may need to be installed. Please see the section below for details.
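
To sanity-check the installation, you can list what is now available in the virtual environment (exact versions depend on requirements.txt and requirements.yaml):

python -m pip list
ansible --version
ansible-galaxy collection list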

Cluster Setup

Local Cluster

For instructions on how to create a kind cluster, please see here.
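
The linked instructions are authoritative; as a rough sketch of the idea, a local kind cluster can be created and verified as follows (the cluster name itbench is just an example):

kind create cluster --name itbench
kind get kubeconfig --name itbench > ~/.kube/itbench-config
kubectl --kubeconfig ~/.kube/itbench-config get nodes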

Remote Cluster

For instructions on how to create a cloud-provider-based Kubernetes cluster, please see here.

Currently, only AWS is supported. AWS clusters are provisioned using kOps.
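
Again, the linked instructions are authoritative. For orientation only, a kOps-provisioned cluster generally follows a flow like the one below; the state store bucket, cluster name, and zone are placeholders:

export KOPS_STATE_STORE=s3://example-kops-state-store
kops create cluster --name itbench.k8s.local --zones us-east-1a --node-count 2
kops update cluster --name itbench.k8s.local --yes
kops validate cluster --wait 10m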

Running the Incident Scenarios

Now that the cluster is up and running, let's deploy the observability tools and application stack, inject a fault, and monitor the resulting alerts in the Grafana dashboard.

  1. Create the all.yaml file from the template and update the kubeconfig field with the path to your Kubernetes cluster's configuration. The file only needs to be created once, but the kubeconfig field must be updated whenever the file path or the cluster you intend to use changes.
cp group_vars/all.yaml.example group_vars/all.yaml
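
As a minimal sketch, the edited group_vars/all.yaml would contain a line along these lines (the path is a placeholder; keep the other fields from the template as they are):

# group_vars/all.yaml (excerpt; illustrative path only)
kubeconfig: /path/to/your/kubeconfig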
  2. Deploy the observability tools.
make deploy_observability_stack

The observability tools deployment includes Prometheus, Grafana, Loki, Elasticsearch, Jaeger, OpenSearch, and K8s-events-exporter. For additional details on the observability tools deployed, please head here.
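
Before moving on, it is worth confirming that the observability pods have come up. The exact namespaces depend on the playbooks, so a broad check such as the following is a reasonable sanity test:

kubectl get pods --all-namespaces | grep -Ei 'prometheus|grafana|loki|elasticsearch|jaeger|opensearch'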

  3. Deploy one of the sample applications. In this case we are deploying OpenTelemetry's Astronomy Shop Demo.
make deploy_astronomy_shop

Currently ITBench supports two sample applications: OpenTelemetry's Astronomy Shop Demo and DeathStarBench's Hotel Reservation. For additional details on the sample applications, please head here.
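
The next step assumes all pods are Running; you can watch them come up across namespaces with:

kubectl get pods --all-namespaces --watch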

  4. Once all pods are running, inject the fault for an incident.
INCIDENT_NUMBER=1 make inject_incident_fault

Currently the open-sourced incident scenarios are incidents 1, 3, 23, 26, 27, and 102. Any of these incidents can be used at this point in your own environment. Additional details on the incident scenarios themselves and the fault mechanisms can be found here.
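
The INCIDENT_NUMBER variable selects which scenario's fault is injected, so any of the open-sourced incidents can be triggered the same way, for example:

INCIDENT_NUMBER=3 make inject_incident_fault
INCIDENT_NUMBER=23 make inject_incident_fault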

  5. After fault injection, to view alerts in the Grafana dashboard, use port forwarding to access the Grafana service.
kubectl port-forward svc/ingress-nginx-controller -n ingress-nginx 8080:80
  6. To view the Grafana dashboard in your web browser, use the following URL:
http://localhost:8080/prometheus/alerting/list
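
If the page does not load, a quick way to confirm that the port-forward is working and that Grafana is reachable is to query its health endpoint from the command line; /api/health is a standard Grafana endpoint, though the /prometheus sub-path here reflects this particular deployment:

curl http://localhost:8080/prometheus/api/health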
  7. In the right panel, under the Grafana section, click on the AstronomyNotifications folder to view the alerts on the dashboard. Four alerts are defined:
  • To track errors across the different services
  • To track latency across the different services
  • To track the status of deployments across the different namespaces
  • To track Kafka connection status across the Kafka-related components

An alert's default State is Normal. A few minutes after fault injection, the State changes to Firing, indicating that the fault has manifested. The alert definitions for Grafana are located here and have been curated using this guide.
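
Alert state can also be inspected without the UI. Grafana's built-in Alertmanager exposes an Alertmanager-compatible API, so with the same port-forward in place something like the following lists currently firing alerts (credentials may be required depending on how Grafana is configured in this stack):

# add -u <user>:<password> if anonymous access is not enabled
curl -s http://localhost:8080/prometheus/api/alertmanager/grafana/api/v2/alerts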
  8. (Optional) You only need to do this if you plan to leverage our SRE-Agent. Port forward the topology mapper service by running:
kubectl -n kube-system port-forward svc/topology-monitor 8081:8080
  9. (Optional) You only need to do this if you plan to leverage our SRE-Agent. Use the values below for the .env.tmpl:
GRAFANA_URL=http://localhost:8080/prometheus
TOPOLOGY_URL=http://localhost:8081
  10. To remove the injected fault, run the following make command:
INCIDENT_NUMBER=1 make remove_incident_fault

After executing the command, the alert's State should change back from Firing to Normal, indicating that the fault has been removed.

  11. Once done, you can undeploy the application stack, followed by the observability stack, by running:
make undeploy_astronomy_shop
make undeploy_observability_stack

Note: For a full list of make commands, run the following command:

make help

Maintainers