ITBench for Site Reliability Engineering (SRE) and Financial Operations (FinOps)

Paper | Incident Scenarios | Tools | Maintainers

Overview

ITBench uses open source technologies to create fully repeatable and reproducible scenarios on a Kubernetes platform. A scenario involves deploying a set of observability tools and a sample application, then triggering an incident (referred to as a task) in the environment.

[Figure: itbench_sre_task_scenario.png]

While this repository focuses on scenarios, an open-source Language Model (LM)-based SRE-Agent that aims to diagnose and remediate issues in these scenario environments can be found here.

Project Structure

This project uses Ansible to automate the deployment and undeployment of technologies to a Kubernetes cluster and the injection and removal of faults. The playbook run is configured using variables defined in group_vars.

Directory | Purpose
roles/observability_tools | Handles the deployment and removal of observability tools
roles/sample_applications | Handles the deployment and removal of sample applications
roles/fault_injection | Provides reusable fault injection mechanisms
roles/fault_removal | Provides mechanisms to remove (injected) faults from the environment
roles/incident_ | Contains scenarios that leverage the fault injection and removal mechanisms defined in the directories above

Recommended Software

macOS

Required Software

Installing Required Software via Homebrew (macOS)

  1. Install Homebrew

  2. Install required software

brew install helm
brew install kubectl
brew install python@3.12
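
To confirm the tools are installed and on your PATH, you can check their versions (exact output will vary):

helm version --short
kubectl version --client
python3.12 --version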

Getting Started – Deploying an Incident Scenario

Installing Dependencies

  1. Create a Python virtual environment
python3.12 -m venv venv
source venv/bin/activate
  2. Install Python dependencies
python -m pip install -r requirements.txt
  3. Install Ansible collections.
ansible-galaxy install -r requirements.yaml

Note: These steps only need to be done once, during initial setup.
Note: Depending on the kind of cluster setup needed, further dependencies may need to be installed. Please see the section below for details.
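
To sanity-check the installation, you can list what is now available in the virtual environment (exact versions depend on requirements.txt and requirements.yaml):

python -m pip list
ansible --version
ansible-galaxy collection list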

Cluster Setup

Local Cluster

For instructions on how to create a kind cluster, please see here.
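
The linked instructions are authoritative; as a rough sketch of the idea, a local kind cluster can be created and verified as follows (the cluster name itbench is just an example):

kind create cluster --name itbench
kind get kubeconfig --name itbench > ~/.kube/itbench-config
kubectl --kubeconfig ~/.kube/itbench-config get nodes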

Remote Cluster

For instructions on how to create a cloud-provider-based Kubernetes cluster, please see here.

Currently, only AWS is supported. AWS clusters are provisioned using kOps.
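
Again, the linked instructions are authoritative. For orientation only, a kOps-provisioned cluster generally follows a flow like the one below; the state store bucket, cluster name, and zone are placeholders:

export KOPS_STATE_STORE=s3://example-kops-state-store
kops create cluster --name itbench.k8s.local --zones us-east-1a --node-count 2
kops update cluster --name itbench.k8s.local --yes
kops validate cluster --wait 10m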

Running the Incident Scenarios

Now that the cluster is up and running, let's deploy the observability tools and application stack, inject a fault, and monitor the resulting alerts in the Grafana dashboard.

  1. Create the all.yaml file from the template and update the kubeconfig field with the path to your Kubernetes cluster's configuration. The file only needs to be created once, but the kubeconfig field must be updated whenever the file path or the cluster you intend to use changes.
cp group_vars/all.yaml.example group_vars/all.yaml
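
As a minimal sketch, the edited group_vars/all.yaml would contain a line along these lines (the path is a placeholder; keep the other fields from the template as they are):

# group_vars/all.yaml (excerpt; illustrative path only)
kubeconfig: /path/to/your/kubeconfig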
  2. Deploy the observability tools.
make deploy_observability_stack

The observability tools deployment includes Prometheus, Grafana, Loki, Elasticsearch, Jaeger, OpenSearch, and K8s-events-exporter. For additional details on the observability tools deployed, please head here.
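
Before moving on, it is worth confirming that the observability pods have come up. The exact namespaces depend on the playbooks, so a broad check such as the following is a reasonable sanity test:

kubectl get pods --all-namespaces | grep -Ei 'prometheus|grafana|loki|elasticsearch|jaeger|opensearch'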

  3. Deploy one of the sample applications. In this case we are deploying OpenTelemetry's Astronomy Shop Demo.
make deploy_astronomy_shop

Currently ITBench supports two sample applications: OpenTelemetry's Astronomy Shop Demo and DeathStarBench's Hotel Reservation. For additional details on the sample applications, please head here.
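
The next step assumes all pods are Running; you can watch them come up across namespaces with:

kubectl get pods --all-namespaces --watch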

  4. Once all pods are running, inject the fault for an incident.
INCIDENT_NUMBER=1 make inject_incident_fault

Currently the open-sourced incident scenarios are incidents 1, 3, 23, 26, 27, and 102. Any of these incidents can be used at this point in your own environment. Additional details on the incident scenarios themselves and the fault mechanisms can be found here.
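
The INCIDENT_NUMBER variable selects which scenario's fault is injected, so any of the open-sourced incidents can be triggered the same way, for example:

INCIDENT_NUMBER=3 make inject_incident_fault
INCIDENT_NUMBER=23 make inject_incident_fault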

  5. After fault injection, to view alerts in the Grafana dashboard, use port forwarding to access the Grafana service.
kubectl port-forward svc/ingress-nginx-controller -n ingress-nginx 8080:80
  6. To view the Grafana dashboard in your web browser, use the following URL:
http://localhost:8080/prometheus/alerting/list
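
If the page does not load, a quick way to confirm that the port-forward is working and that Grafana is reachable is to query its health endpoint from the command line; /api/health is a standard Grafana endpoint, though the /prometheus sub-path here reflects this particular deployment:

curl http://localhost:8080/prometheus/api/health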
  7. In the right panel, under the Grafana section, click on the AstronomyNotifications folder to view the alerts on the dashboard. Four alerts are defined:
  • To track errors across the different services
  • To track latency across the different services
  • To track the status of deployments across the different namespaces
  • To track Kafka connection status across the Kafka-related components

An alert's default State is Normal. A few minutes after fault injection, the State changes to Firing, indicating that the fault has manifested. The alert definitions for Grafana are located here and have been curated using this guide.
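
Alert state can also be inspected without the UI. Grafana's built-in Alertmanager exposes an Alertmanager-compatible API, so with the same port-forward in place something like the following lists currently firing alerts (credentials may be required depending on how Grafana is configured in this stack):

# add -u <user>:<password> if anonymous access is not enabled
curl -s http://localhost:8080/prometheus/api/alertmanager/grafana/api/v2/alerts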
  8. (Optional) You only need to do this if you plan to leverage our SRE-Agent. Port forward the topology mapper service by running:
kubectl -n kube-system port-forward svc/topology-monitor 8081:8080
  9. (Optional) You only need to do this if you plan to leverage our SRE-Agent. Use the values below for the .env.tmpl:
GRAFANA_URL=http://localhost:8080/prometheus
TOPOLOGY_URL=http://localhost:8081
  10. To remove the injected fault, run the following make command:
INCIDENT_NUMBER=1 make remove_incident_fault

After executing the command, the alert's State should change back from Firing to Normal, indicating that the fault has been removed.

  11. Once done, you can undeploy the application stack, followed by the observability stack, by running:
make undeploy_astronomy_shop
make undeploy_observability_stack

Note: For a full list of make commands, run the following command:

make help

Maintainers