ITBench uses open-source technologies to create completely repeatable and reproducible scenarios on a Kubernetes platform. A scenario involves deploying a set of observability tools and a sample application, then triggering an incident (referred to as a task) in the environment.
While this repository focuses on scenarios, an open-source Language Model (LM)-based SRE-Agent that aims to diagnose and remediate issues in these scenario environments can be found here.
This project uses Ansible to automate the deployment and undeployment of technologies to a Kubernetes cluster and the injection and removal of faults.
The playbook run is configured using variables defined in `group_vars`.
| Directory | Purpose |
|---|---|
| `roles/observability_tools` | Handles the deployment and removal of observability tools |
| `roles/sample_applications` | Handles the deployment and removal of sample applications |
| `roles/fault_injection` | Provides reusable fault injection mechanisms |
| `roles/fault_removal` | Provides mechanisms to remove (injected) faults from the environment |
| `roles/incident_` | Contains scenarios that leverage the fault injection and removal mechanisms defined in the directories above |
- Install Homebrew
- Install required software:

  ```shell
  brew install helm
  brew install kubectl
  brew install python@3.12
  ```

- Create a Python virtual environment:

  ```shell
  python3.12 -m venv venv
  source venv/bin/activate
  ```

- Install Python dependencies:

  ```shell
  python -m pip install -r requirements.txt
  ```

- Install Ansible collections:

  ```shell
  ansible-galaxy install -r requirements.yaml
  ```
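Before proceeding, it can help to confirm that the required commands are actually on your `PATH`. The helper below is a convenience sketch, not part of this repo's tooling; the function name is our own.

```shell
# Convenience sketch (not part of this repo): report any required
# tool that is missing from PATH before starting the setup.
check_tools() {
  missing=0
  for tool in "$@"; do
    # command -v succeeds only if the tool resolves on PATH
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; missing=1; }
  done
  return "$missing"
}

check_tools helm kubectl python3.12 ansible-galaxy || echo "install the missing tools before continuing"
```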
Note: These steps only need to be done once, during the initial setup.

Note: Depending on the kind of cluster setup needed, further dependencies may need to be installed. Please see the section below for details.
For instructions on how to create a kind cluster, please see the instructions here.
For instructions on how to create a cloud-provider-based Kubernetes cluster, please see the instructions here. Currently, only AWS is supported. AWS clusters are provisioned using kOps.
Now that the cluster is up and running, let's proceed with deploying the observability tools and application stack, injecting the fault, and monitoring alerts in the Grafana dashboard.
- Create the `all.yaml` file from the template and update the `kubeconfig` field with the path to the configuration of the Kubernetes cluster. While the file only needs to be created once, the `kubeconfig` field must be updated whenever the file path changes or you intend to leverage a different cluster.

  ```shell
  cp group_vars/all.yaml.example group_vars/all.yaml
  ```
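For orientation, an illustrative `group_vars/all.yaml` entry might look like the fragment below. Only the `kubeconfig` field is described in this guide; the path shown is a placeholder, and the real template may contain additional fields.

```yaml
# group_vars/all.yaml (illustrative fragment; the path is a placeholder,
# and the real template may define further variables)
kubeconfig: /absolute/path/to/your/kubeconfig
```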
- Deploy the observability tools.

  ```shell
  make deploy_observability_stack
  ```

  The observability tools deployment includes Prometheus, Grafana, Loki, Elasticsearch, Jaeger, OpenSearch, and K8s-events-exporter. For additional details on the observability tools deployed, please head here.
- Deploy one of the sample applications. In this case, we are deploying OpenTelemetry's Astronomy Shop Demo.

  ```shell
  make deploy_astronomy_shop
  ```

  Currently, IT-Bench supports two sample applications: OpenTelemetry's Astronomy Shop Demo and DeathStarBench's Hotel Reservation. For additional details on the sample applications, please head here.
- Once all pods are running, inject the fault for an incident.

  ```shell
  INCIDENT_NUMBER=1 make inject_incident_fault
  ```

  The incident scenarios currently open-sourced are incidents 1, 3, 23, 26, 27, and 102. Any one of these incidents can be used at this point in your own environment. Additional details on the incident scenarios themselves and the fault mechanisms can be found here.
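Since only a handful of incident numbers are currently open-sourced, a small wrapper can guard against typos before invoking `make`. This is a hypothetical helper, not part of the repo's Makefile; the function name is our own, and the validation list mirrors the incidents named above.

```shell
# Hypothetical helper (not part of the repo's Makefile): validate the
# incident number against the currently open-sourced scenarios before
# delegating to the real make target.
inject_incident() {
  case "$1" in
    1|3|23|26|27|102)
      INCIDENT_NUMBER="$1" make inject_incident_fault
      ;;
    *)
      echo "unsupported incident: $1 (expected 1, 3, 23, 26, 27, or 102)" >&2
      return 1
      ;;
  esac
}
```

Usage: `inject_incident 3` runs the make target; `inject_incident 999` prints an error and returns a nonzero status.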
- After fault injection, to view alerts in the Grafana dashboard, use port forwarding to access the Grafana service.

  ```shell
  kubectl port-forward svc/ingress-nginx-controller -n ingress-nginx 8080:80
  ```
- To view the Grafana dashboard in your web browser, use the following URL:

  http://localhost:8080/prometheus/alerting/list
- In the right panel, under the `Grafana` section, click on the `AstronomyNotifications` folder to view the alerts on the dashboard. Four alerts are defined:
  - To track `error` across the different services
  - To track `latency` across the different services
  - To track the status of deployments across the different namespaces
  - To track Kafka connection status across the Kafka-related components
An alert's default `State` is `Normal`. A few minutes after fault injection, the `State` changes to `Firing`, indicating that the fault has manifested. The alert definitions for Grafana are located here and have been curated using this guide.
- (Optional) You only need to do this if you plan to leverage our SRE-Agent. Port forward the topology mapper service by running:

  ```shell
  kubectl -n kube-system port-forward svc/topology-monitor 8081:8080
  ```
- (Optional) You only need to do this if you plan to leverage our SRE-Agent. Leverage the values below for the `.env.tmpl` file:

  ```shell
  GRAFANA_URL=http://localhost:8080/prometheus
  TOPOLOGY_URL=http://localhost:8081
  ```
- To remove the injected fault, run the following `make` command:

  ```shell
  INCIDENT_NUMBER=1 make remove_incident_fault
  ```

  After executing the command, the alert's `State` should change back from `Firing` to `Normal`, indicating that the fault has been removed.
- Once done, you can undeploy the application stack, followed by the observability stack, by running:

  ```shell
  make undeploy_astronomy_shop
  make undeploy_observability_stack
  ```
Note: For a full list of `make` commands, run the following command:

```shell
make help
```
- Mudit Verma - @mudverma
- Divya Pathak - @divyapathak24
- Felix George - @fali007
- Ting Dai - @tingdai
- Gerard Vanloo - @Red-GV
- Bekir O Turkkan - @bekiroguzhan