Merge pull request #15 from infracloudio/f/docs
Overhaul of docs
saurabh3460 authored Nov 6, 2024
2 parents 11c2ffd + 2845852 commit 703d3f8
Showing 4 changed files with 264 additions and 22 deletions.
35 changes: 13 additions & 22 deletions README.md
@@ -8,32 +8,23 @@
</p>
<a href='https://codespaces.new/runwhen-contrib/codecollection-template?quickstart=1'><img src='https://github.com/codespaces/badge.svg' alt='Open in GitHub Codespaces' style='max-width: 100%;'></a>


# codecollection-template
A hello-world-style template for codecollection authors to get started writing codebundles. This template contains the minimum file structure expected by the RunWhen platform.

[![Build](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml/badge.svg)](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml)

## Getting Started
Looking to be a contributor for CodeCollections or start your own? We'd love to collaborate! Head on over to our [public docs](https://docs.runwhen.com/public/runwhen-authors/getting-started-with-codecollection-development) to get started.

File Structure overview of devcontainer:
```
-/app/
|- auth/ #store secrets here, it should already be properly gitignored for you
|- codecollection/
| |- codebundles/ # stores codebundles that can be run
| |- libraries/ # stores python keyword libraries used by codebundles
|- dev_facade/ # provides interfaces equivalent to those used on the platform, but just dry runs the keywords to assist with development
...
```
[Upstream Docs - CodeCollection Template](https://github.com/runwhen-contrib/codecollection-template/blob/main/README.md)

The included script `ro` wraps the `robot` RobotFramework binary, and includes some extra functionality to write logs to a consistent location for viewing in a HTTP server at http://localhost:3000/ that is always running as part of the devcontainer.
# InfraCloud RunWhen CodeCollection

### Quickstart
This CodeCollection aims to create a repository of CodeBundles that address the various reproducible incident scenarios at [Infracloud/sre-stack](https://github.com/infracloudio/sre-stack/).

Navigate to the codebundle directory
`cd codecollection/codebundles/hello_world/`
- Set meaningful SLOs on Services and their dependencies
  - DBs
  - Queues
  - Caches
  - Gateways and proxies
- Create SLIs to continuously monitor the health of services and dependencies
- Create mitigation runbooks for scenarios where the root cause can be determined deterministically

Run the codebundle
`ro sli.robot`
## Additional Docs
- [RunWhen Concepts](docs/runwhen/concepts.md)
- [Contributing to CodeCollections/CodeBundles](docs/runwhen/contrib.md)
94 changes: 94 additions & 0 deletions codebundles/rds-mysql-conn-count/README.md
@@ -0,0 +1,94 @@
# CodeBundle - RDS MySQL Connection Count

This codebundle aims to detect and resolve incidents caused by too many sleeping connections in MySQL.

- Target Service - MySQL
- Cloud Platform - AWS/RDS

## SLX
```YAML
statement: RDS MySql connections should be within 80% of total max connection.
alias: RDS MySql Connections Count
metricType: gauge
asMeasuredBy: Score based on Prometheus query
icon: Cloud
owners:
  - [email protected]
imageURL: >-
  https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/ns.svg
```
## SLO / Service Level Objective
Example:
```YAML
codeBundle:
  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
  pathToYaml: codebundles/slo-default/queries.yaml
  ref: main
sloSpecType: simple-mwmb
objective: 95
threshold: 48
operand: lt
```
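
To make `threshold`, `operand`, and `objective` concrete, here is a minimal Python sketch of one plausible reading: a sample counts as good when it compares against `threshold` using `operand`, and `objective` is the percentage of good samples required. This is an assumption for illustration only, not the platform's `simple-mwmb` evaluation. The names `is_good` and `compliance` are hypothetical, and treating 48 as 80% of a `max_connections` of 60 is likewise assumed.

```python
import operator

# Hypothetical mapping from the `operand` field to a comparison (assumed semantics).
OPERANDS = {"lt": operator.lt, "gt": operator.gt, "eq": operator.eq}


def is_good(value: float, threshold: float, operand: str) -> bool:
    """Return True when a single SLI sample satisfies the threshold."""
    return OPERANDS[operand](value, threshold)


def compliance(samples: list[float], threshold: float, operand: str) -> float:
    """Percentage of samples that satisfy the threshold."""
    if not samples:
        return 0.0
    good = sum(1 for s in samples if is_good(s, threshold, operand))
    return 100.0 * good / len(samples)


# Assumed example: with max_connections = 60, 80% of it gives the threshold 48.
samples = [12, 30, 47, 48, 55, 20, 10, 5, 33, 41]
pct = compliance(samples, threshold=48, operand="lt")
print(f"{pct:.1f}% good, objective met: {pct >= 95}")
```

With `operand: lt`, a sample exactly at the threshold is not good, so a connection count of 48 already counts against the objective in this reading.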
## SLI / Service Level Indicator
```YAML
displayUnitsLong: OK
displayUnitsShort: ok
locations:
  - location-01-us-west1
description: >-
  Watch RDS MySql connection count
codeBundle:
  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
  ref: main
  pathToRobot: codebundles/rds-mysql-conn-count/sli.robot
# read more about intervalStrategy here: https://docs.runwhen.com/public/runwhen-platform/feature-overview/points-on-the-map-slxs/service-level-indicators-slis/interval-strategies
intervalStrategy: intermezzo
intervalSeconds: 30
configProvided:
  # Change PROMETHEUS_HOSTNAME to your endpoint; currently the endpoint needs to be publicly exposed.
  - name: PROMETHEUS_HOSTNAME
    value: >-
      http://aeccfb7ff9bfb4705b6218294a7346c3-2081802229.us-west-2.elb.amazonaws.com/prometheus/api/v1
  - name: QUERY
    value: >-
      aws_rds_database_connections_average{dimension_DBInstanceIdentifier="robotshopmysql"} > 1
  - name: TRANSFORM
    value: RAW
  - name: STEP
    value: '30'
  - name: DATA_COLUMN
    value: '1'
  - name: NO_RESULT_OVERWRITE
    value: 'Yes'
  - name: NO_RESULT_VALUE
    value: '0'
servicesProvided:
  - name: curl
    locationServiceName: curl-service.shared
```
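
The `DATA_COLUMN`, `NO_RESULT_OVERWRITE`, and `NO_RESULT_VALUE` settings suggest how a query response is reduced to a single SLI value. The sketch below shows one way that could work against the standard Prometheus range-query (`matrix`) response shape. The codebundle's actual handling is not shown in this commit, so `extract_value` and its parameters are hypothetical, and reading `DATA_COLUMN` as an index into each `[timestamp, value]` pair is an assumption.

```python
import json


def extract_value(body: str, data_column: int = 1,
                  no_result_overwrite: bool = True,
                  no_result_value: float = 0.0):
    """Reduce a Prometheus range-query response to one number.

    Assumes the standard `matrix` result shape, where each series carries
    `values` as a list of [timestamp, "value"] pairs.
    """
    payload = json.loads(body)
    series = payload.get("data", {}).get("result", [])
    if not series or not series[0].get("values"):
        # The query filters with `> 1`, so an idle database yields no series.
        return no_result_value if no_result_overwrite else None
    latest = series[0]["values"][-1]      # most recent [timestamp, value] pair
    return float(latest[data_column])     # column 1 holds the sample value


# Hand-written sample responses, not captured from a real endpoint.
with_data = json.dumps({"status": "success", "data": {"resultType": "matrix",
    "result": [{"metric": {}, "values": [[1730880000, "41"], [1730880030, "52"]]}]}})
empty = json.dumps({"status": "success", "data": {"resultType": "matrix", "result": []}})

print(extract_value(with_data))
print(extract_value(empty))
```

Falling back to `NO_RESULT_VALUE` keeps the SLI reporting a number even when the filtered query returns nothing, which would otherwise look like missing data.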
## RunBook / Mitigation
```YAML
location: location-01-us-west1
codeBundle:
  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
  ref: main
  pathToRobot: codebundles/rds-mysql-conn-count/runbook.robot
servicesProvided:
  - name: curl
    locationServiceName: curl-service.shared
configProvided:
  - name: MYSQL_USER
    value: admin
  - name: MYSQL_HOST
    value: robotshopmysql.example.us-west-2.rds.amazonaws.com
  - name: PROCESS_USER
    value: shipping
```
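
The `runbook.robot` file itself is not part of this README, so the following is only a sketch of the kind of logic a mitigation for sleeping connections might use: select sleeping connections owned by `PROCESS_USER` and emit one kill call per connection. The `information_schema.PROCESSLIST` columns and the `mysql.rds_kill` procedure exist in RDS MySQL; the function name `kill_statements` and the `min_idle_seconds` cutoff are hypothetical.

```python
# Query listing sleeping connections for one user (run via any MySQL client).
SLEEPING_QUERY = (
    "SELECT ID, USER, COMMAND, TIME FROM information_schema.PROCESSLIST "
    "WHERE COMMAND = 'Sleep' AND USER = %s"
)


def kill_statements(rows, process_user, min_idle_seconds=60):
    """Build one RDS kill call per sleeping connection of `process_user`.

    `rows` are (id, user, command, time) tuples as returned by SLEEPING_QUERY.
    """
    return [
        f"CALL mysql.rds_kill({int(conn_id)});"
        for conn_id, user, command, idle in rows
        if user == process_user and command == "Sleep" and idle >= min_idle_seconds
    ]


# Hand-written rows for illustration only.
rows = [(101, "shipping", "Sleep", 320), (102, "shipping", "Query", 2),
        (103, "admin", "Sleep", 900), (104, "shipping", "Sleep", 15)]
for stmt in kill_statements(rows, "shipping"):
    print(stmt)
```

Filtering again in Python, even though the query already restricts by user and command, keeps the function safe to call with rows from a broader query.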
### Assumptions & Pitfalls
These configs are placeholder YAML. Modify them to fit your environment, then paste them on the platform side.
101 changes: 101 additions & 0 deletions docs/runwhen/concepts.md
@@ -0,0 +1,101 @@
# RunWhen Concepts
- [RunWhen Concepts](#runwhen-concepts)
- [RunWhen Local](#runwhen-local)
  - [CheatSheet Generator](#cheatsheet-generator)
  - [Uploading Cluster Topology to the Platform](#uploading-cluster-topology-to-the-platform)
- [CodeCollections](#codecollections)
- [CodeBundles](#codebundles)

# RunWhen Local
- [source-code](https://github.com/runwhen-contrib/runwhen-local)
- [Helm Chart](https://github.com/runwhen-contrib/helm-charts/tree/main/charts/runwhen-local)
- [Upstream docs](https://docs.runwhen.com/public/v/runwhen-local/)

RunWhen Local has two core functions:
- Generate remediation scripts / CheatSheets from included templates for your local cluster
- Upload Cluster Topology to the RunWhen Platform

## CheatSheet Generator
At the moment RunWhen Local **does not possess the ability to discover issues** in
your cluster and suggest mitigation runbooks / codebundles.

**However, it discovers your Kubernetes resources and object names.**
From these, it generates a wide set of runbooks that are useful once you already know the
root cause. These runbooks contain documentation and pastable shell script
snippets for the issue you search for. The scripts / cheatsheets are pre-templated
with your namespaces and Kubernetes resource names.
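
The pre-templating idea can be illustrated with a small Python sketch. It is not RunWhen Local's template engine: the template text, the `render_cheatsheet` helper, and the discovered resources are all made up for illustration, and only the standard library's `string.Template` is used.

```python
from string import Template

# Hypothetical cheat-sheet template; real RunWhen Local templates differ.
RESTART_TEMPLATE = Template(
    "kubectl rollout restart deployment/$name -n $namespace\n"
    "kubectl rollout status deployment/$name -n $namespace"
)


def render_cheatsheet(resources):
    """Fill the template once per discovered (namespace, name) pair."""
    return [RESTART_TEMPLATE.substitute(namespace=ns, name=name)
            for ns, name in resources]


# Stand-in for the discovery result.
discovered = [("robot-shop", "shipping"), ("robot-shop", "cart")]
for snippet in render_cheatsheet(discovered):
    print(snippet, end="\n\n")
```

Each rendered snippet can be pasted into a shell as-is, which is the property the cheatsheets rely on.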

This collection of cheatsheets / runbooks, although not exhaustive, covers a significant portion
of recurring issues and healthcheck failures and can be useful to SREs for quick
resolution of incidents.

[Upstream Examples](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/user_guide-feature_overview)

## Uploading Cluster Topology to the Platform
The second core function of RunWhen Local is to upload cluster topology to the
RunWhen Platform so you can visualize the cluster workload map from a configured
RunWhen workspace.

- First, follow the documentation at [Upload to RunWhen Platform](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/upload-to-runwhen-platform#upload-from-the-cli)
  to generate the `uploadInfo.yaml` file
- Next, copy the contents of that YAML object into the `uploadInfo:[]` section
  of the helm [`values.yaml` file](https://github.com/runwhen-contrib/helm-charts/blob/main/charts/runwhen-local/values.yaml#L121)
- Once configured, it should look like this:
```YAML
uploadInfo:
  workspaceName: <your-workspace-name>
  token: <your token> # Do NOT add token and commit to git
  workspaceOwnerEmail: [email protected]
  papiURL: https://papi.beta.runwhen.com
  defaultLocation: location-01-us-west1 # available runwhen locations
```
- Pass the token from the helm CLI so that you do not leak it via git:
```bash
helm upgrade --install ${HELM_RELEASE_NAME} runwhen-contrib/runwhen-local \
--set uploadInfo.token=${RUNWHEN_PLATFORM_TOKEN} \
-f ${VALUES_FILE} -n ${NAMESPACE}
```

# CodeCollections
A CodeCollection is a group of CodeBundles that can be referenced and used on the RunWhen Platform.

*N.B. Currently, codecollections cannot be explicitly imported and run against your local cluster using RunWhen Local.*

Currently RunWhen has published two codecollections:
- [runwhen-public-codecollection](https://github.com/runwhen-contrib/rw-public-codecollection)
  - These codebundles usually run against services and don't involve a Shell / CLI component.
- [runwhen-cli-codecollection](https://github.com/runwhen-contrib/rw-cli-codecollection)
  - These are generally targeted towards SRE workloads and wrap various shell scripts and CLI tooling.

# CodeBundles
CodeBundles are specific detectors/mitigators of known SLI/SLO violations in a live software stack.

A CodeBundle comprises:
- Robot files
  - Scripts / Playbooks / tasksets written using [Robot Framework](), that either
    - Create and enforce RunWhen SLIs - `sli.robot`
    - Create mitigation runbooks in response to an SLO/SLI violation - `runbook.robot`
- Platform definitions of `{SLX, SLO, SLI, Runbook}` as `YAML` configurations
  - These do not need to be located in your repo; however, it's good practice to commit them to git.
  - These configurations wrap standard behaviors for interacting with the RunWhen Platform API, `papi`
    - Endpoint: `https://papi.beta.runwhen.com`
  - The RunWhen `YAML` configurations only matter once your codebundle is live on the RunWhen Platform; they currently play no role in local testing or in RunWhen Local.
- Test resources / scripts

In a local testing environment you only need to execute the `*.robot` files inside the provided container configurations:
- [Dockerfile](../../Dockerfile)
- [vscode/devcontainer](../../.devcontainer.json)


The usual call chain is as follows:
- Robot Scripts
- User variable and secret injection
- RunWhen Libraries
- RunWhen Services
- Wrapped shell CLI command / Platform SDK code execution
- or, direct shims to your shell scripts / python code when services are unavailable
- These tasks fetch the current value of a metric / state
- This metric value is then compared against the defined thresholds at `sli/slo.yaml` in the platform.
- If the Robot script just runs a set of tasks as a mitigation step, it returns either success or failure.
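
The call chain above can be mimicked locally with a few lines of Python. This is a sketch under assumptions, not RunWhen library code: `run_sli`, `run_runbook`, `fake_fetch`, and the `QUERY` / `API_TOKEN` environment variable names are hypothetical stand-ins for injected config, injected secrets, and a service shim.

```python
import os
from typing import Callable


def run_sli(fetch_metric: Callable[[dict, str], float]) -> float:
    """Mimic an SLI task: inject config and secret, then fetch one metric."""
    config = {"QUERY": os.environ.get("QUERY", "up")}   # user variable injection
    secret = os.environ.get("API_TOKEN", "")            # secret injection
    return fetch_metric(config, secret)                 # library / service call


def run_runbook(steps: list[Callable[[], None]]) -> bool:
    """Mimic a runbook: run mitigation steps, report success or failure."""
    try:
        for step in steps:
            step()
    except Exception:
        return False
    return True


# Local shim standing in for a RunWhen service that is unavailable locally.
def fake_fetch(config: dict, secret: str) -> float:
    return 42.0


print(run_sli(fake_fetch))
print(run_runbook([lambda: None]))
```

The SLI path only returns a value; comparing it against thresholds happens on the platform, while the runbook path reduces to a success or failure result.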

More concepts and non-trivial FAQs around writing CodeBundles are explained at [Contributing to CodeCollections/CodeBundles](contrib.md)
56 changes: 56 additions & 0 deletions docs/runwhen/contrib.md
@@ -0,0 +1,56 @@
# Contributing to CodeCollections/CodeBundles

## Creating a New CodeCollection
### Forking the template repository

## Writing a Non-trivial CodeBundle
### Directory structure / Scaffolding



#########
Repository Setup
Introduction to Robot Framework Scripts (how it interacts with RunWhen)
Calling bash with relative paths
Secret handling
Suite Initialization
Library usage
Explain the call chain
Library Setup
How to get an exhaustive list of available libraries
CLI repo
Public repo
Explain what libraries would be auto-fetched by devcontainer tooling
Core
CLI
What needs to be added for specific libraries that are used in a robot script
Paths
Running a test with local docker
Adding additional binaries to devcontainer as needed
Mysql-client
Postgres-client
Redis-client
Configuring Env / secrets
Expose endpoints
Local docker network
Expose from test cluster
Test by using docker run on localhost
Test in your live environment
Deploy as a k8s job
Give an example
Testing on Runwhen Platform
Connecting test env/cluster to runwhen
Runwhen-local upload
If a Robot script needs additional dependencies, such as CLI tools, the devs need to be informed; for now they will handle the update on the platform side
Mysql-client
Postgres-client
Redis-client
Registering your first codecollection to Runwhen-platform
Mention that this may be in private as per developer discretion
How to configure the YAML to test
Branch name length limitations
Expose metric endpoints so that they are accessible to runwhen-platform codebundles
Configuring Env / secrets
Running the test
Checking logs
Checking for errors
