Merge pull request #15 from infracloudio/f/docs

Overhaul of docs
infracloudio · Nov 6, 2024 · 703d3f8
2 parents 11c2ffd + 2845852
Showing 4 changed files with 264 additions and 22 deletions.
diff --git a/README.md b/README.md
@@ -8,32 +8,23 @@
 </p>
 <a href='https://codespaces.new/runwhen-contrib/codecollection-template?quickstart=1'><img src='https://github.com/codespaces/badge.svg' alt='Open in GitHub Codespaces' style='max-width: 100%;'></a>
 
-
-# codecollection-template
-A hello-world-style template for codecollection authors to get started writing codebundles. This template contains the minimum file structure expected by the RunWhen platform.
-
 [![Build](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml/badge.svg)](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml)
 
-## Getting Started
-Looking to be a contributor for CodeCollections or start your own? We'd love to collaborate! Head on over to our [public docs](https://docs.runwhen.com/public/runwhen-authors/getting-started-with-codecollection-development) to get started.
 
-File Structure overview of devcontainer:
-```
--/app/
-    |- auth/ #store secrets here, it should already be properly gitignored for you
-    |- codecollection/
-    |   |- codebundles/ # stores codebundles that can be run
-    |   |- libraries/ # stores python keyword libraries used by codebundles
-    |- dev_facade/ # provides interfaces equivalent to those used on the platform, but just dry runs the keywords to assist with development
-    ...
-```
+[Upstream Docs - CodeCollection Template](https://github.com/runwhen-contrib/codecollection-template/blob/main/README.md)
 
-The included script `ro` wraps the `robot` RobotFramework binary, and includes some extra functionality to write logs to a consistent location for viewing in a HTTP server at http://localhost:3000/ that is always running as part of the devcontainer.
+# InfraCloud RunWhen CodeCollection
 
-### Quickstart
+This CodeCollection aims to build a repository of CodeBundles that address the reproducible incident scenarios in [Infracloud/sre-stack](https://github.com/infracloudio/sre-stack/). The goals are to:
 
-Navigate to the codebundle directory
-`cd codecollection/codebundles/hello_world/`
+- Set meaningful SLOs on Services and their dependencies
+  - DBs
+  - Queues
+  - Caches
+  - Gateways and proxies
+- Create SLIs to continuously monitor the health of services and dependencies
+- Create mitigation runbooks for scenarios where the root cause can be identified deterministically
 
-Run the codebundle
-`ro sli.robot`
+## Additional Docs
+- [RunWhen Concepts](docs/runwhen/concepts.md)
+- [Contributing to CodeCollections/CodeBundles](docs/runwhen/contrib.md)
diff --git a/codebundles/rds-mysql-conn-count/README.md b/codebundles/rds-mysql-conn-count/README.md
@@ -0,0 +1,94 @@
+# CodeBundle - RDS MySQL Connection Count
+
+This codebundle detects and resolves incidents caused by too many sleeping connections in MySQL.
+
+- Target Service - MySQL
+- Cloud Platform - AWS/RDS
+
+## SLX
+```YAML
+statement: RDS MySql connections should stay within 80% of the maximum connection count.
+alias: RDS MySql Connections Count
+metricType: gauge
+asMeasuredBy: Score based on Prometheus query
+icon: Cloud
+owners:
+  - [email protected]
+imageURL: >-
+  https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/ns.svg
+
+```
+## SLO / Service Level Objective
+Example:
+```YAML
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  pathToYaml: codebundles/slo-default/queries.yaml
+  ref: main
+sloSpecType: simple-mwmb
+objective: 95
+threshold: 48
+operand: lt
+```
+
+## SLI / Service Level Indicator
+```YAML
+displayUnitsLong: OK
+displayUnitsShort: ok
+locations:
+  - location-01-us-west1
+description: >-
+  Watch RDS MySql connection count
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  ref: main
+  pathToRobot: codebundles/rds-mysql-conn-count/sli.robot
+# read more about intervalStrategy here: https://docs.runwhen.com/public/runwhen-platform/feature-overview/points-on-the-map-slxs/service-level-indicators-slis/interval-strategies
+intervalStrategy: intermezzo
+intervalSeconds: 30
+configProvided:
+  # Change PROMETHEUS_HOSTNAME to your own endpoint. Currently the endpoint must be publicly reachable.
+  - name: PROMETHEUS_HOSTNAME
+    value: >-
+      http://aeccfb7ff9bfb4705b6218294a7346c3-2081802229.us-west-2.elb.amazonaws.com/prometheus/api/v1
+  - name: QUERY
+    value: >-
+      aws_rds_database_connections_average{dimension_DBInstanceIdentifier="robotshopmysql"} > 1
+  - name: TRANSFORM
+    value: RAW
+  - name: STEP
+    value: '30'
+  - name: DATA_COLUMN
+    value: '1'
+  - name: NO_RESULT_OVERWRITE
+    value: 'Yes'
+  - name: NO_RESULT_VALUE
+    value: '0'
+servicesProvided:
+  - name: curl
+    locationServiceName: curl-service.shared
+```
+
+## RunBook / Mitigation
+
+```YAML
+location: location-01-us-west1
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  ref: main
+  pathToRobot: codebundles/rds-mysql-conn-count/runbook.robot
+servicesProvided:
+  - name: curl
+    locationServiceName: curl-service.shared
+configProvided:
+  - name: MYSQL_USER
+    value: admin
+  - name: MYSQL_HOST
+    value: robotshopmysql.example.us-west-2.rds.amazonaws.com
+  - name: PROCESS_USER
+    value: shipping
+```
+
+### Assumptions & Pitfalls
+
+These configs are placeholder YAML. Modify them to suit your environment, then paste them into the RunWhen Platform.
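
For illustration, the SLI configuration above can be read as an instant query against the Prometheus HTTP API, with an empty result replaced by a default value. The sketch below is not taken from `sli.robot`. It assumes that `DATA_COLUMN: '1'` selects the sample value from each `[timestamp, value]` pair, and that `NO_RESULT_OVERWRITE` / `NO_RESULT_VALUE` substitute a value when the query matches nothing. The function names and the example hostname are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API.

    base_url is expected to end in /api/v1, as PROMETHEUS_HOSTNAME does above.
    """
    return f"{base_url.rstrip('/')}/query?{urlencode({'query': query})}"


def extract_value(response_body: str, data_column: int = 1,
                  no_result_overwrite: bool = True,
                  no_result_value: float = 0.0) -> float:
    """Return the first sample from a Prometheus instant-vector response.

    Assumption: an empty result set is replaced by no_result_value when
    no_result_overwrite is set, mirroring NO_RESULT_OVERWRITE / NO_RESULT_VALUE.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    results = payload["data"]["result"]
    if not results:
        if no_result_overwrite:
            return no_result_value
        raise ValueError("query returned no samples")
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][data_column])


if __name__ == "__main__":
    url = build_query_url(
        "http://prometheus.example.com/prometheus/api/v1",  # hypothetical host
        'aws_rds_database_connections_average'
        '{dimension_DBInstanceIdentifier="robotshopmysql"} > 1',
    )
    print(url)
    sample = ('{"status":"success","data":{"resultType":"vector",'
              '"result":[{"metric":{},"value":[1730880000,"52"]}]}}')
    print(extract_value(sample))
```

Because the configured query ends in `> 1`, Prometheus returns an empty vector whenever no series exceeds that value, which is the case the no-result default is meant to cover.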
diff --git a/docs/runwhen/concepts.md b/docs/runwhen/concepts.md
@@ -0,0 +1,101 @@
+# RunWhen Concepts
+- [RunWhen Concepts](#runwhen-concepts)
+- [RunWhen Local](#runwhen-local)
+  - [CheatSheet Generator](#cheatsheet-generator)
+  - [Uploading Cluster Topology to the Platform](#uploading-cluster-topology-to-the-platform)
+- [CodeCollections](#codecollections)
+- [CodeBundles](#codebundles)
+
+# RunWhen Local
+- [source-code](https://github.com/runwhen-contrib/runwhen-local)
+- [Helm Chart](https://github.com/runwhen-contrib/helm-charts/tree/main/charts/runwhen-local)
+- [Upstream docs](https://docs.runwhen.com/public/v/runwhen-local/)
+
+RunWhen Local has two core functions:
+- Generate remediation scripts / CheatSheets from included templates for your local cluster
+- Upload Cluster Topology to the RunWhen Platform
+
+## CheatSheet Generator
+At the moment RunWhen Local **does not possess the ability to discover issues** in
+your cluster and suggest mitigation runbooks / codebundles.
+
+**However, it does discover your Kubernetes resources and object names.**
+From these it generates a wide set of runbooks, which are useful once you already know the
+root cause. These runbooks contain documentation and pastable shell script
+snippets for the issue you search for. The scripts / cheatsheets are pre-templated
+with your namespaces and Kubernetes resource names.
+
+This collection of cheatsheets / runbooks, although not exhaustive, covers a significant portion
+of recurring issues and healthcheck failures and can be useful to SREs for quick
+resolution of incidents.
+
+[Upstream Examples](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/user_guide-feature_overview)
+
+## Uploading Cluster Topology to the Platform
+The second core function of RunWhen Local is to upload cluster topology to the
+RunWhen Platform so you can visualize the cluster workload map from a configured
+RunWhen workspace.
+
+- First, follow the documentation at [Upload to RunWhen Platform](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/upload-to-runwhen-platform#upload-from-the-cli)
+  to generate the `uploadInfo.yaml` file.
+- Next, copy the contents of that YAML object into the `uploadInfo:[]` section
+  of the Helm [`values.yaml` file](https://github.com/runwhen-contrib/helm-charts/blob/main/charts/runwhen-local/values.yaml#L121)
+- Once configured it should look like this:
+  ```YAML
+  uploadInfo:
+    workspaceName: <your-workspace-name>
+    token: <your token> # Do NOT commit the token to git
+    workspaceOwnerEmail: [email protected]
+    papiURL: https://papi.beta.runwhen.com
+    defaultLocation: location-01-us-west1 # available runwhen locations
+  ```
+- Pass the token from the Helm CLI so that it is not leaked via git:
+  ```bash
+  helm upgrade --install "${HELM_RELEASE_NAME}" runwhen-contrib/runwhen-local \
+    --set uploadInfo.token="${RUNWHEN_PLATFORM_TOKEN}" \
+    -f "${VALUES_FILE}" -n "${NAMESPACE}"
+  ```
+
+# CodeCollections
+A CodeCollection is a group of CodeBundles that can be referenced and used in the RunWhen Platform.
+
+*N.B. Currently, codecollections cannot be explicitly imported and run against your local cluster using RunWhen Local.*
+
+Currently RunWhen has published two codecollections:
+- [runwhen-public-codecollection](https://github.com/runwhen-contrib/rw-public-codecollection)
+  - These codebundles usually run against services and do not involve a shell / CLI component.
+- [runwhen-cli-codecollection](https://github.com/runwhen-contrib/rw-cli-codecollection)
+  - These codebundles are generally targeted at SRE workloads and wrap various shell scripts and CLI tooling.
+
+# CodeBundles
+CodeBundles are specific detectors / mitigators of known SLI/SLO violations in a live software stack.
+
+A CodeBundle comprises:
+- Robot files
+  - Scripts / playbooks / tasksets written using [Robot Framework](https://robotframework.org/) that either
+    - Create and enforce RunWhen SLIs - `sli.robot`
+    - Create mitigation runbooks in response to an SLO/SLI violation - `runbook.robot`
+- Platform definitions of `{SLX, SLO, SLI, Runbook}` as `YAML` configurations
+  - These do not need to be located in your repo; however, it is good practice to commit them to git.
+  - These configurations wrap standard behaviors for interacting with the RunWhen Platform API, `papi`
+    - Endpoint: `https://papi.beta.runwhen.com`
+  - The RunWhen `YAML` configurations are only pertinent when your codebundle is live on the RunWhen Platform; for now they play no role in either local testing or RunWhen Local.
+- Test resources / scripts
+
+In a local testing environment you only need to execute the `*.robot` files inside the provided container configurations:
+- [Dockerfile](../../Dockerfile)
+- [vscode/devcontainer](../../.devcontainer.json)
+
+
+The usual call chain is as follows:
+- Robot Scripts
+  - User variable and secret injection
+  - RunWhen Libraries
+    - RunWhen Services
+      - Wrapped shell CLI command / Platform SDK code execution
+      - or, direct shims to your shell scripts / Python code when services are unavailable
+      - These tasks fetch the current value of a metric / state
+        - This metric value is then compared against the thresholds defined in `sli/slo.yaml` on the platform.
+      - If the Robot script just runs a set of tasks as a mitigation step, it returns either success or failure.
+
+More concepts and non-trivial FAQs around writing CodeBundles are explained at [Contributing to CodeCollections/CodeBundles](contrib.md)
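
The comparison step at the end of the call chain can be pictured as follows. This is a minimal sketch, not RunWhen Platform code: the `OPERANDS` table and the `meets_threshold` function are hypothetical, and only `lt` appears in this pull request (the `gt` entry is added purely for illustration). It assumes that `operand: lt` with `threshold: 48`, as in the RDS MySQL example SLO, means a measurement is healthy when it is strictly below 48.

```python
import operator

# Hypothetical mapping from the `operand` field to a comparison function.
# Only "lt" is attested in the example SLO; "gt" is an illustrative assumption.
OPERANDS = {
    "lt": operator.lt,
    "gt": operator.gt,
}


def meets_threshold(value: float, operand: str, threshold: float) -> bool:
    """Return True when `value <operand> threshold` holds."""
    try:
        compare = OPERANDS[operand]
    except KeyError:
        raise ValueError(f"unsupported operand: {operand!r}") from None
    return compare(value, threshold)


if __name__ == "__main__":
    # Connection counts checked against the example threshold of 48.
    for connections in (12, 48, 60):
        print(connections, meets_threshold(connections, "lt", 48))
```

Under this reading, a mitigation runbook would only be relevant for measurements where `meets_threshold` returns `False`.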
diff --git a/docs/runwhen/contrib.md b/docs/runwhen/contrib.md
@@ -0,0 +1,56 @@
+# Contributing to CodeCollections/CodeBundles
+
+## Creating a New CodeCollection
+### Forking the template repository
+
+## Writing a Non-trivial CodeBundle
+### Directory structure / Scaffolding
+
+
+
+#########
+- Repository Setup
+- Introduction to Robot Framework Scripts (how it interacts with RunWhen)
+  - Calling bash with relative paths
+  - Secret handling
+  - Suite Initialization
+  - Library usage
+  - Explain the call chain
+- Library Setup
+  - How to get an exhaustive list of available libraries
+    - CLI repo
+    - Public repo
+  - Explain what libraries would be auto-fetched by devcontainer tooling
+    - Core
+    - CLI
+  - What needs to be added for specific libraries that are used in a robot script
+    - Paths
+- Running a test with local docker
+  - Adding additional binaries to devcontainer as needed
+    - Mysql-client
+    - Postgres-client
+    - Redis-client
+  - Configuring Env / secrets
+  - Expose endpoints
+    - Local docker network
+    - Expose from test cluster
+  - Test by using docker run on localhost
+- Test in your live environment
+  - Deploy as a k8s job
+    - Give an example
+- Testing on RunWhen Platform
+  - Connecting test env/cluster to RunWhen
+    - RunWhen Local upload
+  - If a Robot script needs additional dependencies, such as CLI tools, the devs need to be informed; for now they will handle the update on the platform side
+    - Mysql-client
+    - Postgres-client
+    - Redis-client
+  - Registering your first codecollection to RunWhen Platform
+    - Mention that this may be in private as per developer discretion
+  - How to configure the YAML to test
+    - Branch name length limitations
+  - Expose metric endpoints so that they are accessible to RunWhen Platform codebundles
+  - Configuring Env / secrets
+  - Running the test
+  - Checking logs
+  - Checking for errors