anispate/osd-alert-analysis

OSD Alert Analysis

Web application and supporting CLI tools for analyzing OpenShift Dedicated PagerDuty alert data.

Requirements

  • Python 3.9+
    • Don't forget to run pip3 install -r requirements.txt*
  • a MariaDB/MySQL server*
  • a PagerDuty API token
  • Docker/Podman/K8s/OpenShift (optional)

*If pip complains about being unable to find a specific version of a dependency module, it's probably because you're using a version of Python older than 3.9. If pip instead complains about failing to build the MariaDB module, make sure you have the MariaDB C Connector, GCC, and the Python headers installed (e.g., on Fedora/RHEL: dnf install mariadb-connector-c mariadb-connector-c-devel gcc python39-devel).

Quick Start

This section will help you set up a basic testing/development environment. These settings should not be used for production installations.

Clone this repo and cd into it. Assuming you're using RHEL or Fedora, run the following commands. Users of other Linux distributions should translate commands accordingly (e.g., Debian/Ubuntu users can use apt instead of dnf, non-podman users can replace podman with docker, etc.).

# Install dependencies
sudo dnf install python39 python39-devel mariadb-connector-c mariadb-connector-c-devel gcc podman
pip3.9 install -r requirements.txt
# We'll use a MariaDB container as our local database 
podman pull mariadb
# Fill in the <bracketed> values before running the command below, which creates a root user, empty database, and database-owning user
# You may use any arbitrary alphanumeric string for each of these values
podman run --detach --env MARIADB_DATABASE=<db_name> --env MARIADB_USER=<db_username> --env MARIADB_PASSWORD=<db_password> --env MARIADB_ROOT_PASSWORD=<root_password> --name oaa-mariadb -p 3306:3306 mariadb:latest

After creating the empty database as shown in the above step, create a basic .env file at the root of the repo using the template below (filling in the values with the same values you used above when creating your database, where applicable).

AA_PD_API_TOKEN=<your PagerDuty API token>
AA_PD_TEAMS=<a colon-separated list of PagerDuty team IDs>
AA_RO_DB_STRING=mariadb+mariadbconnector://<db_username>:<db_password>@127.0.0.1:3306/<db_name>
AA_RW_DB_STRING=mariadb+mariadbconnector://<db_username>:<db_password>@127.0.0.1:3306/<db_name>
AA_QUESTION_CLASSES=QMostFrequent:QNeverAcknowledged:QNeverAcknowledgedSelfResolved:QAcknowledgedUnresolved:QSelfResolvedImmediately:QSREResolvedImmediately:QFlappingShift
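
Before running anything, it can help to confirm that the template has been fully filled in. The following is a minimal sketch, not part of this repo: parse_env and check_env are hypothetical helpers, and treating all five keys from the template as required is an assumption. It only handles plain KEY=VALUE lines and does not strip inline comments.

```python
import re

# Keys taken from the .env template above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = (
    "AA_PD_API_TOKEN",
    "AA_PD_TEAMS",
    "AA_RO_DB_STRING",
    "AA_RW_DB_STRING",
    "AA_QUESTION_CLASSES",
)

# Matches template placeholders such as <db_name> that were never filled in.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(text):
    """Return a list of human-readable problems; an empty list means OK."""
    values = parse_env(text)
    problems = []
    for key in REQUIRED_KEYS:
        if not values.get(key):
            problems.append(f"{key} is missing or empty")
        elif PLACEHOLDER.search(values[key]):
            problems.append(f"{key} still contains a <placeholder>")
    return problems
```

To use it, read your .env file into a string (e.g., open(".env").read()) and print whatever check_env returns.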

You may now populate your incident cache database using updater.py. As explained below, fully populating your database may take hours. Instead, you can download only the last 100 incidents into your database using the following command (which should take only 5-10 minutes).

python3 updater.py --limit 100 --verbose

Once your database is populated with at least 10 incidents/alerts, you can start a development web server by running the wsgi.py file, as shown below. Note that you may have to replace python3 with python39, python3.9, or similar depending on your distribution.

python3 wsgi.py

This command should print a URL that you can open in your browser to view the web UI. From this point on, you can make changes to the codebase, then kill and relaunch the wsgi.py process to see your changes take effect. If you reboot your computer or otherwise stop the Docker/Podman container hosting your database, you can relaunch it by running podman start oaa-mariadb.

You have now completed Quick Start. The following sections contain more detailed information about how to configure and use OAA.

Initial Caching Database Setup

The PagerDuty API is too slow/rate-limited to be used directly by the web application. Instead, you'll need to set up, populate, and regularly refresh a caching database.

Create an empty database in your SQL server and two service accounts: one with full (admin) privileges over the database, and another with read-only privileges. Create a file named .env at the root of this repo and fill it in like so:

AA_PD_API_TOKEN=<your PagerDuty API token>
AA_PD_TEAMS=A_list_of:colon-separated_PagerDuty_team_IDs:that_look_like_this:XY1234Z

# DB_STRINGs should be SQLAlchemy database engine URLs.
# See docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine
AA_RO_DB_STRING=sqlite+pysqlite:///:memory: # Access a read-only account
AA_RW_DB_STRING=sqlite+pysqlite:///:memory: # Access a read-write account

# QUESTION_CLASSES is a list of the questions you'd like to display on the web UI.
# These should be class names from questions.py
AA_QUESTION_CLASSES=QMostFrequent:QNeverAcknowledgedSelfResolved:QFlappingShift

Note: if you're just experimenting/testing, feel free to leave the SQLite database string shown above as is. Just know that this creates the database in memory, and it will be dropped as soon as the updater script/web application exits.
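
AA_PD_TEAMS and AA_QUESTION_CLASSES are both colon-separated lists. The sketch below shows one way such values could be split and sanity-checked; it is not the application's own parsing code. split_colon_list and unknown_questions are hypothetical helpers, KNOWN_QUESTIONS only contains the class names mentioned in this README (questions.py may define others), and the team ID AB5678C is made up for illustration.

```python
# Question class names that appear in this README. questions.py is the
# source of truth and may define more; this set is an assumption.
KNOWN_QUESTIONS = {
    "QMostFrequent",
    "QNeverAcknowledged",
    "QNeverAcknowledgedSelfResolved",
    "QAcknowledgedUnresolved",
    "QSelfResolvedImmediately",
    "QSREResolvedImmediately",
    "QFlappingShift",
}


def split_colon_list(raw):
    """Split a colon-separated config value into a list of non-empty items."""
    return [item.strip() for item in raw.split(":") if item.strip()]


def unknown_questions(raw):
    """Return configured question names that are not in KNOWN_QUESTIONS."""
    return [name for name in split_colon_list(raw) if name not in KNOWN_QUESTIONS]


# Example values; AB5678C is a made-up team ID.
teams = split_colon_list("XY1234Z:AB5678C")
typos = unknown_questions("QMostFrequent:QFlapingShift")
```

A non-empty result from unknown_questions usually points to a typo in the class name.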

Then populate the database using updater.py, as specified in the help message shown below.

$ ./updater.py --help
usage: updater.py [-h] [-s SINCE] [-u UNTIL] [-l LIMIT] [-v]

Updates the OSD alert-analysis tool's cache of PagerDuty incidents and alerts

options:
  -h, --help            show this help message and exit
  -s SINCE, --since SINCE
                        start of the caching time window in ISO-8601 format (default: 30d ago)
  -u UNTIL, --until UNTIL
                        end of the caching time window in ISO-8601 format (default: now)
  -l LIMIT, --limit LIMIT
                        maximum number of incidents to cache (default: 10000)
  -b  DAYS, --backfill  DAYS
                        do a normal run, then check to see if the oldest record in the cache is 
                        at least DAYS days old. if it's not, update cache until it is, batching 
                        in sizes of LIMIT if necessary. --since and --until have no effect after
                        the initial run.
  -v, --verbose

Note that this process can take several hours. If successful, output will look like:

$ ./updater.py --since="2022-01-01" --until="2022-02-01"
Updating incident cache...done. Cached 9786 incidents.
Updating alert cache...done. Cached 10012 alerts.
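
For reference, the time window described in the help message can be reasoned about with a few lines of standard-library Python. This is only an illustrative sketch: default_window and window_days are hypothetical helpers, and the use of UTC is an assumption, since this README does not say how updater.py handles time zones.

```python
from datetime import datetime, timedelta, timezone


def default_window(now=None, days=30):
    """Return (since, until) as ISO-8601 strings for a window ending at `now`."""
    if now is None:
        now = datetime.now(timezone.utc)  # UTC is an assumption
    since = now - timedelta(days=days)
    return since.isoformat(), now.isoformat()


def window_days(since, until):
    """Number of whole days between two ISO-8601 dates or datetimes."""
    return (datetime.fromisoformat(until) - datetime.fromisoformat(since)).days
```

For example, window_days("2022-01-01", "2022-02-01") returns 31, which is the window used in the sample run above.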

We recommend setting up a cron job that runs ./updater.py --backfill 90 once or twice per day in order to cache at least 90 days of history and refresh the 10,000 most-recently-created incidents. Incidents older than 90 days are not deleted from the cache; they are simply no longer updated.
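
The --backfill behavior described in the help message (keep fetching older batches of at most LIMIT records until the oldest cached record is at least DAYS days old) can be sketched roughly as follows. This is not the code in updater.py: the cache is modeled as a plain list of creation timestamps, and fetch_before is a hypothetical callable standing in for a PagerDuty query.

```python
from datetime import datetime, timedelta, timezone


def backfill(cache, fetch_before, days, limit, now):
    """Extend `cache` until its oldest timestamp is at least `days` days old.

    cache        -- list of record creation times (stand-in for the real DB)
    fetch_before -- hypothetical callable(until, limit) returning up to
                    `limit` timestamps strictly older than `until`
    """
    target = now - timedelta(days=days)
    while not cache or min(cache) > target:
        oldest = min(cache) if cache else now
        batch = fetch_before(oldest, limit)
        if not batch:
            break  # upstream has nothing older, so stop early
        cache.extend(batch)
    return cache


def make_fake_source(timestamps):
    """Build a fetch_before callable over an in-memory list, for testing."""
    def fetch_before(until, limit):
        older = sorted((t for t in timestamps if t < until), reverse=True)
        return older[:limit]
    return fetch_before
```

With one record per day and a LIMIT of 25, a 90-day backfill in this sketch takes four batches and stops once the oldest cached record is 100 days old.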

Web Application Setup

We recommend running the web application as a container. This project conforms to the Source-To-Image standard, so you have several build options.

Note: the following instructions assume you've got your database running, populated, and network-accessible to the execution environment (e.g., container) you're about to create.

Building and running a Docker image locally

Clone this repo, cd into it, and fill out your .env file (see above). Then run:

docker build . --tag "alert-analysis:latest"
docker run -d -p 8080:8080 --name my_aa --env-file .env alert-analysis:latest

Then navigate to http://localhost:8080 in your browser to see the UI. If you get a blank screen, the application is probably having trouble connecting to your database. Run docker logs my_aa and see the note above for more details.
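
When debugging a blank screen, one quick check is whether the database host and port in your DB string are reachable at all from where the application runs. The sketch below uses only the standard library; db_endpoint and can_connect are hypothetical helpers, and falling back to port 3306 (the MariaDB/MySQL default) when the URL omits a port is an assumption. Keep in mind that inside a container, 127.0.0.1 refers to the container itself rather than to your host machine.

```python
import socket
from urllib.parse import urlsplit


def db_endpoint(db_string, default_port=3306):
    """Extract (host, port) from a SQLAlchemy-style database URL."""
    parts = urlsplit(db_string)
    return parts.hostname, parts.port or default_port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful TCP connection does not prove that the credentials or database name are correct, only that the server is reachable.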

Deploying on OpenShift

Create and fill out your .env file (see above). We'll use oc new-app to build our image. If your OpenShift cluster can access this repository (e.g., because this is a public repo and your cluster has internet access), run the following to create an app that OpenShift can automatically rebuild whenever an update is pushed to main*:

export REPO_URL="<Git-clone-able SSH or HTTPS repo URL>"
oc new-app "$REPO_URL" --strategy=source --env-file=.env --name=my-aa
oc patch svc/my-aa --type=json -p '[{"op": "replace", "path": "/spec/ports/0/port", "value":80}]'

If your OpenShift cluster does not have access to this repo (e.g., because this is a private or VPN-restricted repo), clone the repo onto your local machine and run the following to generate a one-time binary build:

export SRC_PATH="<full path to a local clone of this repo>"
oc new-app --binary --strategy=docker --env-file=.env --name=my-aa
oc start-build my-aa --from-dir="$SRC_PATH"
oc expose deployment my-aa --port 80 --target-port 8080

Note: the use of the --env-file flag bakes your .env directly into the generated Deployment in plaintext. A much more secure deployment would instead use Secrets and ConfigMaps to provide the necessary config values to the running pod.
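
As a rough illustration of the Secret-based approach, the sketch below builds a Kubernetes Secret manifest from a dictionary of config values. It is not part of this repo: secret_manifest is a hypothetical helper and my-aa-env is a made-up Secret name. The only Kubernetes-specific detail relied on here is that values under a Secret's data field must be base64-encoded.

```python
import base64
import json


def secret_manifest(name, values):
    """Build an Opaque Secret manifest with base64-encoded values."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": data,
    }


# Example with a dummy token; "my-aa-env" is a hypothetical Secret name.
manifest = secret_manifest("my-aa-env", {"AA_PD_API_TOKEN": "abc"})
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be saved to a file and applied with oc apply -f, after which the Deployment would need to reference the Secret (e.g., via envFrom) instead of carrying the values in plaintext.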

Finally, expose the app to the world outside the cluster using:

oc expose svc/my-aa
echo "Now head to http://$(oc get route/my-aa -o jsonpath={.spec.host})"

*To find the webhook URL needed for Git-triggered builds, open the OpenShift web console, find the BuildConfig generated by oc new-app, scroll to the bottom of the page, and copy the webhook URL shown.

Web Application Usage

The web application is currently a single-page collection of tables answering the questions specified by the AA_QUESTION_CLASSES config value. The columns of each table support sorting (click the little arrows next to the column name) and filtering (enter your search term or filter into the cell below the column name).

Errata

  • Some alert names are abbreviated when loaded into the cache. The Alert.standardize_name() function in models.py shows how this works.
