Note: This is the public release repository, which contains mostly build artifacts. Development for this action takes place in the push-gha-metrics-action-source repository.
Github's built-in insights for Github Actions lack easily actionable information for anyone looking to optimize their CI/CD pipelines. The following questions are difficult to answer at a glance with the tools the Github UI provides. The goal of this action is to answer them gradually, starting with the highest-impact ones.
Question | Over Time | Over Commits, Pull Requests, Releases, Etc | Within A Repository | Across All Repositories in an Organization |
---|---|---|---|---|
Which workflows are taking up the most time? | ✔️ | ✔️ | ✔️ | ✔️ |
Which job takes up the most time? | ✔️ | ✔️ | ✔️ | ✔️ |
Which repository is responsible for the most run time? | ✔️ | ✔️ | ✔️ | ✔️ |
How often does a workflow fail? | ❌ | ❌ | ❌ | ❌ |
How often does a job fail? | ❌ | ❌ | ❌ | ❌ |
How long is a job queued for? | 🏗️ | 🏗️ | 🏗️ | 🏗️ |
How long is a workflow queued for? | 🏗️ | 🏗️ | 🏗️ | 🏗️ |
How many jobs are currently being executed? | 🏗️ | 🏗️ | 🏗️ | 🏗️ |
We want to answer these questions so we are able to identify bottlenecks in developer productivity and prioritize them using quantitative analysis.
This action creates logs that contain information about the currently executing Github Actions job, then pushes them to a Loki endpoint for metrics processing. From there, we visualize these metrics in Grafana, which lets us answer the questions above.
sequenceDiagram
participant L as Loki
participant F as Grafana
participant A as Metrics Action
participant J1 as Workflow Run A: Job Run B
participant G as Github
G-->+J1: Start Job Run B
J1->>J1: Executes Step 1
Note Over J1,A: Let step "K" be the step that contains this action
J1->>A: Executes Step K
A->>A: Detects that it's not in "post-step" phase, no-ops
J1->>J1: Executes Step N
J1->>J1: Executes Post-Step N
J1->>A: Executes Post-Step K
A->>A: Detects that it is in "post-step phase"
rect rgb(200, 150, 255)
Note Over G, A: Local and remote metadata collection
A->>A: Take current timestamp
A->>J1: Get local job metadata on runner
A->>G: Request remote job metadata
G->>A: Return remote job metadata
A->>A: Merge metadata
A->>A: Calculate execution duration
end
A->>L: Push calculated metrics as Loki Logs
J1->>J1: Executes Post-Step 1
J1-->-G: Finish Job Run B
L->>L: Store calculated metrics
F->>L: Send metrics query
L->>L: Perform metrics query
L->>F: Send query results
F->>F: Visualize metrics
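The metadata collection and push steps in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the action's actual implementation: the function names, the metadata keys (`repository`, `job_key`), and the label set are assumptions made for the example. The only external fact relied on is the JSON body shape accepted by Loki's push API (`streams`, each with a `stream` label map and `values` as pairs of nanosecond timestamp string and log line).

```python
import json
from datetime import datetime, timezone


def parse_github_time(value: str) -> datetime:
    """Parse a Github API timestamp such as '2024-01-01T12:00:00Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def build_loki_payload(local: dict, remote: dict, now: datetime) -> dict:
    """Merge metadata, compute the duration so far, and wrap it in a Loki push body."""
    merged = {**local, **remote}  # remote values win on key collisions
    started = parse_github_time(merged["started_at"])
    merged["duration_seconds"] = (now - started).total_seconds()

    # Assumption: keep stream labels few and bounded; per-run details
    # (ids, timestamps, durations) stay inside the log line instead.
    labels = {"source": "gha-metrics", "repository": merged["repository"]}
    timestamp_ns = str(int(now.timestamp()) * 1_000_000_000)
    line = json.dumps(merged, sort_keys=True)
    return {"streams": [{"stream": labels, "values": [[timestamp_ns, line]]}]}


# Hypothetical sample data standing in for the runner context and the API response.
local_meta = {"repository": "my-org/my-repo", "workflow": "A", "job_key": "B"}
remote_meta = {"name": "B", "started_at": "2024-01-01T12:00:00Z"}
now = datetime(2024, 1, 1, 12, 1, 30, tzinfo=timezone.utc)

payload = build_loki_payload(local_meta, remote_meta, now)
print(json.dumps(payload, indent=2))
```

In a real run, a body like this would be sent as JSON in an HTTP POST to the Loki push endpoint (`/loki/api/v1/push`), with the credentials and tenant configured through the action's inputs.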
This action requires the following permissions:
permissions:
  actions: read
Make sure you explicitly grant this permission to the Github Actions token if you have the permissions key set in your workflow.
Use this action as the first step of every job in each workflow you'd like to collect metrics about. Every job name in your workflow must be unique; see Job Name vs Id and Matrices in the Notes section.
See the action.yml file for information on inputs that this action accepts.
This project is synced through a release pipeline implemented as Github workflows. There are two release jobs: one for official releases and one for snapshot releases. Both jobs execute the shell script found in sync-from-source.
The official release job runs once every hour. If a new release exists, it pulls the changes from the source repository. Released changes are pushed to the chore/update-push-gha-metrics-action branch, with a PR titled "Update push-gha-metrics-action" targeting the main branch.
The snapshot release job runs once every hour. It synchronizes the source repository with the snapshot branch. This branch should be used for testing purposes.
Most of the alternatives listed at the end of this section were not used because their architectural decisions would cause friction for one or more of the following reasons:
- Requiring personal access tokens (PATs) for monitoring a repository
- For multi-repository monitoring over an org, requiring a PAT with org-level access
- Creating labels with unbounded cardinality, which can cause severe degradation of the metrics processing service
- Collecting metrics by mass querying the Github HTTP API, or by running large queries against Github's GraphQL API, both of which result in rate limiting
- Collecting workflow + job metrics in a separate workflow that is triggered when other workflows complete. This doubles the number of jobs executed in a repository, making it very expensive.
- Collecting workflow + job metrics with a standalone service. This means maintaining a long-lived service and protecting any secrets it needs to access Github's API.
- Metrics collection / querying / visualization revolving around a tech stack that isn't Prometheus / Loki / Grafana. Adopting another stack would add friction to the user experience compared with using what we already have.
- https://github.com/tchelovilar/github-org-runner-exporter
- https://github.com/transferwise/github-actions-api-exporter
- https://github.com/Spendesk/github-actions-exporter
- https://github.com/marketplace/actions/datadog-actions-metrics
- https://github.com/kaidotdev/github-actions-exporter
- https://github.com/cpanato/github_actions_exporter
- https://docs.datadoghq.com/continuous_integration/setup_pipelines/github/
Given the following workflow, a simplified sequence diagram is provided that shows us at what points github is updated with various timestamps. With these timestamps we can calculate useful durations as metrics.
name: A
on:
  workflow_dispatch:
  push:
  pull_request:
jobs:
  B:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: ./.github/actions/push-metrics/
        with:
          basic-auth: ${{ secrets.GRAFANA_INTERNAL_BASIC_AUTH }}
          hostname: ${{ secrets.GRAFANA_INTERNAL_HOST }}
          org-id: ${{ secrets.GRAFANA_INTERNAL_TENANT_ID }}
          this-job-name: B
      - run: echo "test test test"
  C:
    needs: [B]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: ./.github/actions/push-metrics/
        with:
          basic-auth: ${{ secrets.GRAFANA_INTERNAL_BASIC_AUTH }}
          hostname: ${{ secrets.GRAFANA_INTERNAL_HOST }}
          org-id: ${{ secrets.GRAFANA_INTERNAL_TENANT_ID }}
          this-job-name: C
      - run: echo "test test test"
sequenceDiagram
participant W as Workflow Run A
participant G as Github API
participant J1 as Workflow Run A: Job Run B
participant J2 as Workflow Run A: Job Run C
rect rgb(160,255,134)
Note Over G,W: Workflow A Initialization Phase
G->>G: Trigger for Workflow A Occurs
G->>W: Create Workflow Run A
W->>G: Update Workflow Run A's "created_at" property
par Queueing of Workflow Run A and its Job Constituents
G-->+W: Queue Workflow Run A
G-->+J1: Queue Job Run B
G-->+J2: Queue Job Run C
end
G->>G: Waits for Available Runner
G->>G: Gets Available Runner
W-->-G: De-queue Workflow Run A
end
G-->+W: Start Workflow Run A
W->>G: Update Workflow Run A's "updated_at" + "run_started_at" property
Note over J2,G: Job B has no Job dependencies, Job C Depends on Job B
G->>G: Selects jobs with no dependencies for workflow run A
J1-->-G: De-queue Job Run B
G-->+J1: Start Job Run B
rect rgb(191, 223, 255)
Note over J1: Job Run B Execution Phase
W->>G: Update Workflow Run A's "updated_at" property
J1->>G: Update Job Run B's "started_at" property
J1-->J1: Executes Step 1
J1-->J1: Executes Step N
J1-->J1: Executes Post-Step N
J1-->J1: Executes Post-Step 1
J1->>G: Update Job Run B's "completed_at" property
W->>G: Update Workflow Run A's "updated_at" property
end
J1-->-G: Finish Job Run B
J2-->-G: De-queue Job Run C
G-->+J2: Start Job Run C
rect rgb(200, 150, 255)
Note over J2: Job run C Execution Phase
W->>G: Update Workflow Run A's "updated_at" property
J2->>G: Update Job Run C's "started_at" property
J2-->J2: Executes Step 1
J2-->J2: Executes Step N
J2-->J2: Executes Post-Step N
J2-->J2: Executes Post-Step 1
J2->>G: Update Job Run C's "completed_at" property
W->>G: Update Workflow Run A's "updated_at" property
end
J2-->-G: Finish Job Run C
W->>G: Update Workflow Run A's "updated_at" property
W-->-G: Finish Workflow Run A
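As a rough illustration of how the timestamps in the diagram above turn into durations, the sketch below subtracts pairs of them. The sample values are made up, and reading the gap between "created_at" and "run_started_at" as the workflow's queue time is an interpretation of the diagram rather than a documented guarantee.

```python
from datetime import datetime, timezone


def parse_github_time(value: str) -> datetime:
    """Parse a Github API timestamp such as '2024-01-01T12:00:00Z' as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds elapsed from start to end."""
    return (parse_github_time(end) - parse_github_time(start)).total_seconds()


# Hypothetical timestamps for Workflow Run A and Job Run B.
workflow_run = {
    "created_at": "2024-01-01T12:00:00Z",
    "run_started_at": "2024-01-01T12:00:20Z",
}
job_run = {
    "started_at": "2024-01-01T12:00:25Z",
    "completed_at": "2024-01-01T12:03:25Z",
}

# Time the workflow run spent waiting before it started.
workflow_queue_seconds = seconds_between(
    workflow_run["created_at"], workflow_run["run_started_at"]
)
# Time the job spent executing its steps.
job_execution_seconds = seconds_between(job_run["started_at"], job_run["completed_at"])

print(workflow_queue_seconds, job_execution_seconds)  # prints: 20.0 180.0
```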
We gather additional context on the currently running job by querying the List Jobs for a Workflow Run Attempt endpoint, then filtering the returned jobs by their job name to find the currently executing one.
Unfortunately, the Github runner only reports the job id, not the job name that we need to find our job by. If you have jobs.<job_id>.name specified, you must pass that name to this action through the this-job-name parameter; see the Matrices section below for more info. Also, you must not specify the same job name for two different job ids. Otherwise the same situation will occur: the metrics collection action will fail to resolve the correct job.
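The name-based lookup, and the reason job names must be unique, can be sketched as below. This is an illustrative helper, not the action's code: `find_job_by_name` is a hypothetical name, and the sample list only mimics the `id` and `name` fields of the job objects returned by the API.

```python
def find_job_by_name(jobs: list[dict], job_name: str) -> dict:
    """Return the single job whose name matches, or fail loudly if ambiguous."""
    matches = [job for job in jobs if job.get("name") == job_name]
    if len(matches) != 1:
        # Zero matches: the name given does not match what the API reports.
        # Several matches: two job ids share one name, so the job is ambiguous.
        raise LookupError(
            f"expected exactly one job named {job_name!r}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical, trimmed-down job list for a workflow run attempt.
jobs = [{"id": 1, "name": "B"}, {"id": 2, "name": "C"}]

current_job = find_job_by_name(jobs, "C")
print(current_job["id"])  # prints: 2
```

Failing on ambiguity, instead of picking the first match, avoids silently attributing metrics to the wrong job.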
The name of the currently executing job given by the runner context does not align with the name of the job reported by the API. Even when dynamic job names make every job name unique, the discrepancy remains.
This is due to the Github runner reporting the currently executing job by its key within the workflow file, while the Github API reports the currently executing job by its name value, with the key value being the fallback if the name isn't defined.
This means the action cannot discover its own job state from the given context if the job being executed is part of a matrix. During matrix execution, an input value that gives the proper job name to look up by is needed.
jobs:
  my-parallel-job:
    # This generates a unique name within the matrix itself. Now we need to pass this value to this action so it can do a successful lookup
    name: my-unique-name-${{ matrix.key1 }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        key1: ["unique", "values"]
    steps:
      - uses: push-metrics
        with:
          # Now we use the same name here, so our metrics collection works properly
          this-job-name: my-unique-name-${{ matrix.key1 }}
For more information, see: actions/runner#852
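The choice of which name to look up can be summarized in a small sketch. It assumes, as the notes in this section suggest, that the action prefers the this-job-name input and otherwise falls back to the job key the runner exposes (available in the `GITHUB_JOB` environment variable). The function name and the fallback behavior are illustrative assumptions, not the action's confirmed logic.

```python
import os


def resolve_lookup_name(this_job_name, env=None):
    """Pick the job name to search for in the API response."""
    env = os.environ if env is None else env
    if this_job_name:
        # An explicit input wins; it is required whenever the job sets a custom
        # name, including names generated from matrix values.
        return this_job_name
    # Assumed fallback: the job key, which the API also reports when no
    # custom name is defined for the job.
    return env["GITHUB_JOB"]


# Hypothetical runner environment for the matrix example in this section.
fake_env = {"GITHUB_JOB": "my-parallel-job"}

print(resolve_lookup_name("my-unique-name-unique", fake_env))
print(resolve_lookup_name(None, fake_env))
```

For the matrix job, only the first call yields a name the API would report, since every matrix entry shares the same job key.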