Build Status: Improve automation #148

Open
amotl opened this issue Nov 4, 2024 · 10 comments · May be fixed by #222

amotl commented Nov 4, 2024

About

This is a proposal to improve the situation around the Build Status page on several fronts.

Q&A

@kneth asked the other day:

Is there a way we can automate checks to ensure that new projects will be flagged?

Originally posted by @kneth in #146 (review)

@amotl amotl changed the title Build Status: Add automation Build Status: Improve automation Nov 4, 2024

amotl commented Nov 4, 2024

There is also a gap in automation for existing projects, which causes more serious trouble: scheduled CI/GHA jobs regularly hibernate, for the reason shown below.

Important

Please make sure to also add this obstacle to your personal recurring checklist. 🙏

Image

kneth commented Nov 5, 2024

also a gap in automation for existing projects

There is a longer discussion about workarounds at https://stackoverflow.com/questions/67184368/prevent-scheduled-github-actions-from-becoming-disabled.
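
For illustration, one conceivable building block is GitHub's REST endpoint for enabling a workflow (`PUT /repos/{owner}/{repo}/actions/workflows/{workflow_id}/enable`). The sketch below only builds the request and does not send it. It is not necessarily what the linked discussion recommends; the function name `build_enable_request` is made up, and it assumes a token with sufficient permissions on the repository.

```python
import urllib.request

API = "https://api.github.com"


def build_enable_request(repo: str, workflow_file: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that re-enables a disabled workflow.

    `repo` is "owner/name"; `workflow_file` is the workflow's file name,
    which the endpoint accepts in place of the numeric workflow id.
    """
    url = f"{API}/repos/{repo}/actions/workflows/{workflow_file}/enable"
    return urllib.request.Request(
        url,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Usage (hypothetical): send the request with urllib.request.urlopen(request).
request = build_enable_request("crate/cratedb-examples", "ml-automl.yml", "<your-token>")
```

Whether re-enabling alone is enough to keep scheduled jobs alive is not something this sketch answers.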

amotl commented Nov 5, 2024

GitHub Workflow Immortality sounds nice, thanks! Now, who is going to implement it? ;]

/cc @bmunkholm

amotl commented Dec 5, 2024

Evaluating the outcome automatically, and making us aware of any significant events

@kneth suggested the most important detail here would be to automate the evaluation of the outcome from builds of multiple projects. @amotl agrees. Until we have it, he kindly agreed to volunteer to check the Build Status regularly, and act or dispatch accordingly.

Idea 1

We suggest aggregating the outcomes from multiple projects, and submitting a notification to Slack or the crate-alerts repository whenever an event needs attention, which is probably the right choice. This variant might need additional thought, because it has to account for flakiness and hysteresis.
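
To make the flakiness and hysteresis concern a bit more tangible, here is a minimal sketch, not a worked-out design. It assumes we only alert after a few consecutive failures and only clear the alert after a few consecutive successes. The names `AlertState` and `update` and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks consecutive outcomes for a single workflow."""
    failures: int = 0
    successes: int = 0
    alerting: bool = False


def update(state: AlertState, succeeded: bool,
           fail_threshold: int = 2, recover_threshold: int = 2) -> Optional[str]:
    """Feed one run outcome; return "alert", "recovered", or None."""
    if succeeded:
        state.successes += 1
        state.failures = 0
        # Only clear the alert after enough consecutive successes.
        if state.alerting and state.successes >= recover_threshold:
            state.alerting = False
            return "recovered"
    else:
        state.failures += 1
        state.successes = 0
        # Only raise the alert after enough consecutive failures,
        # so that a single flaky run does not notify anyone.
        if not state.alerting and state.failures >= fail_threshold:
            state.alerting = True
            return "alert"
    return None
```

With both thresholds at 2, an isolated failure between successful runs stays silent, and a notification is emitted only once per failure streak.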

Idea 2

An alternative to the push-notification-on-error would be to just post a summary of all failing jobs into a dedicated Slack channel each morning. That would be much easier, because it's a straightforward pipeline that does not need to define what a significant event is: Build Status | grep FAILED > summary.md | post-to-slack. Running that once a day is sufficient regarding granularity and promptness, because it is mostly about monitoring dependencies of downstream packages, and CrateDB Nightly itself, which is only released once a day.


In both cases, we need to pick a place to run the automation unattended. In a perfect world, everything would be self-contained within a single GHA workflow definition.
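
A rough sketch of the "filter and summarize" half of Idea 2, under stated assumptions: each run is a dictionary with `repo`, `name`, `html_url`, and `conclusion` keys (the `repo` key is made up for this example), and the summary is posted through a Slack incoming webhook, which accepts a JSON body with a `text` field. Actually sending the payload is left out.

```python
import json
from typing import Dict, List


def render_summary(runs: List[Dict[str, str]]) -> str:
    """Render all failed runs as a plain bullet list."""
    failed = [run for run in runs if run.get("conclusion") == "failure"]
    if not failed:
        return "All monitored workflows succeeded."
    lines = [f"{len(failed)} failing workflow run(s):"]
    for run in sorted(failed, key=lambda r: (r["repo"], r["name"])):
        lines.append(f"- {run['repo']}: {run['name']} ({run['html_url']})")
    return "\n".join(lines)


def slack_payload(text: str) -> bytes:
    """Encode the summary as the JSON body for an incoming webhook."""
    return json.dumps({"text": text}).encode("utf-8")
```

Running this once each morning from a scheduled GHA workflow would keep the whole pipeline in one place, as wished for above.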

@amotl amotl added the help wanted Extra attention is needed label Dec 5, 2024
@kneth kneth self-assigned this Dec 6, 2024

amotl commented Dec 6, 2024

Build Status | grep FAILED

This part of the pipeline would probably be best implemented as a little Python program that takes a list of direct URLs to arbitrary GHA workflow files as input, queries the GitHub API for the outcome of each, and returns a summary, optionally filtered by status?
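
As a first step towards such a program, a sketch of how a workflow URL could be mapped to the GitHub API endpoint that lists its runs (`/repos/{owner}/{repo}/actions/workflows/{file}/runs`). The helper name `runs_api_url` is hypothetical, and the pattern only covers the two URL shapes shown in the comments; branch names containing slashes are not handled.

```python
import re

# Accepts either of these URL shapes:
#   https://github.com/<owner>/<repo>/actions/workflows/<file>.yml
#   https://github.com/<owner>/<repo>/blob/<branch>/.github/workflows/<file>.yml
WORKFLOW_URL = re.compile(
    r"https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/"
    r"(?:actions/workflows|blob/[^/]+/\.github/workflows)/"
    r"(?P<file>[^/?#]+\.ya?ml)"
)


def runs_api_url(workflow_url: str) -> str:
    """Translate a workflow file URL into the API URL listing its runs."""
    match = WORKFLOW_URL.match(workflow_url)
    if match is None:
        raise ValueError(f"Not a recognized workflow URL: {workflow_url}")
    return (
        "https://api.github.com/repos/{owner}/{repo}"
        "/actions/workflows/{file}/runs"
    ).format(**match.groupdict())
```

The resulting URL accepts the same `event`, `status`, and `created` query parameters as the repository-wide runs endpoint.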

amotl commented Dec 7, 2024

GitHub repository names from Build Status Markdown

Currently, the source of the Build Status page is just Markdown. This little Bash function extracts a list of GitHub repository names. It uses HTTPie and Perl.

function build-status-gh-repos() {
  http --follow https://github.com/crate/crate-clients-tools/raw/refs/heads/main/docs/status.md | \
    grep "github.com.*yml" | perl -pe 's#.*github\.com/(.+?)/(.+?)/.*#\1/\2#' | sort | uniq
}

The command generates the list of all relevant public-facing repositories, including artefacts we or others are shipping that work well together with CrateDB or include relevant integration tests, but excluding crate/crate itself.

Demo

build-status-gh-repos
MagicStack/asyncpg
brianc/node-postgres
crate-workbench/dbt-cratedb2
crate/academy-fundamentals-course
crate/activerecord-crate-adapter
crate/crash
crate/crate-admin
crate/crate-dbal
crate/crate-java-testing
crate/crate-operator
crate/crate-pdo
crate/crate-python
crate/crate_ruby
crate/cratedb-airflow-tutorial
crate/cratedb-examples
crate/cratedb-prometheus-adapter
crate/cratedb-sqlparse
crate/cratedb-toolkit
crate/croud
crate/micropython-cratedb
crate/mlflow-cratedb
crate/pytest-cratedb
crate/sqlalchemy-cratedb
npgsql/npgsql
pgjdbc/pgjdbc
psycopg/psycopg
psycopg/psycopg2
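
For the eventual Python rewrite, the same extraction could look roughly like this. It is a sketch that mirrors the grep/perl recipe, with `extract_repos` as a made-up name. Fetching status.md is left out, and the sort order may differ from `sort` depending on your locale.

```python
import re
from typing import List

# A GitHub URL with at least owner/repo/ and a ".yml" somewhere after it,
# like the workflow badge links on the Build Status page.
REPO_PATTERN = re.compile(r"github\.com/([^/\s]+)/([^/\s]+)/\S*\.yml")


def extract_repos(markdown: str) -> List[str]:
    """Return the sorted, de-duplicated "owner/repo" names found in the text."""
    return sorted({f"{owner}/{repo}" for owner, repo in REPO_PATTERN.findall(markdown)})
```

Links that do not point at a workflow file, such as a plain repository URL, are ignored, just like with the shell version.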

GHA workflow status (per-repository)

This little function written in Bash acquires relevant information about recently failed scheduled workflow runs for a specific repository from GitHub's API. It uses HTTPie and jq, and does not require a GitHub authentication token for a few requests until it trips a rate limit.

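function gha-failures() {
  # Repository in "owner/name" form, e.g. crate/cratedb-examples.
  local repo="$1"
  # HTTPIE_OPTIONS is deliberately unquoted, so it splits into separate arguments.
  http ${HTTPIE_OPTIONS} "https://api.github.com/repos/${repo}/actions/runs" \
    event==schedule status==failure "created==:>=$(date -d 'yesterday' '+%Y-%m-%d')" | \
    jq '[ .workflow_runs[] | {name, display_title, path, html_url, status, conclusion, created_at} ]'
}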

Note

The parameter created needs to be populated dynamically in order to only receive recent occurrences. GitHub's API only accepts ISO formats here; specifying relative time deltas like -1 day, today, or yesterday is not possible, so the recipe uses the GNU date command. On macOS, you may need to use the gdate command instead of date in the recipe above.
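
In Python, the date arithmetic does not depend on GNU date at all. A small sketch follows, with made-up helper names. Note two assumptions: it computes "yesterday" in UTC, whereas `date -d yesterday` uses local time, and it passes the `created` value in the `>=YYYY-MM-DD` form of GitHub's search date syntax.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import urlencode


def since_yesterday(now: Optional[datetime] = None) -> str:
    """Return yesterday's date (UTC by default) as YYYY-MM-DD."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=1)).strftime("%Y-%m-%d")


def runs_query(repo: str, now: Optional[datetime] = None) -> str:
    """Build the URL for recently failed scheduled runs of a repository."""
    params = {
        "event": "schedule",
        "status": "failure",
        "created": f">={since_yesterday(now)}",
    }
    return f"https://api.github.com/repos/{repo}/actions/runs?{urlencode(params)}"
```

Passing `now` explicitly makes the function easy to test and to re-run for a past date.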

Demo

gha-failures crate/cratedb-examples
[
  {
    "name": "AutoML",
    "display_title": "AutoML",
    "path": ".github/workflows/ml-automl.yml",
    "html_url": "https://github.com/crate/cratedb-examples/actions/runs/12196847284",
    "status": "completed",
    "conclusion": "failure",
    "created_at": "2024-12-06T10:05:50Z"
  }
]
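
The jq projection translates to Python almost one-to-one. A sketch, assuming `payload` is the decoded JSON response of the runs endpoint; `summarize_runs` is a hypothetical name. Unlike the jq filter, it returns an empty list when `workflow_runs` is missing instead of failing.

```python
from typing import Any, Dict, List

# The same fields the jq filter above selects.
FIELDS = ("name", "display_title", "path", "html_url", "status", "conclusion", "created_at")


def summarize_runs(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reduce each workflow run to the handful of fields we care about."""
    return [
        {field: run.get(field) for field in FIELDS}
        for run in payload.get("workflow_runs", [])
    ]
```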

Optionally use authentication when running into rate limits. The token below is an (invalidated) personal access token (classic) with the "workflow" scope.

GITHUB_TOKEN=ghp_00r4G0tGxLDT5RGOwWNw7tZhFnK5fT0uMfoo
HTTPIE_OPTIONS="--auth-type bearer --auth ${GITHUB_TOKEN}"
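
The Python counterpart would send the same bearer token as an Authorization header. A minimal sketch, with `github_headers` as a made-up helper that falls back to anonymous access when no token is set:

```python
import os
from typing import Dict, Optional


def github_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, authenticated only if a token is available."""
    if token is None:
        # Fall back to the environment, like the shell recipe does.
        token = os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```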

Slack Notifications

A quick search turned up these GitHub <-> Slack integrations. If some of them are capable enough, we may even consider omitting the pre-aggregation/summarization step above.

Those seem to be the canonical ones.

Other people at Crate.io are already using GitHub's scheduled reminders, to converge relevant information into a Slack channel each morning at 9 am.

amotl commented Dec 7, 2024

All at once

The program cratedb-ecosystem-build-failures.sh includes both functions presented above, and can be used as the baseline for ad hoc explorations and for conducting orientation flights. Output is still rough, and improving it will probably need a rewrite in pure Python.

Install

wget https://gist.github.com/amotl/d7de8de01eafba602dd9c6bfa9cfac6d/raw/cratedb-ecosystem-build-failures.sh

Usage

The GITHUB_TOKEN is an (invalidated) personal access token (classic) with the "workflow" scope. Create your own at https://github.com/settings/tokens.

export GITHUB_TOKEN=ghp_00r4G0tGxLDT5RGOwWNw7tZhFnK5fT0uMfoo
export HTTPIE_OPTIONS="--auth-type bearer --auth ${GITHUB_TOKEN}"
bash cratedb-ecosystem-build-failures.sh

amotl commented Feb 10, 2025

Just sharing a little walkthrough that uses gh to inquire about job runs and their outcomes through the GitHub API from the command line.

List failed job runs

Recent failures, max. 20

gh run list --repo=crate/cratedb-toolkit --event=schedule --status=failure --limit=20
STATUS  TITLE            WORKFLOW         BRANCH  EVENT     ID           ELAPSED  AGE
X       Tests: Common    Tests: Common    main    schedule  13232774906  3m59s    about 18 hours ago
X       Tests: Common    Tests: Common    main    schedule  13221829399  3m52s    about 1 day ago
X       Tests: Common    Tests: Common    main    schedule  13211733709  4m1s     about 2 days ago
X       Tests: Common    Tests: Common    main    schedule  13192423949  3m56s    about 3 days ago
X       Tests: Common    Tests: Common    main    schedule  13023865908  2m37s    about 12 days ago
X       Tests: MongoDB   Tests: MongoDB   main    schedule  13023830389  1m45s    about 12 days ago
X       Tests: DynamoDB  Tests: DynamoDB  main    schedule  13023800407  2m39s    about 12 days ago
X       Tests: Common    Tests: Common    main    schedule  13002845328  2m38s    about 13 days ago
X       Tests: Common    Tests: Common    main    schedule  12981313751  2m44s    about 14 days ago
[...]

Recent failures, since yesterday

gh run list --repo=crate/cratedb-toolkit --event=schedule --status=failure --created='>2025-02-09'
STATUS  TITLE          WORKFLOW       BRANCH  EVENT     ID           ELAPSED  AGE
X       Tests: Common  Tests: Common  main    schedule  13232774906  3m59s    about 18 hours ago
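
If the tabular output needs to be processed further, gh can also emit JSON through its `--json` flag. A sketch of wrapping that in Python: the function names are hypothetical, the selected fields are just one plausible choice, and it assumes gh is installed and authenticated.

```python
import json
import subprocess
from typing import Any, Dict, List


def gh_run_list_command(repo: str, created: str, limit: int = 20) -> List[str]:
    """Assemble the `gh run list` invocation for failed scheduled runs."""
    return [
        "gh", "run", "list",
        f"--repo={repo}",
        "--event=schedule",
        "--status=failure",
        f"--created={created}",
        f"--limit={limit}",
        "--json=databaseId,workflowName,conclusion,createdAt,url",
    ]


def failed_runs(repo: str, created: str) -> List[Dict[str, Any]]:
    """Run gh and decode its JSON output (requires gh on the PATH)."""
    result = subprocess.run(
        gh_run_list_command(repo, created),
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

For example, `failed_runs("crate/cratedb-toolkit", ">2025-02-09")` corresponds to the command shown above.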

Inspect job run

Display jobs

gh run view --repo=crate/cratedb-toolkit 13232774906
X main Tests: Common · 13232774906
Triggered via schedule about 18 hours ago

JOBS
X Generic: Python 3.8 on OS ubuntu-latest in 3m13s (ID 36932526764)
  ✓ Set up job
  ✓ Initialize containers
  ✓ Acquire sources
  ✓ Set up Python
  ✓ Set up project
  X Run linter and software tests
  - Upload coverage to Codecov
  - Post Set up Python
  ✓ Post Acquire sources
  ✓ Stop containers
  ✓ Complete job

Display logs of failed workflow step

gh run view --repo=crate/cratedb-toolkit 13232774906 --log-failed
Generic: Python 3.12 on OS ubuntu-latest	Run linter and software tests	2025-02-10T03:14:31.7180415Z FAILED tests/retention/test_cli.py::test_run_reallocate - sqlalchemy.exc.ProgrammingError: (crate.client.exceptions.ProgrammingError) UnsupportedFeatureException[Joins do not support this operation]
Generic: Python 3.12 on OS ubuntu-latest	Run linter and software tests	2025-02-10T03:14:31.7181358Z [SQL:
Generic: Python 3.12 on OS ubuntu-latest	Run linter and software tests	2025-02-10T03:14:31.7181557Z WITH partitions AS (
Generic: Python 3.12 on OS ubuntu-latest	Run linter and software tests	2025-02-10T03:14:31.7181769Z
Generic: Python 3.12 on OS ubuntu-latest	Run linter and software tests	2025-02-10T03:14:31.7181945Z         SELECT
[...]

kneth commented Feb 11, 2025

It looks like using gh is a simpler solution than custom shell scripts. Thanks for sharing!

amotl commented Feb 11, 2025

Yeah, gh is sweet for quickly getting accustomed to the GitHub API. Then, converging this into Python code gets you beautiful Markdown reports quickly, without needing to fall back to arcane scripting again to glue things together.

About

rapporto qa inspects GitHub projects, finds GHA workflow runs that failed recently, and creates reports about them in Markdown format, which can be shared on Discourse, GitHub, or Slack. Find an example report on GitHub below, and on Slack per qa-bot 2025-02-10, 2025-02-11.

Usage

Download the file enumerating the curated repositories to inspect.

open https://github.com/crate/cratedb-github-summary/raw/refs/heads/main/cratedb-repositories-ecosystem.txt

Generate a report about CI runs that failed within the last 24 hours (now-24h).

export GITHUB_TOKEN=ghp_600VEZtdzinvalid7K2R86JTiKJAAp1wNwVP
python rapporto.py qa --repositories-file=cratedb-repositories-ecosystem.txt

Example

Open example report

QA report 2025-02-10

A report about GitHub Actions workflow runs that failed recently (now-24h).

Scheduled

Pull requests

Dynamic

@kneth kneth linked a pull request Feb 13, 2025 that will close this issue