Make nightly failures more visible to developers #127
Comments
Additional points from offline discussion:
I'm generally supportive of making nightly test failures harder to ignore, and I think blocking PR CI is an effective tool for that. Support this! When this rolls out, let's be vigilant in
Contributes to rapidsai/build-planning#127 Relies on rapidsai/shared-actions#32 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) URL: #17596
What's the strategy if CI has been failing for more than a week and we need an urgent fix? For example, with the holidays coming up, if some upstream package breaks CI just as everyone goes out the door, it will be hard to get a fix merged when we come back.
The rest of CI will run even if the nightly job fails, so we can request admin merges if we see that a PR is otherwise passing CI.
Contributes to rapidsai/build-planning#127 This PR cannot be merged unless nightly CI has passed within the past 7 days, so if it remains unmerged that will itself be an indication that nightly CI needs fixing. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) URL: #1508
Currently, when the RAPIDS nightly runs fail, we rely on developers to actively monitor either the GHA tab or the Slack channels where we post these results. As a result, some projects have their nightly CI broken for long periods of time, often indicating real bugs that go unfixed until release (or, in the worst case, never). To improve this situation, I propose that we introduce an extra check to our PR CI that verifies how long it has been since the nightly CI job last passed; if it has been too long (by some metric), we block PR merging by failing the check. This check will force more developers to be aware of the failures and deal with them proactively.
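To make the proposed check concrete, here is a minimal sketch of the decision logic in Python. It is not the implementation in rapidsai/shared-actions; the function name `nightly_is_healthy`, the `(conclusion, finished_at)` tuple shape, and the choice to fail when no nightly runs are found are all assumptions made for illustration. The seven-day default mirrors the window mentioned elsewhere in this thread.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def nightly_is_healthy(
    runs: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> bool:
    """Return True if some nightly run succeeded within `max_age` of `now`.

    `runs` is a hypothetical list of (conclusion, finished_at) pairs, e.g. as
    might be collected from a CI provider's run history.
    """
    # Keep only the completion times of successful runs.
    successes = [finished for conclusion, finished in runs if conclusion == "success"]
    if not successes:
        # Assumption: with no recorded success at all, treat nightly CI as broken.
        return False
    # The check passes only if the most recent success is recent enough.
    return now - max(successes) <= max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    history = [
        ("failure", now - timedelta(days=1)),
        ("success", now - timedelta(days=9)),
    ]
    # The PR check would exit non-zero here, blocking the merge.
    raise SystemExit(0 if nightly_is_healthy(history, now) else 1)
```

Because this would run as a separate PR check, the remaining CI jobs still report their own status, which is what makes an admin merge possible when only the nightly check is red.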