Trilinos_PR_cuda-11.4.2-uvm-off PR build not running/submitted to CDash starting 2024-01-24 #12696
Comments
It appears that all PR GPU machines are down. @sebrowne has a ticket in.
@sebrowne and @achauphan, given the importance of these PR build machines, is it possible to have some type of monitoring system for them so that if they go down, someone is notified ASAP? There must be monitoring tools that can do this type of thing. (Or Jenkins should be able to do this, since it is trying to run these jobs; perhaps with the right Jenkins plugin, as described here?) I know all of this autotester and Jenkins stuff is going to be thrown away once Trilinos moves to GHA, but the same issues can occur with that process as well. The problem right now is that when something goes wrong with the Trilinos infrastructure, it is Trilinos developers who have to detect and report the problem. Problems with the infrastructure will occur from time to time (that is to be expected), but when they do, it would be good if the people maintaining the infrastructure could be notified directly rather than having to rely on Trilinos developers to detect and report problems like this.
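For what it's worth, a check like this could probably be prototyped without any new tooling. Below is a minimal sketch, assuming the standard Jenkins REST endpoint /computer/api/json and hypothetical JENKINS_URL/JENKINS_USER/JENKINS_TOKEN environment variables; it could run from cron, with whatever it prints mailed to the infrastructure team:

```python
#!/usr/bin/env python3
"""Minimal sketch of a Jenkins node-offline monitor (illustrative only).

Assumes the standard Jenkins REST API layout (/computer/api/json) and
hypothetical JENKINS_URL / JENKINS_USER / JENKINS_TOKEN environment
variables for the server location and credentials.
"""
import os
import requests

JENKINS_URL = os.environ.get("JENKINS_URL", "https://jenkins.example.gov")  # hypothetical
AUTH = (os.environ.get("JENKINS_USER", ""), os.environ.get("JENKINS_TOKEN", ""))


def offline_nodes():
    # /computer/api/json lists all agents, each with an 'offline' flag.
    resp = requests.get(f"{JENKINS_URL}/computer/api/json", auth=AUTH, timeout=30)
    resp.raise_for_status()
    return [c["displayName"] for c in resp.json()["computer"] if c.get("offline")]


if __name__ == "__main__":
    down = offline_nodes()
    if down:
        # Any output from this cron job could be emailed to the on-call list.
        print("Jenkins agents currently offline: " + ", ".join(sorted(down)))
```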
Agreed, I will bring this up at our retro next week to see if there is a reasonable solution we can set up in the interim before AT2. Currently, Jenkins does send an email when a node goes offline, which I had missed.
As a status update, all GPU nodes were brought back online this morning. One node has been manually taken back offline, as it was exhibiting very odd, poor performance and had picked up the first few jobs this morning.
FYI: It is not just the CUDA build that has failed to produce PR testing results on CDash; the gcc build has been affected as well.

NOTE: It would also be great to set up monitoring of CDash looking for missing PR build results. That is similar to looking for randomly failing tests (see TriBITSPub/TriBITS#600), but it does not require knowing the repo versions. It is, however, complicated by the challenge of grouping builds on CDash that are part of the same PR testing iteration (all you have to go on is the Build Start Time, which differs from build to build but is typically within a few minutes across one iteration). I suggested this in this internal comment.
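The grouping-by-start-time part described above could look roughly like the following sketch. The 10-minute window, the sample timestamps, and the EXPECTED set of PR build names are illustrative assumptions; CDash does not provide this grouping directly:

```python
from datetime import datetime, timedelta


def group_pr_iterations(builds, window=timedelta(minutes=10)):
    """Group (build_name, build_start_time) pairs whose start times fall
    within `window` of the previous build, approximating 'builds that
    belong to the same PR testing iteration'."""
    builds = sorted(builds, key=lambda b: b[1])
    groups = []
    for name, start in builds:
        if groups and start - groups[-1][-1][1] <= window:
            groups[-1].append((name, start))
        else:
            groups.append([(name, start)])
    return groups


# Illustrative (not exhaustive) set of PR build names expected in every iteration.
EXPECTED = {"Trilinos_PR_cuda-11.4.2-uvm-off", "Trilinos_PR_clang-11.0.1"}


def missing_builds(groups):
    """Report expected PR build names absent from each grouped iteration."""
    report = []
    for i, grp in enumerate(groups):
        missing = EXPECTED - {name for name, _ in grp}
        if missing:
            report.append((i, sorted(missing)))
    return report


if __name__ == "__main__":
    # Example input, e.g. parsed from a CDash query for one day's PR builds.
    sample = [
        ("Trilinos_PR_clang-11.0.1", datetime(2024, 1, 25, 8, 2)),
        ("Trilinos_PR_cuda-11.4.2-uvm-off", datetime(2024, 1, 25, 8, 5)),
        ("Trilinos_PR_clang-11.0.1", datetime(2024, 1, 25, 11, 30)),
    ]
    for idx, missing in missing_builds(group_pr_iterations(sample)):
        print(f"Iteration {idx}: missing {', '.join(missing)}")
```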
PR #12707 has been hit with a couple of issues, in the gcc build mentioned above as well as in this build. That was on ascic166; the node ran out of memory.
CC: @trilinos/framework, @sebrowne, @achauphan
Description
As shown in this query, the Trilinos PR build Trilinos_PR_cuda-11.4.2-uvm-off has not posted full results to CDash since early yesterday (2024-01-24). Yet many PR iterations have run and posted to CDash in that time, as shown in this query for the Trilinos_PR_clang-11.0.1 PR build, for example. That is a bunch of PRs that are not passing their PR test iterations and will not be getting merged. (This explains why it took so long for the autotester to run on my new PR #12695.)
Looks like this has so far impacted the PRs: