[ci] Azure Mariner CI jobs regularly failing: "File not found: 'docker'" #6316
Comments
2 more on #6331:
And one on #6019:
I will investigate this. Thanks for creating the issue.
Thanks! I'll keep sharing these links, but will stop if you tell me it's not necessary any more. Saw many more in the last 24 hours:
Since we made this change in #6222 (December 19, 2023), it definitely seems that the rate of pipeline failures on Azure DevOps has increased. There's some data on that available at https://dev.azure.com/lightgbm-ci/lightgbm-ci/_pipeline/analytics/stageawareoutcome?definitionId=1&contextType=build. Over the last 30 days, for example, 62% of all LightGBM's jobs on Azure DevOps have failed, and 30% of those have been in the
I'm not going to post as many of these since I think it's clear at this point that there's an issue. But I want to share this from a run on #6341 about 16 hours ago... 8 of 18 jobs on the same run failed at initialization with the error reported in the original post in this issue.
In addition to these frequent failures, over the last few days I've observed what looks like significantly reduced capacity. For example, the Linux tasks in this build have all been stuck in "queued". @shiyu1994 could you look into that? Is LightGBM competing with other pipelines for capacity?
I just restarted the jobs on #6364 and saw all the Linux jobs on Azure DevOps get queued, with messages like this:
Some jobs have been stuck in "queued", waiting to be picked up, for more than 4 days. @shiyu1994 can you please look into this? Is LightGBM competing with other projects, or is Azure's capacity for these types of VMs just very limited? We really rely heavily on these Linux jobs on Azure DevOps and this disruption is blocking development on the project.
I've added the …
#6394 was opened about 12 hours ago, for example, and all of its Linux CI jobs are stuck in "queued". These jobs just not being run at all is different from the other issue reported in the first post here (the "File not found: 'docker'" error).
The VM scale set on Azure for the CI has failed. I'm fixing this.
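For anyone checking on this, a minimal sketch of how the scale set's health could be inspected with the Azure CLI — the resource group and scale set names below are placeholders, not the project's actual resources:

```bash
# Placeholder names ("lightgbm-ci-rg" / "lightgbm-ci-vmss"); the real resource
# group and scale set backing these agents are not shown in this thread.

# Overall provisioning state of the scale set
az vmss show --resource-group lightgbm-ci-rg --name lightgbm-ci-vmss \
  --query "provisioningState"

# Per-instance status, to see which agent VMs failed to come up
az vmss get-instance-view --resource-group lightgbm-ci-rg --name lightgbm-ci-vmss \
  --instance-id "*"
```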
Thanks! Whenever it's fixed, I can take care of rebuilding + merging all the already-approved PRs.
Over the last 2 weeks at least, I have not seen the "File not found: 'docker'" issue a single time! 🎉 It's becoming much more common that all of CI across all providers passes with 0 manual re-runs, as just happened on a recent run. I wonder if some combination of #6407, #6416, and other fixes made by Azure has contributed to stabilizing this. Either way, I think we can close this issue for now and re-open it if the problems come up again. Thanks for all your help @shiyu1994!!!
Description
Since switching the `Linux` CI jobs at Azure DevOps to Mariner Linux in #6222, we've seen an increased rate of Azure DevOps jobs failing, typically with an error like `File not found: 'docker'` in the "initialize job" stage.
Creating this issue to track those cases.
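As a rough illustration of the missing dependency (a sketch, assuming the agents are Mariner-based images and that these jobs run as Azure Pipelines container jobs, which need the docker CLI on the agent at "Initialize job"), a check and fix on such an image might look like this — the moby package names are an assumption, not taken from this issue:

```bash
# Sketch: verify that the docker CLI needed at "Initialize job" exists on the
# agent, and install it via Mariner's package manager if it is missing.
# moby-engine / moby-cli are Mariner's Docker packages (assumption).
if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH"
    sudo tdnf install -y moby-engine moby-cli
    sudo systemctl enable --now docker
fi
docker --version
```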
Reproducible example
A few recent cases:
- `Linux inference` (build link)
- `Linux sdist` (build link)
- `Linux_latest bdist` (build link)
- `Linux_latest gpu_pip` (build link)

All had similar logs, with the same `File not found: 'docker'` error quoted in the title of this issue.
Environment info
N/A
Additional Comments
Pulling this into its own issue (original conversation started in #6307 (comment)).