[ci] fix linux runners running out of disk space (fixed #6635) #6636

Merged: 18 commits, Sep 2, 2024

@jameslamb (Collaborator) commented Sep 2, 2024

fixes #6635

It seems that CI jobs on the self-hosted linux runner pool in Azure DevOps are failing because those runners don't have enough free disk space.

I found that around 55% of the disk space on those runners is occupied by no-longer-used container images (#6635 (comment)). This proposes introducing one new CI job that deletes images that are more than 30 days old and not currently in use by any containers.
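As a rough illustration of that cleanup rule (not the exact contents of this PR's `.vsts-ci.yml` change), here is a minimal sketch. It assumes the cleanup is done with `docker image prune`, where `--all` removes images not used by any container and the `until` filter limits removal to images created before the given age. The function names `build_prune_command` and `prune_old_images` are hypothetical.

```python
import subprocess
from typing import List


def build_prune_command(max_age_days: int = 30) -> List[str]:
    """Build a `docker image prune` command line.

    Assumption: the command and flags are illustrative and not copied from the PR.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    max_age_hours = max_age_days * 24  # the `until` filter accepts durations such as "720h"
    return [
        "docker", "image", "prune",
        "--all",    # also remove tagged images that no container is using
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",
    ]


def prune_old_images(max_age_days: int = 30) -> str:
    """Run the prune command on the current host and return its output (requires Docker)."""
    result = subprocess.run(
        build_prune_command(max_age_days),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Only `prune_old_images()` touches Docker; `build_prune_command()` is a pure function, so the 30-day threshold can be checked without a Docker daemon.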

Notes for Reviewers

Why not run this cleanup on every CI job?

It'd add around 30-45 seconds to every run of every Azure DevOps linux CI job.

I don't think that's necessary... running this cleanup once per CI run (on one randomly-assigned runner in the self-hosted pool) should hopefully be enough to prevent the disk space from filling up again.

I'm not sure how many total runners there are in the pool introduced in #6407 ... @shiyu1994 could you tell us? That would help with understanding how many runs might be required to clean up every node at least once.
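To give a feel for that number, here is a back-of-the-envelope sketch. It assumes each CI run's cleanup lands on a runner chosen uniformly at random and independently of earlier runs, which may not match how Azure DevOps actually assigns agents. Under that assumption this is the classic coupon-collector problem, and the expected number of runs to hit every one of N runners is N * (1 + 1/2 + ... + 1/N). The pool sizes in the loop are placeholders, since the real size is the open question above.

```python
def expected_runs_to_clean_all(num_runners: int) -> float:
    """Expected number of CI runs until every runner has been cleaned at least once.

    Assumption: each run picks one runner uniformly at random (coupon-collector model).
    """
    if num_runners < 1:
        raise ValueError("num_runners must be at least 1")
    harmonic = sum(1.0 / k for k in range(1, num_runners + 1))
    return num_runners * harmonic


# Hypothetical pool sizes, for illustration only.
for n in (2, 5, 10):
    print(f"{n} runners -> about {expected_runs_to_clean_all(n):.1f} runs")
```

For example, a pool of 10 runners would need about 29 runs on average under this model, so a few days of normal CI activity would plausibly be enough.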

How to test this

On any Azure DevOps run, check the warnings tab

Screenshot 2024-09-02 at 12 20 28 AM

(example build link)

Over the next few days, we should see the number of such warnings decrease... and eventually see 0 warnings related to disk usage on the linux runners.

Also check the output of the Maintenance job.

Screenshot 2024-09-02 at 12 22 27 AM

(example build link)

The end of those runs should show only 2 container images on the host... the one used on the Linux jobs and the one used on the Linux_latest jobs.

Over the next few days, we should see log messages like "Total reclaimed space: 14.12GB". Those should eventually stop showing up, as all the runners are cleaned up.
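If it helps with tracking that over several runs, here is a small sketch for pulling those numbers out of downloaded job logs. The helper name `reclaimed_bytes` is hypothetical, and the unit table assumes Docker reports sizes in decimal units (1 GB = 1000^3 bytes).

```python
import re
from typing import List

# Matches lines such as "Total reclaimed space: 14.12GB" or "Total reclaimed space: 0B".
_RECLAIMED = re.compile(
    r"Total reclaimed space:\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)"
)

# Assumption: decimal (base-1000) units.
_UNIT_BYTES = {
    "B": 1,
    "kB": 1000,
    "KB": 1000,
    "MB": 1000**2,
    "GB": 1000**3,
    "TB": 1000**4,
}


def reclaimed_bytes(log_text: str) -> List[float]:
    """Return the reclaimed space, in bytes, for every matching line in a log."""
    values = []
    for number, unit in _RECLAIMED.findall(log_text):
        if unit not in _UNIT_BYTES:
            continue  # skip units this sketch does not know about
        values.append(float(number) * _UNIT_BYTES[unit])
    return values
```

Once every value returned for recent Maintenance jobs is zero, or no matching line is found at all, the runners that handled those jobs had nothing left to clean up.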

So is this job temporary?

No, I'm proposing this as a permanent addition to LightGBM's CI.

That way, we'll automatically be protected against disk-space issues in situations like the following:

  • switching to a new image for ubuntu-latest (via new Ubuntu version or just security patches pushed to the official Ubuntu images)
  • changes to the manylinux image (e.g. via new commits to https://github.com/guolinke/lightgbm-ci-docker)

@jameslamb jameslamb changed the title WIP: [ci] fix linux runners running out of disk space [ci] fix linux runners running out of disk space (fixed #6635) Sep 2, 2024
@jameslamb jameslamb marked this pull request as ready for review September 2, 2024 05:30
@jameslamb jameslamb requested a review from borchero September 2, 2024 16:31
@jameslamb (Collaborator, Author) commented

Thanks very much for the review @borchero !

@jameslamb jameslamb merged commit b9a2262 into master Sep 2, 2024
44 checks passed
@jameslamb jameslamb deleted the ci/investigate-ci branch September 2, 2024 20:34
Development

Successfully merging this pull request may close these issues.

[ci] AzureDevops jobs failing: NoSpaceLeftError: No space left on devices.