[ci] fix linux runners running out of disk space (fixed #6635) #6636

jameslamb · 2024-09-02T02:54:39Z

It seems that CI jobs on the self-hosted linux runner pool in Azure DevOps are failing because those runners don't have enough free disk space.

I found that around 55% of the disk space on those runners is occupied by no-longer-used container images (#6635 (comment)). This proposes introducing one new CI job that deletes images that are more than 30 days old and not currently in use by any containers.
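As a rough sketch of what such a cleanup could look like (not necessarily the exact commands in this PR's `.vsts-ci.yml`), `docker image prune` accepts an `until` filter expressed as a duration. The function names `until_filter` and `prune_old_images` below are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash

# Convert a retention period in days into the "until=<hours>h" form
# accepted by docker's prune filters (30 days -> "until=720h").
until_filter() {
    local days="$1"
    echo "until=$(( days * 24 ))h"
}

# Remove images older than the given number of days that are not
# used by any container. Note: "until" compares against the image's
# creation timestamp, not the time it was pulled onto the host.
prune_old_images() {
    local days="$1"
    # --all:   consider every unused image, not only dangling ones
    # --force: skip the interactive confirmation prompt
    docker image prune --all --force --filter "$(until_filter "${days}")"
}

# Show the command that would be run for a 30-day retention period.
echo "docker image prune --all --force --filter $(until_filter 30)"
```

Calling `prune_old_images 30` on a runner would then perform the actual deletion; images referenced by any existing container are left alone.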

Notes for Reviewers

Why not run this cleanup on every CI job?

It'd add around 30-45 seconds to every run of every Azure DevOps linux CI job.

I don't think that's necessary... running this cleanup once per CI run (on one randomly-assigned runner in the self-hosted pool) should hopefully be enough to prevent the disk space from filling up again.

I'm not sure how many total runners there are in the pool introduced in #6407 ... @shiyu1994 could you tell us? That would help with understanding how many runs might be required to clean up every node at least once.
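For a rough estimate, assume (this is an assumption, not something confirmed in this thread) that each run's cleanup job lands on one of $N$ runners uniformly at random and independently of earlier runs. The expected number of runs $T$ needed to reach every runner at least once is then the classic coupon-collector result:

$$
E[T] = N \sum_{k=1}^{N} \frac{1}{k} = N H_N
$$

For a hypothetical pool of $N = 10$ runners, $H_{10} \approx 2.93$, so roughly 29 runs would be expected before every runner has been cleaned once. If runner assignment is not uniform, the real number could be noticeably higher.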

How to test this

On any Azure DevOps run, check the warnings tab

(example build link)

Over the next few days, we should see the number of such warnings decrease... and eventually see 0 warnings related to disk usage on the linux runners.

Also check the output of the Maintenance job.

(example build link)

The end of those runs should show only 2 container images on the host... the one used on the Linux jobs and the one used on the Linux_latest jobs.

Over the next few days, we should see log messages like "Total reclaimed space: 14.12GB". Those should eventually stop showing up, as all the runners are cleaned up.
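A minimal sketch of how these checks could be done from a shell, assuming the job logs contain docker's standard "Total reclaimed space" line. The helper names `reclaimed_space` and `show_disk_state` are hypothetical and are not part of this PR.

```shell
#!/usr/bin/env bash

# Read log text on stdin and print only the size from any
# "Total reclaimed space: <size>" line.
reclaimed_space() {
    sed -n 's/^.*Total reclaimed space: *//p'
}

# Print disk usage for the root filesystem and, if docker is
# available, the images on the host and docker's own usage summary.
show_disk_state() {
    df -h /
    if command -v docker >/dev/null 2>&1; then
        docker image ls
        docker system df
    fi
}

# Example: extract the size from a sample log line.
echo "Total reclaimed space: 14.12GB" | reclaimed_space
# prints: 14.12GB
```

Running `show_disk_state` before and after the cleanup gives a quick before/after comparison on a single runner.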

So is this job temporary?

No, I'm proposing this as a permanent addition to LightGBM's CI.

That way, we'll automatically be protected against disk-space issues in situations like the following:

switching to a new image for ubuntu-latest (via new Ubuntu version or just security patches pushed to the official Ubuntu images)
changes to the manylinux image (e.g. via new commits to https://github.com/guolinke/lightgbm-ci-docker)

.vsts-ci.yml

jameslamb · 2024-09-02T20:33:55Z

Thanks very much for the review @borchero !

get some diagnostic information about the linux runners

87d8013

jameslamb added the "in progress", "blocking", and "maintenance" labels Sep 2, 2024

jameslamb added 17 commits September 1, 2024 21:56

do not run in container

627c481

running directly on the host is not supported?

c7cad85

try docker commands

968bb60

maybe /tmp/docker cannot be accessed in bash script actions

2400f16

syntax

9c2c7ef

get more details on filesystem

c4c35b8

investigate /__t more

90aba4b

look at more directories

2671dc3

add a job to do routine maintenance

3da2516

choose a pool

2415711

fix container

fb961f8

fix syntax

249f208

try re-enabling CI jobs

5121fa2

revert setup.sh changes

4ac5f8b

add sudo setup back

52c1000

re-enable all CI

3f55849

print sizes in megabytes

ac03c1a

jameslamb changed the title ~~WIP: [ci] fix linux runners running out of disk space~~ [ci] fix linux runners running out of disk space (fixed #6635) Sep 2, 2024

jameslamb added the "awaiting review" label and removed the "in progress" label Sep 2, 2024

jameslamb marked this pull request as ready for review September 2, 2024 05:30

jameslamb requested review from guolinke, shiyu1994, jmoralez, borchero and StrikerRUS as code owners September 2, 2024 05:30

borchero reviewed Sep 2, 2024

.vsts-ci.yml

jameslamb requested a review from borchero September 2, 2024 16:31

borchero approved these changes Sep 2, 2024

jameslamb removed the "awaiting review" and "blocking" labels Sep 2, 2024

jameslamb merged commit b9a2262 into master Sep 2, 2024
44 checks passed

jameslamb deleted the ci/investigate-ci branch September 2, 2024 20:34

jameslamb mentioned this pull request Sep 2, 2024

[ci] prevent Python tests from leaving behind files #6626

Merged
