
unable to create pod cgroup: slice was already loaded or has a fragment file #24010

Open
edsantiago opened this issue Sep 18, 2024 · 5 comments
Labels
flakes Flakes from Continuous Integration

Comments

@edsantiago
Member

[ Copy of https://github.com/containers/crun/issues/1560 ]

This is now the number one flake in parallel-podman-testing land. It is not manifesting in actual CI, only on my f40 laptop, and it's preventing me from parallelizing 250-systemd.bats:

not ok 204 |250| podman generate - systemd template no support for pod in 11179ms                                                                               
# tags: ci:parallel                                                                                                                                             
# (from function `bail-now' in file test/system/helpers.bash, line 192,                                                                                         
#  from function `die' in file test/system/helpers.bash, line 969,
#  from function `run_podman' in file test/system/helpers.bash, line 571,
#  in test file test/system/250-systemd.bats, line 264)
#   `run_podman pod create --name $podname' failed
#
# [14:04:02.496829054] $ bin/podman pod create --name p-t204-pqf7zzmu
# [14:04:12.662179170] Error: unable to create pod cgroup for pod b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8: creating cgroup user.slice/user-14904.slice/[email protected]/user.slice/user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice: Unit user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice was already loaded or has a fragment file.
# [14:04:12.664998385] [ rc=125 (** EXPECTED 0 **) ]
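
If you want to see systemd's view of the offending slice when this happens, the user manager can be queried directly (plain systemctl, nothing podman-specific; the unit name below is the one from this particular failure and will differ on every run):

$ systemctl --user list-units --all 'user-libpod_pod_*.slice'
$ systemctl --user status user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice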

The trigger is enabling parallel tests in 250-systemd.bats. It reproduces very quickly (80-90%) with file_tags=ci:parallel, but also reproduces (~40%) if I just do test_tags on the envar or systemd template tests. I have never seen this failure before adding tags to 250.bats, and have never seen it in any of the runs where I've removed the parallel tags from 250.bats. It is possible that service_setup() (which runs a bunch of systemctls) is to blame, but I am baffled as to how.
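
For reference, the tagging mentioned above is the standard bats-core tag mechanism. A minimal sketch of the two variants, with an illustrative test body rather than the real contents of 250-systemd.bats:

# At the top of the .bats file -- every test in the file gets the tag:
# bats file_tags=ci:parallel

# Or immediately before a single test -- only that test gets the tag:
# bats test_tags=ci:parallel
@test "illustrative parallel-tagged test" {
    local podname="p-illustrative-$RANDOM"
    run_podman pod create --name "$podname"
    run_podman pod rm -f "$podname"
}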

Kagi search finds containers/crun#1138 but that's OOM-related and I'm pretty sure nothing is OOMing.

Workaround is easy: don't parallelize 250.bats.

edsantiago added the flakes (Flakes from Continuous Integration) label on Sep 18, 2024
@giuseppe
Member

Just started looking into this. Is it safe to run multiple service_setup/service_cleanup calls in parallel?

@edsantiago
Member Author

There's one part that I'm suspicious of and need to fix: the global SERVICE_NAME. I am working on a fix for that. However, that should only affect multiple same-file tests running at once. The fragment flake happens even if only one test from this file is parallel-enabled.
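
Roughly the shape of the fix I'm working on, as an illustrative sketch only (this is not the real helpers.bash code): give each test its own service name, and install the unit file via an atomic rename so two parallel tests can never race on the same name.

# sketch: unique per-test name (BATS_TEST_NUMBER is set by bats), atomic install
SERVICE_NAME="podman_test_${BATS_TEST_NUMBER:-$$}_${RANDOM}"
unit_dir="${XDG_RUNTIME_DIR}/systemd/user"
mkdir -p "$unit_dir"
tmpfile=$(mktemp "${unit_dir}/.tmp.XXXXXX")
printf '[Service]\nExecStart=/bin/sleep 120\n' > "$tmpfile"
mv "$tmpfile" "${unit_dir}/${SERVICE_NAME}.service"   # rename(2) within one filesystem is atomic
systemctl --user daemon-reload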

@edsantiago
Member Author

The bug reproduces even with the most carefulest parallel-safe-paranoia I can write. And, still, even with only one test in the 250 file enabled.

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 24, 2024
Mostly just switch to safename. Rewrite setup() to guarantee
unique service file names, atomically created.

* IMPORTANT NOTE: enabling parallelization on these tests
  triggers containers#24010 ("fragment file" flake), but only on my
  f40 laptop. I have never seen the flake in Cirrus despite
  many many runs in containers#23275. I am submitting this for review
  and merging because even though _something_ is broken,
  this breakage is unlikely to affect our CI.

Signed-off-by: Ed Santiago <[email protected]>

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Based on a tip from the interwebz I ran systemctl --user reset-failed and the problem went away. Then it came back this week: systemctl --user list-units --failed showed a ton of results, and it turns out our system tests are leaking some sort of systemd cruft. Lots of leaks in quadlet, a few in the systemd tests, and some in the healthcheck tests that I can't figure out. I've got a patch for the first two in my pet parallel PR; will test and submit one of these days.
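
For anyone who wants to check their own machine, the exact commands (both stock systemctl):

$ systemctl --user list-units --failed     # shows the leaked/failed units piling up
$ systemctl --user reset-failed            # clears them, and the flake goes away again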
