
unable to create pod cgroup: slice was already loaded or has a fragment file #24010

Open
edsantiago opened this issue Sep 18, 2024 · 5 comments
Labels
flakes Flakes from Continuous Integration

Comments

@edsantiago
Member

[ Copy of https://github.com/containers/crun/issues/1560 ]

This is now the number one flake in parallel-podman-testing land. It is not manifesting in actual CI, only on my f40 laptop, and it's preventing me from parallelizing 250-systemd.bats:

not ok 204 |250| podman generate - systemd template no support for pod in 11179ms                                                                               
# tags: ci:parallel                                                                                                                                             
# (from function `bail-now' in file test/system/helpers.bash, line 192,                                                                                         
#  from function `die' in file test/system/helpers.bash, line 969,
#  from function `run_podman' in file test/system/helpers.bash, line 571,
#  in test file test/system/250-systemd.bats, line 264)
#   `run_podman pod create --name $podname' failed
#
# [14:04:02.496829054] $ bin/podman pod create --name p-t204-pqf7zzmu
# [14:04:12.662179170] Error: unable to create pod cgroup for pod b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8: creating cgroup user.slice/user-14904.slice/[email protected]/user.slice/user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice: Unit user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice was already loaded or has a fragment file.
# [14:04:12.664998385] [ rc=125 (** EXPECTED 0 **) ]
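
If you want to see systemd's view of the offending slice when this happens, the user manager can be queried directly (plain systemctl, nothing podman-specific; the unit name below is the one from this particular failure and will differ on every run):

$ systemctl --user list-units --all 'user-libpod_pod_*.slice'
$ systemctl --user status user-libpod_pod_b88d54dc0463b4ce73430d04142df6c78b53facc773352d2974fd16e135d6fd8.slice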

The trigger is enabling parallel tests in 250-systemd.bats. It reproduces very quickly (80-90%) with file_tags=ci:parallel, but also reproduces (~40%) if I just do test_tags on the envar or systemd template tests. I have never seen this failure before adding tags to 250.bats, and have never seen it in any of the runs where I've removed the parallel tags from 250.bats. It is possible that service_setup() (which runs a bunch of systemctls) is to blame, but I am baffled as to how.
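
For reference, the tagging mentioned above is the standard bats-core tag mechanism. A minimal sketch of the two variants, with an illustrative test body rather than the real contents of 250-systemd.bats:

# At the top of the .bats file -- every test in the file gets the tag:
# bats file_tags=ci:parallel

# Or immediately before a single test -- only that test gets the tag:
# bats test_tags=ci:parallel
@test "illustrative parallel-tagged test" {
    local podname="p-illustrative-$RANDOM"
    run_podman pod create --name "$podname"
    run_podman pod rm -f "$podname"
}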

Kagi search finds containers/crun#1138 but that's OOM-related and I'm pretty sure nothing is OOMing.

Workaround is easy: don't parallelize 250.bats.

edsantiago added the flakes (Flakes from Continuous Integration) label on Sep 18, 2024
@giuseppe
Member

Just started looking into this. Is it safe to run multiple service_setup/service_cleanup calls in parallel?

@edsantiago
Member Author

There's one part that I'm suspicious of and need to fix: the global SERVICE_NAME. I am working on a fix for that. However, that should only affect multiple same-file tests running at once. The fragment flake happens even if only one test from this file is parallel-enabled.
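
Roughly the shape of the fix I'm working on, as an illustrative sketch only (this is not the real helpers.bash code): give each test its own service name, and install the unit file via an atomic rename so two parallel tests can never race on the same name.

# sketch: unique per-test name (BATS_TEST_NUMBER is set by bats), atomic install
SERVICE_NAME="podman_test_${BATS_TEST_NUMBER:-$$}_${RANDOM}"
unit_dir="${XDG_RUNTIME_DIR}/systemd/user"
mkdir -p "$unit_dir"
tmpfile=$(mktemp "${unit_dir}/.tmp.XXXXXX")
printf '[Service]\nExecStart=/bin/sleep 120\n' > "$tmpfile"
mv "$tmpfile" "${unit_dir}/${SERVICE_NAME}.service"   # rename(2) within one filesystem is atomic
systemctl --user daemon-reload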

@edsantiago
Member Author

The bug reproduces even with the most carefulest parallel-safe-paranoia I can write. And, still, even with only one test in the 250 file enabled.

edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 24, 2024
Mostly just switch to safename. Rewrite setup() to guarantee
unique service file names, atomically created.

* IMPORTANT NOTE: enabling parallelization on these tests
  triggers containers#24010 ("fragment file" flake), but only on my
  f40 laptop. I have never seen the flake in Cirrus despite
  many many runs in containers#23275. I am submitting this for review
  and merging because even though _something_ is broken,
  this breakage is unlikely to affect our CI.

Signed-off-by: Ed Santiago <[email protected]>

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Based on a tip from the interwebz I ran systemctl --user reset-failed and the problem went away. Then it came back this week: systemctl --user list-units --failed showed a ton of results, and it turns out our system tests are leaking some sort of systemd cruft. Lots of leaks in quadlet, a few in the systemd tests, and some in the healthcheck tests that I can't figure out. I've got a patch for the first two in my pet parallel PR; will test and submit one of these days.
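
For anyone who wants to check their own machine, the exact commands (both stock systemctl):

$ systemctl --user list-units --failed     # shows the leaked/failed units piling up
$ systemctl --user reset-failed            # clears them, and the flake goes away again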
