CRIU failed to restore LXC container in Ubuntu 22.04 #2421

Open
alexfrolov opened this issue Jun 18, 2024 · 4 comments

@alexfrolov

Description

CRIU fails to restore an LXC container.

Steps to reproduce the issue:

  1. Install criu (v3.16.1) and lxc (5.0.0) on Ubuntu 22.04.4 (kernel 5.15.0-112-generic)
  2. Create and start an LXC container running Ubuntu Xenial:
sudo lxc-create --name=u2 --template=download -- --dist ubuntu --release xenial --arch amd64
sudo lxc-start u2
  3. Create a checkpoint via lxc-checkpoint:
sudo lxc-checkpoint -s -n u2 -D /tmp/u2 -v

or by running CRIU directly from the shell:

/usr/sbin/criu.orig dump --tcp-established --file-locks --link-remap --manage-cgroups=full --ext-mount-map auto --enable-external-sharing --enable-external-masters --enable-fs hugetlbfs --enable-fs tracefs -D /tmp/u2 -o /tmp/u2/dump.log --cgroup-root cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:lxc.payload.u2 -v4 --ext-mount-map /sys/fs/fuse/connections:sys/fs/fuse/connections -t $APP_PID --skip-in-flight --freeze-cgroup /sys/fs/cgroup/lxc.payload.u2
  4. Restore from the checkpoint via lxc-checkpoint:
sudo lxc-checkpoint -r -n u2 -D /tmp/u2 -v

or by running CRIU directly from the shell:

sudo /usr/sbin/criu.orig restore --tcp-established --file-locks --link-remap --manage-cgroups=full --ext-mount-map auto --enable-external-sharing --enable-external-masters --enable-fs hugetlbfs --enable-fs tracefs -D /tmp/u2 -o /tmp/u2/restore.log --cgroup-root cpuset,cpu,io,memory,hugetlb,pids,rdma,misc:lxc.payload.u2 -v4 --ext-mount-map sys/fs/fuse/connections:/sys/fs/fuse/connections --root /usr/lib/x86_64-linux-gnu/lxc --restore-detached --restore-sibling --ext-mount-map console: --external veth[eth0]:vethnH9tQz@lxcbr0

Describe the results you received:

The restore fails with:

...
(00.022063)      1: mnt: Start with 0:/tmp/.criu.mntns.MwBUXU
(00.022080)      1: mnt: Start with 0:/tmp/.criu.mntns.MwBUXU
(00.022368)      1: mnt: Start with 0:/tmp/.criu.mntns.MwBUXU
(00.022374)      1: mnt:        Mounting **unsupported** @/tmp/.criu.mntns.MwBUXU/13-0000000000/ (0)
(00.022383)      1: mnt: 745:/tmp/.criu.mntns.MwBUXU/13-0000000000/ private 0 shared 0 slave 1
(00.022393)      1: mnt:        Mounting tmpfs @/tmp/.criu.mntns.MwBUXU/13-0000000000/dev (0)
(00.022400)      1: Error (criu/mount.c:1979): mnt: Unable to mount none /tmp/.criu.mntns.MwBUXU/13-0000000000/dev (id=746): No such file or directory
(00.022404)      1: Error (criu/mount.c:2044): mnt: Can't mount at /tmp/.criu.mntns.MwBUXU/13-0000000000/dev: No such file or directory
(00.022406)      1: mnt: Start with 0:/tmp/.criu.mntns.MwBUXU
(00.025367) Error (criu/mount.c:3385): mnt: Can't remove the directory /tmp/.criu.mntns.MwBUXU: Device or resource busy
(00.025376) Error (criu/cr-restore.c:2447): Restoring FAILED.
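
To pull only the failing lines out of a long restore.log, something like the following could be used. This is a minimal sketch: the function name criu_errors is hypothetical (it is not part of CRIU), and the pattern assumes the line layout seen in the log above, where the PID column before "Error" is optional.

```shell
# criu_errors LOGFILE
# Hypothetical helper (not part of CRIU): print only the error lines of a log.
# Assumes lines of the form
#   (00.022400)      1: Error (criu/mount.c:1979): mnt: ...
#   (00.025376) Error (criu/cr-restore.c:2447): ...
# i.e. a timestamp, an optional "PID:" column, then "Error (".
criu_errors() {
    grep -E '^\([0-9.]+\)( +[0-9]+:)? Error \(' "$1"
}
```

For the report above this would be called as `criu_errors /tmp/u2/restore.log`.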

Describe the results you expected:

The restore operation completes successfully.

Additional information you deem important (e.g. issue happens only occasionally):

The same issue occurs with CRIU built from the master branch.

CRIU logs and information:

dump.log
restore.log

$ sudo /usr/sbin/criu --version
Version: 3.16.1
$ sudo /usr/sbin/criu check --all
sudo: mon_handle_sigchld: waitpid: No child processes
Looks good.
@adrianreber
Member

Looking at the error message, I would say that this is an lxc bug. I have had similar errors in runc/crun.

CRIU expects the file system to be set up in such a way that it can restore all mount points. The destination directory for /dev in the container does not exist, and it needs to be created by lxc before calling CRIU.
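
As a rough illustration of that expectation, a pre-restore check could confirm that the mount point directories exist under the root handed to CRIU. This is only a sketch and not lxc code: the function name missing_mountpoints is hypothetical, and which directories have to exist depends on the mounts recorded in the dump.

```shell
# missing_mountpoints ROOT DIR...
# Hypothetical pre-restore check (not part of lxc or CRIU): print every DIR,
# taken relative to ROOT, that does not exist as a directory.
# Returns non-zero if at least one directory is missing.
missing_mountpoints() {
    root=$1
    shift
    rc=0
    for dir in "$@"; do
        if [ ! -d "$root/$dir" ]; then
            printf '%s\n' "$root/$dir"
            rc=1
        fi
    done
    return "$rc"
}
```

With the --root value from the restore command in the report, a call might look like `missing_mountpoints /usr/lib/x86_64-linux-gnu/lxc dev`. Whether that is the right place and time for such a check is an assumption; the actual fix would belong in lxc.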

@Snorch
Member

Snorch commented Jul 1, 2024

I can't reproduce this error with a self-compiled criu v3.19 (I see a slightly different problem with not being able to kill cgroupd):

6.2.0-31-generic 
22.04.1-Ubuntu
lxc 1:6.0.0+main~20240626-1908-0ubuntu1~jammy

Steps:

lxc-create --name=u2 --template=download -- --dist ubuntu --release xenial --arch amd64

sudo tee -a /var/lib/lxc/u2/config << EOF
# hax for criu
lxc.console.path = none
lxc.tty.max = 0
lxc.cgroup.devices.deny = c 5:1 rwm
EOF

lxc-start u2

lxc-checkpoint -s -n u2 -D /tmp/u2 -v
lxc-checkpoint -r -n u2 -D /tmp/u2 -v

<hangs>

In gdb we see criu waiting for cgroupd to die:

#1  0x00007fbfac0ea3ab in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
#2  0x00005586bc7b9ebc in stop_cgroupd () at criu/cgroup.c:2052
#3  0x00005586bc7c7606 in restore_root_task (init=0x7fbfac66c058) at criu/cr-restore.c:2401
#4  0x00005586bc7c8abd in cr_restore_tasks () at criu/cr-restore.c:2652
#5  0x00005586bc79e75b in main (argc=<optimized out>, argv=0x7ffcc630b908, envp=<optimized out>) at criu/crtools.c:308

#2  0x00005586bc7b9ebc in stop_cgroupd () at criu/cgroup.c:2052
2052			waitpid(cgroupd_pid, NULL, 0);

(gdb) p cgroupd_pid
$1 = 70472

cgroupd stack:
#0  __recvmsg_syscall (flags=0, msg=0x7ffc3cccade0, fd=8) at ../sysdeps/unix/sysv/linux/recvmsg.c:27
#1  __libc_recvmsg (fd=fd@entry=8, msg=msg@entry=0x7ffc3cccade0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:41
#2  0x000055caa09649e1 in cgroupd (sk=8) at criu/cgroup.c:1968
#3  0x000055caa09a9b91 in start_unix_cred_daemon (pid=pid@entry=0x55caa0a98118 <cgroupd_pid>, daemon_func=daemon_func@entry=0x55caa0964930 <cgroupd>) at criu/namespaces.c:1489
#4  0x000055caa09673cf in prepare_cgroup_thread_sfd () at criu/cgroup.c:2064
#5  prepare_cgroup () at criu/cgroup.c:2242
#6  0x000055caa0975a9b in cr_restore_tasks () at criu/cr-restore.c:2643
#7  0x000055caa094b75b in main (argc=<optimized out>, argv=0x7ffc3cccc148, envp=<optimized out>) at criu/crtools.c:308

So if I just kill cgroupd (which is waiting for a command on its unix socket), everything else works:

:~# kill -9 70472
:~# tail /tmp/u2/restore.log 
(405.72011) 70473 was stopped
(405.72014) 70473 was trapped
(405.72014) 70473 (native) is going to execute the syscall 11, required is 11
(405.72017) 70473 was stopped
(405.72017) Run late stage hook from criu master for external devices
(405.72018) restore late stage hook for external plugin failed
(405.72018) Running pre-resume scripts
(405.72018) Restore finished successfully. Tasks resumed.
(405.72018) Writing stats
(405.72023) Running post-resume scripts

:~# lxc-ls -f
NAME STATE   AUTOSTART GROUPS IPV4 IPV6                                  UNPRIVILEGED 
u2   RUNNING 0         -      -    fc11:4514:1919:810:216:3eff:fe39:55a9 false

@Snorch
Member

Snorch commented Jul 2, 2024

The fix for the cgroupd problem is in #2427

github-actions bot commented Aug 2, 2024

A friendly reminder that this issue had no activity for 30 days.
