You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jun 30, 2021. It is now read-only.
I have seen at least three cases of smd attempting to destroy my entire mail spool. In the three cases, I have mostly recovered thanks to backups, notmuch, and the distributed nature of SMD/email. But it's nerve-wracking and, well, really, really bad.
I still don't quite understand what happens. I can't find it in the logs anymore, but I remember seeing something about smd tripping over an empty directory inside the .smd directory. I wrote about his on my blog here:
I think the fundamental problem is that smd-pull and smd-push's locking mechanisms might have a bug. I was able to start both processes at once by furiously hitting "reload" in my email setup (essentially telling systemd to repeately run smd-pull.service), which eventually started at the same time as smd-push. Here's a log from systemd:
jun 28 16:30:04 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Succeeded.
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:09 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:30:22 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:22 curie smd-pull[391140]: Already running.
jun 28 16:30:22 curie smd-pull[391140]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
jun 28 16:30:22 curie smd-pull[391140]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 390135) run(rm /home/anarcat/.smd/lock))
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:22 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Main process exited, code=killed, status=15/TERM
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Failed with result 'signal'.
jun 28 16:31:14 curie systemd[2845]: Stopped push emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Consumed 6.600s CPU time.
jun 28 16:31:22 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:31:23 curie smd-push[391563]: Found lockfile of a dead instance. Ignored.
jun 28 16:31:30 curie smd-push[392598]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(216)
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Succeeded.
jun 28 16:31:30 curie systemd[2845]: Finished push emails with syncmaildir.
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Consumed 4.727s CPU time.
jun 28 16:31:38 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:32:22 curie smd-pull[393092]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(51458), bytes-received(0), xdelta-received(247)
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Main process exited, code=killed, status=15/TERM
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Failed with result 'signal'.
jun 28 16:32:22 curie systemd[2845]: Stopped pull emails with syncmaildir.
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Consumed 12.632s CPU time.
The first thing to notice is that, at 16:30:09, two jobs are started in parallel. This is a bug in the systemd configuration, which shouldn't allow that in the first place. But considering that smd has its own locking primitives, I consider it's also a bug there. At 16:30:22, you see me triggering another run, and thensmd-pull notices its other smd-pull process running and aborts. But it's still running because it's actively destroying all mail on the remote server!!
At 16:31:14 you can see I notice the problem and kill the process. And then at 16:32:22 you can see the timer i forgot to disable fired and destroyed thousands of emails locally.
Out fo 300k emails, I lost 220k in this case.
I'll send a PR fixing the systemd configuration, but I can't help but think there should be safeguards in SMD preventing it from destroying everything like this.
The text was updated successfully, but these errors were encountered:
anarcat
added a commit
to anarcat/syncmaildir
that referenced
this issue
Jun 29, 2021
that does look "racey": if two programs run this, one could check for the lockfile existence ([ -f $LOCKFILE ]) right before the other creates it (echo $$ > $LOCKFILE) not find it, and then overwrite the other program's lockfile...
typically, you would call flock to do atomic locking, but it doesn't work in all situations either (e.g. NFS). locking is hard, particularly from shell...
I have seen at least three cases of smd attempting to destroy my entire mail spool. In the three cases, I have mostly recovered thanks to backups, notmuch, and the distributed nature of SMD/email. But it's nerve-wracking and, well, really, really bad.
I still don't quite understand what happens. I can't find it in the logs anymore, but I remember seeing something about smd tripping over an empty directory inside the
.smd
directory. I wrote about his on my blog here:https://anarc.at/blog/2021-03-22-email-crash/
https://anarc.at/blog/2021-06-29-another-mail-crash/
I think the fundamental problem is that
smd-pull
andsmd-push
's locking mechanisms might have a bug. I was able to start both processes at once by furiously hitting "reload" in my email setup (essentially telling systemd to repeately runsmd-pull.service
), which eventually started at the same time assmd-push
. Here's a log from systemd:The first thing to notice is that, at 16:30:09, two jobs are started in parallel. This is a bug in the systemd configuration, which shouldn't allow that in the first place. But considering that smd has its own locking primitives, I consider it's also a bug there. At 16:30:22, you see me triggering another run, and then
smd-pull
notices its othersmd-pull
process running and aborts. But it's still running because it's actively destroying all mail on the remote server!!At 16:31:14 you can see I notice the problem and kill the process. And then at 16:32:22 you can see the timer i forgot to disable fired and destroyed thousands of emails locally.
Out fo 300k emails, I lost 220k in this case.
I'll send a PR fixing the systemd configuration, but I can't help but think there should be safeguards in SMD preventing it from destroying everything like this.
The text was updated successfully, but these errors were encountered: