Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gix-lock locked access forever and could not recover #1026

Closed
kornelski opened this issue Sep 20, 2023 · 17 comments
Closed

gix-lock locked access forever and could not recover #1026

kornelski opened this issue Sep 20, 2023 · 17 comments
Labels
acknowledged an issue is accepted as shortcoming to be fixed

Comments

@kornelski
Copy link
Contributor

Current behavior 😯

I've upgraded rustsec crate to a version using gix, and after a while (probably after an unexpected restart) every fetch of rustsec's git database has been failing due to being locked forever.

unable to acquire filesystem lock: directory "{resource_path:?}" still locked after {attempts} attempts

BTW: the error was literally with the placeholders like that.

I've been restarting and retrying the process many times, and it was locked up every time.

I've noticed that in the parent dir of the path given to rustsec's Repository::fetch there was a 0-byte file rustsec.rustsec.lock. Deleting the file unblocked rustsec's fetch.

Backtrace of the process while it was waiting:

  frame #2: 0x0000aaaae1d97d2c crates-server`std::thread::sleep::h55cdaca4ca4be380 [inlined] std::sys::unix::thread::Thread::sleep::hfafd94621608adaa at thread.rs:237:20
    frame #3: 0x0000aaaae1d97d00 crates-server`std::thread::sleep::h55cdaca4ca4be380 at mod.rs:872:5
    frame #4: 0x0000aaaae1030418 crates-server`gix_lock::acquire::lock_with_mode::h11c2d5ec4a114104 at acquire.rs:123:25
    frame #5: 0x0000aaaae1098aac crates-server`rustsec::repository::git::repository::Repository::fetch::h099db295696cce7e [inlined] gix_lock::acquire::_$LT$impl$u20$gix_lock..Marker$GT$::acquire_to_hold_resource::hd64be3ad5ec63f0d at acquire.rs:78:35
    frame #6: 0x0000aaaae1098a8c crates-server`rustsec::repository::git::repository::Repository::fetch::h099db295696cce7e at repository.rs:113:21

Expected behavior 🤔

I think a lockfile has to include and check process pid, because a crash of the process can leave a lockfile behind, and it has to have a way to recover.

Also parent dir of the git checkout is a permanent location, so such lock will survive reboots. On Linux I'd expect lockfiles in /var/run.

Steps to reproduce 🕹

touch example_dir/rustsec.rustsec.lock
use rustsec::*;
Repository::fetch(DEFAULT_URL, "example_dir/rustsec", true, Duration::from_secs(60))?;
@Byron
Copy link
Member

Byron commented Sep 21, 2023

tldr; I think this is an issue that should probably be resolved in rustsec, but discussing it here might lead to usability improvements in gix-lock.

As an immediate action, all I can do here is to improve the documentation to warn about these side-effects if signal handler are not installed in the application. This, however, only pushes the can down the road as ultimately the application will have to know that certain crates are only safe to use if they install gix signal handlers or they are able to deal with lock-errors.

cargo for instance doesn't install signal handlers, but instead considers stray lock files corruption and clears the corresponding directory before retrying (and that's independent of its own lock files that use filesystem locks).


rustsec creates a lock file using gix-lock to be multi-process safe. gix-lock, however, is a git-style lock that is implementing a lock -> write-to-lock -> rename-to-overwrite-locked-resource flow to safely write resources on disk and prevent other processes to interfere. Being git-style also means that it needs signal handlers to be installed in order have a good chance of being cleaned up correctly. Otherwise, when a signal hits, the program is aborted and destructors don't run, leaving a stray lock file and locking the resource forever. git itself also suffers from this problem occasionally, but it's clear about its error reporting so the user knows which file to clean up.

The error reporting code that fails to substitute values also is in rustsec.

I think a lockfile has to include and check process pid, because a crash of the process can leave a lockfile behind, and it has to have a way to recover.

Also parent dir of the git checkout is a permanent location, so such lock will survive reboots. On Linux I'd expect lockfiles in /var/run.

Thanks for the suggestion. I'd like to add more options as this one isn't feasible to do with gix-lock today, even though it could be implemented on top of it.

  • use a crate like fslock to use filesystem locks - these are dropped automatically once the owning process shuts down, even if the lock file persists. This is the style of locking cargo uses as well, but unfortunately it doesn't expose it as crate.
  • Leave locking to the application level. This would allow it to create a simple gix-lock, or alternative, in /var/run on its own accord, and implement a PID file by hand if needed with the benefit of auto-clearing stray locks on reboot.
  • on application level, configure gix with the interrupt feature and set it up in the application. Note that SIGKILL may leave the resource locked anyway. This would resolve the issue right away as it would make the application gracefully abort even mid-clone. For reference, it's the same effect that you get when running gix clone <URL> and hitting ctrl+C, it uses the exact same handler mechanism.

Let me also reel in @Shnatsel who probably wants to know about this.

@Shnatsel
Copy link

Unfortunately the current gix locks are problematic by design - there is no way to recover from a SIGKILL delivered to the process. Once that happens, the lock remains stale forever.

Requiring manual intervention, like git CLI does, would be very problematic on servers - it would be a constanty source of outages.

There are two ways to go about this:

  1. Switch to fslock crate that avoids these issues by using OS-provided locks instead of lock files.
  2. Keep piling on more features on top of manually implemented lock files. Specifically write the PID to the file and have other processes check if that PID is still alive; and write the locks in a directory that is cleared on reboot, as suggested by @kornelski

I was trying to think of downsides of (1), but I could not find any that don't also apply to (2). Both won't work on truly obscure platforms such as Fuchsia, and both won't work over NFS (but there is no sane way to lock over NFS anyway).

So I suggest using fslock crate that translates to flock() on Unix and handles all the edge cases automatically - the lock is released once the process terminates even with SIGKILL, and is cleared on reboot.

@Shnatsel
Copy link

I've fixed error reporting in rustsec in rustsec/rustsec#1012

@Byron
Copy link
Member

Byron commented Sep 21, 2023

Thanks @Shnatsel for chiming in! Now that this issue is tracked in rustsec I will close this one as there is nothing more I can do here except for informing more prominently about the limitations of the git-style locks that implemented here (see #1027).

It's worth noting that locks created during clone may also remain, which could lead to refs which can't be updated, for example, in an otherwise fine repo. A great way to prevent this is to implement signal handlers on application level as these will abort every operation gracefully, and rustsec is implemented correctly to allow that (as it uses gix::interrupt::IS_INTERRUPTED). Of course, SIGKILL will always be a problem for locks of this kind but I'd assume that reboot will first try SIGTERM and only kill after a grace period.

@Byron Byron closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2023
@Shnatsel
Copy link

Unfortunately there is not much that can be done on rustsec level to fix this. I've described how the locks implemented by gix are inherently problematic in the message above, here. The proper fix is getting gix to use better locks, such as the fslock crate - or abandoning gix locking entirely, and using custom locking in both rustsec and tame-index crates (and everything else that depends on gix).

Is there a particular reason why gix locks are implemented the way they are? Is this for compatibility with git CLI?

@Byron
Copy link
Member

Byron commented Sep 21, 2023

Is there a particular reason why gix locks are implemented the way they are? Is this for compatibility with git CLI?

Yes, this is how git implements its locks, probably because they are, overall, more compatible and work on a wider range of filesystems.


Thanks for your persistence!

I was ducking away from the problem a bit as file-locks are notoriously difficult for all reasons already mentioned and I didn't really want to deal with it, again. But then it became clear that git-style locking will always be a problem for automated systems and force administrators to specifically deal with it. Isn't there a better way? After all, gix has the opportunity here to work better on server systems!

The only way forward is to use filesystem locks where possible. Doing this broadly would be breaking with git as all of the sudden, gix processes could think they acquired a lock (as filesystem lock) even if the lock file already existed. However, in case of a specific application, the lock-location will only ever content with other invocations of this application, which makes filesystem locks feasible.

My thought is to add filesystem locking like fslock as second level - once the (exclusive) lock can't be acquired as the file exists, and if compiled in, it will try to acquire a filesystem lock. This effectively takes ownership of stray locks when possible.

This solution, similar to setting up signal handlers, would still require the application to take action and enable this cargo feature in gix-lock (probably via gix for convenience). But as opposed to setting up signal handlers, it's just a cargo feature (no additional code needed), and it would work correctly in the presence of SIGKILL.

What do you think?

PS: I think I could adapt cargo's lock implementation as probably one of the best tested and most-run implementations around.

@kornelski
Copy link
Contributor Author

Locks being robust to SIGKILL is a very important case for me. Otherwise a single temporary failure becomes a permanent problem. SIGKILL can happen due to many reasons, e.g. OOM killer. In servers I usually have my own handler for SIGTERM, but the server doesn't always shut down fast enough and gets killed too.

@Byron
Copy link
Member

Byron commented Sep 21, 2023

@kornelski The solution I propose would probably work here, but requires you to pull in gix-lock to configure a feature, or gix if the cargo feature is forwarded. If this is acceptable to you, I will see to improving gix-lock right away. Please note that this must be opt-in and it should be opt-in on application level only (as opposed to rustsec making the decision for you) as such a change is incompatible with git and gix has to be compatible by default.

@kornelski
Copy link
Contributor Author

I'm surprised that this is incompatible, and needs extra configuration.

Do you know what libgit2 does? I've been using libgit2 for years, and never had an issue with a lock file. gix failed me in the first week of using it.

@Byron
Copy link
Member

Byron commented Sep 21, 2023

Do you know what libgit2 does?

A good point: libgit2 seems to use a very similar approach as git does. However, it doesn't expose locking functionality so rustsec couldn't easily use it in git2 times. Even though this setup probably lacked a more central, long-lived lock at this time, the chance for stray locks definitely wasn't zero. But maybe it was just low enough to not matter on a single server.
But it's all about the probabilities, right? cargo does a lot to be corruption-resistant for instance, which probably is the reason there weren't any/many issues with it in that regard either despite git2.

@Shnatsel
Copy link

@kornelski The reason why you didn't have issues with git2 is that rustsec simply didn't perform any locking back then.

@Shnatsel
Copy link

I don't think this behavior should be controlled by a Cargo feature. Features are additive and global, so any dependency anywhere in the tree enabling this feature will affect all the other dependencies. If any configuration options are provided, they should be done as an argument to a function, or as separate functions.

Acquiring two locks does not really solve the problem of the stale lock being left behind, so I don't think that is workable. Within the constraints of git compatibility, I believe the best we can do is write the process ID into the lockfile and check that the process with the ID from the file is still alive when trying to acquire the lock. And maybe also something like boot time, to catch instances of the PID being reused after a reboot.

I've looked at the Cargo implementation and it just uses OS locks everywhere, forgoing git locks and all the issues associated with them. It does seem that for crates like rustsec that do not need to coordinate with git CLI it is best to do what Cargo does and just use OS locks instead of git lockfiles.

@Byron
Copy link
Member

Byron commented Sep 21, 2023

I don't think this behavior should be controlled by a Cargo feature. Features are additive and global, so any dependency anywhere in the tree enabling this feature will affect all the other dependencies. If any configuration options are provided, they should be done as an argument to a function, or as separate functions.

This cargo feature is strictly for the application level. It's no similar to a few feature toggles in gix which are relevant only for performance and for that reason must not be set by a library, but only by the application. Yes, this would also mean that application developers that have the gix dependency within their tree would need to configure it to get max-performance. Otherwise, they only get the pure-Rust versions of certain algorithms, which builds everywhere by default.
The new feature that would solve this issue here would have to be made in such a way that it's perfectly transparent.

Acquiring two locks does not really solve the problem of the stale lock being left behind, so I don't think that is workable.

Instead of failing on a stale lock, it would instead try to acquire a filesystem lock, and if that is successful, it would take ownership of the stale lock which means it will be removed once the destructor runs. Also, encountering a stale lock then wouldn't be the end of it.
But it's a valid point - what's the lifecycle of the std::fs::File (which is required for a filesystem lock) that is needed to actually hold the lock then? And what should happen to locks that are only markers (or closed for that matter) which don't have a File, and thus don't consume system resources. gix-ref for example is relying on closed/marker locks to not consume resources when possibly holding thousands of them, but with the semantic change that I was proposing one would have to keep the File open to retain a lock and the process might run out of resources. Otherwise, other instances of the same applications would always be inclined to think that markers are stale even when they are not. So as a summary, filesystem locks would have great potential for resource exhaustion.
But wait, I think the times when gix-ref would possibly create and hold thousands of lock-files are over by now at least during cloning as it will write these into packed-refs right away.

Within the constraints of git compatibility, I believe the best we can do is write the process ID into the lockfile and check that the process with the ID from the file is still alive when trying to acquire the lock. And maybe also something like boot time, to catch instances of the PID being reused after a reboot.

Yes, it at least should be considered an option, one that I'd still have to implement behind a feature toggle as it will add costs while still not being perfect if the pid of the server process is accidentally the same. I guess that's another reason why these files are usually written in a location that is wiped after reboot.

I definitely lean towards filesystem locks though, just because they seem to work very well for cargo.

I've looked at the Cargo implementation and it just uses OS locks everywhere, forgoing git locks and all the issues associated with them. It does seem that for crates like rustsec that do not need to coordinate with git CLI it is best to do what Cargo does and just use OS locks instead of git lockfiles.

I agree, it's probably best to create this lock using filesystem locks, separately. I kept looking and found fs4 which seems to be recent in terms of dependencies and is maintained.

Maybe getting this single, long-lived lock right would already solve all problems. All other locks that would be created during a clone are to write resources, and these are typically short lived. Packs are the only long-lived resource, but these are normal tempfiles as their name isn't known until they are done streaming. That would get @kornelski back into the two years without (this kind of) failure territory I am sure :).

@Byron
Copy link
Member

Byron commented Sep 21, 2023

After looking into what the filesystem-lock feature would mean for gix-lock, I decided to hold off with any implementation as it would seriously complicate an otherwise pleasantly simple crate. If there is still demand for this, I'd reconsider though - just let me know. In the meantime, I think rustsec can use filesystem locks itself, and we can assume that all other operations on the repo will be just as likely to produce stale locks as previously with git2.

I found thinking about this issue very interesting though and believe when implementing a git server, I'd expect stale locks to exist. Probably the first time the server opens a repo, it would scan for stale locks and delete them, assuming no concurrent operations can exist as it would be the one to manage them anyway.

Closing once again as not planned, but can reopen if there is demand.

@Byron Byron closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2023
@Shnatsel
Copy link

tame-index has resolved this by using OS locking: EmbarkStudios/tame-index#33

I've opened a PR to switch from gix locks to tame-index's OS locking implementation in rustsec too: rustsec/rustsec#1032

@Byron
Copy link
Member

Byron commented Oct 1, 2023

Thanks for sharing, I am happy this finally gets resolved. Of course I hope that tame-index will spin off its cargo-inspired lock implementation into their own crate, I definitely want more of these, @Jake-Shadle :) 🙏.

@Shnatsel
Copy link

I've just published rustsec v0.28.3 that switches to OS locking and resolves the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged an issue is accepted as shortcoming to be fixed
Projects
None yet
Development

No branches or pull requests

3 participants