
rm -rf on hardlink deletes original files #34

ORESoftware opened this issue Sep 10, 2017 · 4 comments
ORESoftware commented Sep 10, 2017

It appears that

rm -rf on a hard link deletes the original files

This is dangerous; on Linux, if you rm the hard link, the original file is still intact.

Can you confirm / deny this behavior with your lib on MacOS?

@BenjaminHCCarr

@ORESoftware this is the intended/desired behavior on UNIX

All hardlinks point to the same inode, and therefore to the same spot on the disk.
see:
http://www.farhadsaberi.com/linux_freebsd/2010/12/hard-link-soft-symbolic-links.html
https://www.freebsd.org/cgi/man.cgi?query=ln

If your Linux distro is deviating from this, it is not following the UNIX standard.

Hardlinks and symlinks act differently. If you delete a symlink, e.g. ln -s $source $target and then rm $target, you will still have $source; the symlink is just a movable pointer.

Often with symlinks, if you delete $source you will end up with "dead" $target symlinks lying around.
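A minimal sketch of both symlink behaviors, using Python's standard os module (the file names stand in for $source and $target and are made up for illustration):

```python
import os
import tempfile

d = tempfile.mkdtemp()
source = os.path.join(d, "source.txt")   # plays the role of $source
target = os.path.join(d, "target_link")  # plays the role of $target

with open(source, "w") as f:
    f.write("hello")

os.symlink(source, target)   # ln -s $source $target
os.remove(target)            # rm $target: only the pointer goes away
source_survives = os.path.exists(source)

os.symlink(source, target)   # recreate the link, then delete $source instead
os.remove(source)
# The symlink still exists as a pointer, but it now points at nothing.
dangling = os.path.islink(target) and not os.path.exists(target)
```

Removing the target leaves the source intact, while removing the source leaves a "dead" symlink behind.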


So yes, I can confirm that deleting a hardlink deletes the inode, and thus the original data. This is the desired behavior, though.


ORESoftware commented Sep 11, 2017

@BenjaminHCCarr @selkhateeb is there a way to 'remove/undo the hardlink' without deleting the original files?

Do you know if hln will work on Linux? Or just MacOS?


mhelvens commented Oct 6, 2017

@BenjaminHCCarr: Unix standard? I don't think that's true. I don't know about FreeBSD, but both on my Linux box and my Macbook, deleting one of the references of a hard link (created with ln) leaves the others intact. Deleting the last reference deletes the inode. It uses reference counting.
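A quick sketch of that reference-counting behavior with Python's os.link (file names are made up); this is what both Linux and macOS do:

```python
import os
import tempfile

d = tempfile.mkdtemp()
original = os.path.join(d, "original.txt")
link = os.path.join(d, "link.txt")

with open(original, "w") as f:
    f.write("still here")

os.link(original, link)   # hard link: both names reference the same inode
os.remove(original)       # delete one of the two references

# The data is intact as long as at least one reference remains.
with open(link) as f:
    content = f.read()
```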

I'd love to get that functionality here too.


Swivelgames commented Jan 31, 2024

This is a bit of a necropost, but I wanted to put this out there since this is coming up in Google searches.

rm -rf is working as expected. For links, you want unlink.

@ORESoftware The idiomatic way would be to use unlink; however, I'm not sure whether that applies to how this repo achieves hardlinked directories, and there are protections in place that try to prevent you from unlinking directories. It is expected that rm -rf will delete the directory and its contents, by nature of how the command works: -r first purges files recursively until the directory is empty, then deletes the directory itself from the filesystem.


@mhelvens Not for the contents of a directory, if the directory itself is unlinked rather than removed recursively. The behavior that @BenjaminHCCarr is describing is exactly correct in that case. And that can be disastrous on a larger scale, which is why hardlinked directories are generally discouraged.

Deep-dive into why Hardlinked Directories are difficult and dangerous

Directories are just files

In Unix, every file or directory is a "hardlink". In fact, in the actual on-disk data structure of ext-based filesystems, even directories are just files, and their contents are just a map of filenames to their respective inodes:

foo,143927
bar,127694

This would represent a directory containing two hardlinks: foo and bar. Either foo or bar could be a real file (in the traditional sense) or another directory. In either case, they're treated the same.
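You can actually observe this filename-to-inode map from userspace; here's a small sketch using Python's os.scandir (the file names mirror the example above, and the inode numbers will of course differ on your machine):

```python
import os
import tempfile

# A fresh directory with two entries, like the foo/bar example above.
d = tempfile.mkdtemp()
open(os.path.join(d, "foo"), "w").close()
open(os.path.join(d, "bar"), "w").close()

# Each directory entry is literally a (name, inode) pair.
table = {entry.name: entry.inode() for entry in os.scandir(d)}
# table is e.g. {"foo": 143927, "bar": 127694}, with real inode numbers
```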

The inode contains the metadata for the file, including the type of file (such as whether it's a regular file or a directory), the permissions (or mode), the location of the data blocks on the drive where the contents are stored, and a reference counter that tracks how many directories the inode is referenced in.

unlink effectively just deletes the entry from the directory it's in and decrements the reference counter (in fact, when trying to find references after writing this, I found that this is explicitly how IBM describes the unlink command). So:

unlink foo

Would result in:

- foo,143927
  bar,127694

The process then checks to see if the reference counter is 0 for that specific inode (in this example 143927). If it is 0, then we can assume that no other directories are pointing to it, and then the blocks on the drive that it points to are freed for use by new files.
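That reference counter is visible from userspace as st_nlink; a small Python sketch (file names made up) shows it being decremented by unlink:

```python
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a")
b = os.path.join(d, "b")

with open(a, "w") as f:
    f.write("data")
os.link(a, b)                         # second directory entry, same inode

links_before = os.stat(a).st_nlink    # 2: two entries point at one inode
os.unlink(a)                          # drop one entry, decrement the counter
links_after = os.stat(b).st_nlink     # 1: data blocks are still allocated
```

Only when the counter reaches 0 are the data blocks freed.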

Why Hardlinked Directories are so difficult

So, if we have a hardlinked directory, we don't want to recursively delete it and its subfiles. We simply want to remove the pointer to that directory in that particular location.

In fact, one of the reasons hardlinked directories are avoided is the potential for lost space. For instance, theoretically, we could unlink the last pointer to a directory. The space on the drive that contains the filename-to-inode list itself would be "freed", but all of the files within the directory might be stuck on the drive forever and never freed.

-r to the rescue

This is why we have -r for rm. In order to avoid the headaches described above, we need to explicitly delete each individual file before we unlink the directory itself. In fact, unlink doesn't work on directories, but only because the command itself is very simple and isn't built for that recursion, so it explicitly forbids it. That extra code would make the process inefficient and dangerous, especially if it's not something we explicitly wanted to do.

Otherwise, the filesystem might not realize that those files are no longer being referenced by the directory, because we didn't actually touch those files. We didn't explicitly delete the individual references to them, we just deleted the last list that contained all of those references. With that gone for good, those inodes are orphaned forever without a host directory.

That would be a nightmare, and our large drive would quickly run out of space, and there'd be no telling why.
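A quick Python sketch of that refusal (the exact error type varies by OS, e.g. IsADirectoryError on Linux versus PermissionError on macOS, but both are OSError):

```python
import os
import shutil
import tempfile

d = tempfile.mkdtemp()
sub = os.path.join(d, "sub")
os.mkdir(sub)
open(os.path.join(sub, "file"), "w").close()

# os.unlink (like unlink(2)) refuses directories outright.
try:
    os.unlink(sub)
    refused = False
except OSError:
    refused = True

# rm -rf's approach: delete the contents first, then the directory itself.
shutil.rmtree(sub)
removed = not os.path.exists(sub)
```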

Even though people do, we shouldn't even rm a softlinked directory

The data loss @ORESoftware experienced is actually one of the reasons it is recommended to avoid using rm on softlinked directories in general, and to use unlink on them instead. Getting into the practice of using rm on directory links is dangerous. Instead, rm is a more destructive and capable version of unlink that we only want to use if we explicitly want to purge a directory's contents from the drive.

Further Reading

It's important to understand that unix filesystems don't distinguish between the "original" file/directory and the hardlink. Because of this, we could technically create a hardlink to a file/directory, and then unlink the original location and the data would still exist.

In fact, the move operation works exactly like this. It doesn't explicitly move the data. It simply adds the reference to the new directory, and then removes it from the old directory.

So mv on foo/bar to quux/bar does the following:

$ mv foo/bar quux/
@@ quux/
+ bar,134789
@@ foo/
- bar,134789

That's why move operations are so fast on unix systems! 🙂
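You can verify this from userspace: the inode number survives the move. A small Python sketch using os.rename, which is what mv does within one filesystem (the paths mirror the foo/quux example above):

```python
import os
import tempfile

d = tempfile.mkdtemp()
foo = os.path.join(d, "foo")
quux = os.path.join(d, "quux")
os.mkdir(foo)
os.mkdir(quux)

src = os.path.join(foo, "bar")
with open(src, "w") as f:
    f.write("payload")

inode_before = os.stat(src).st_ino
os.rename(src, os.path.join(quux, "bar"))   # mv foo/bar quux/
inode_after = os.stat(os.path.join(quux, "bar")).st_ino
# Same inode: only the directory entries changed, not the data.
```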

I find all of this super fascinating, so I thought I'd share for those who didn't realize.


When a directory is unlinked, the references inside of that directory aren't checked. The only requirement is that the directory is empty, because doing this type of recursive checking would be way too performance-intensive. So, instead, unlink simply refuses to unlink anything that is a directory, and rm refuses to remove a directory without -r if it isn't empty.

This is explicitly to prevent orphaned inodes. It isn't completely avoidable, though, and that's why we have fsck. But imagine having to run fsck on an in-use filesystem every time you deleted a directory.

That's actually the origin of lost+found. Dangling/orphaned inodes are put in lost+found if their hardlink was destroyed but their inode and data weren't cleaned up. Instead of purging the inode and its data, the assumption is that whatever happened wasn't supposed to, so fsck creates a hardlink to the inode and throws it in lost+found so that it can either be recovered or permanently destroyed with rm -rf.


Path forward

The only realistic path forward for this would be a custom wrapper for unlink that performs the white-glove checks to make sure that things are in order:

  • Check to see if there are other directories pointing to the inode by looking at the reference counter
  • If there are, it's safe to remove the hardlink we asked it to remove
  • Otherwise, throw an error, saying that it's the last link to the directory, so rm -rf must be used instead
  • Also, if the target isn't a directory at all, we just run the built-in unlink on it, so we don't change the way it works
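The checks above could be sketched roughly like this in Python (entirely hypothetical: the name safe_unlink and the st_nlink threshold are my assumptions, and actually unlinking a multiply-linked directory would need filesystem support that no mainstream OS provides; on a normal filesystem the directory branch always refuses):

```python
import os
import tempfile

def safe_unlink(path):
    """Hypothetical wrapper around unlink with the white-glove checks."""
    if os.path.isdir(path) and not os.path.islink(path):
        # An empty directory's st_nlink is typically 2 ("." plus the
        # parent's entry); anything <= 2 means no extra hardlinks exist.
        if os.stat(path).st_nlink <= 2:
            raise OSError(path + ": last link to this directory, "
                          "use rm -rf instead")
        os.unlink(path)  # would drop one of several links to the directory
    else:
        os.unlink(path)  # file or symlink: just the built-in behavior

# File branch: behaves exactly like unlink.
d = tempfile.mkdtemp()
f = os.path.join(d, "plain.txt")
open(f, "w").close()
safe_unlink(f)
file_gone = not os.path.exists(f)

# Directory branch: refuses to remove the last link.
sub = os.path.join(d, "sub")
os.mkdir(sub)
try:
    safe_unlink(sub)
    refused = False
except OSError:
    refused = True
```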
