Hard linked files are sometimes missed #29

Open
lbmiller opened this issue Sep 8, 2016 · 4 comments

Comments

lbmiller commented Sep 8, 2016

When two directories contain files with the same name that are hard links to each other, the second one is not reported. For example, the tzdata package installs many copies of the same files into /usr/share/zoneinfo, so files like /usr/share/zoneinfo/GMT0 and /usr/share/zoneinfo/Etc/GMT0 (and many others) are hard links.

Both GMT0 and Etc/GMT0 should be reported when traversing /usr/share/zoneinfo.

lbmiller (Author) commented Sep 8, 2016

The file's basename and inode are tracked in a hash, and a later file is not reported if it has the same inode and basename. I believe that map is there to help when following a symlink whose target directory would also be traversed on its own. In the hard link case the match is a false positive. One solution (though not perfect) is to ignore the match in the hash if the target file is actually a hard link. This narrows one corner case, but still leaves other, more obscure corner cases.
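
To make the collision concrete, here is a minimal sketch of a "seen" check keyed on inode plus basename. It is an illustration only: `reportOnce`, the key format, and the inode value 1234 are assumptions made up for this example, not walkdir's actual code.

```javascript
const path = require('path');

// Hypothetical stand-in for the walker's "seen" check; not walkdir's actual code.
function reportOnce(entries) {
  const seen = new Set();
  const reported = [];
  for (const entry of entries) {
    // Key on inode + basename, as described in the comment above.
    const key = entry.ino + ':' + path.basename(entry.path);
    if (seen.has(key)) continue; // treated as "already visited", so never reported
    seen.add(key);
    reported.push(entry.path);
  }
  return reported;
}

// Two hard links to the same file: same inode (made-up value), same basename.
const entries = [
  { path: '/usr/share/zoneinfo/GMT0', ino: 1234 },
  { path: '/usr/share/zoneinfo/Etc/GMT0', ino: 1234 },
];

const reported = reportOnce(entries);
console.log(reported); // only '/usr/share/zoneinfo/GMT0' is listed
```

Because both paths produce the same key, the second one is dropped even though it is a distinct directory entry.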

I am providing two possible (imperfect) solutions for your review.

  1. Do not suppress the duplicate report if the file is a real hard link (nlink > 1)
    https://github.com/lbmiller/node-walkdir/tree/hardlinks
  2. Same, but disabled by default; opt-in using the {report_hard_links:true} option
    https://github.com/lbmiller/node-walkdir/tree/hardlinks-opt
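
The idea behind both branches can be sketched roughly as follows. This is not the code from either branch: `shouldReport` is a hypothetical helper, the inode numbers are made up, and the plain objects stand in for the `ino` and `nlink` fields of an `fs.Stats` object. Only the `report_hard_links` option name comes from the proposal above.

```javascript
const path = require('path');

// Hypothetical helper, not the code from either branch.
// `stat` only needs the `ino` and `nlink` fields of an fs.Stats object.
function shouldReport(seen, filePath, stat, options = {}) {
  const key = stat.ino + ':' + path.basename(filePath);
  const isDuplicate = seen.has(key);
  seen.add(key);
  if (!isDuplicate) return true;
  // Same inode and basename as an earlier entry. With the option enabled,
  // a link count above 1 is taken to mean a genuine hard link, so report it.
  return Boolean(options.report_hard_links) && stat.nlink > 1;
}

const seen = new Set();
const opts = { report_hard_links: true };
const results = [
  // Hard-linked pair: both should be reported.
  shouldReport(seen, '/usr/share/zoneinfo/GMT0', { ino: 1234, nlink: 2 }, opts),
  shouldReport(seen, '/usr/share/zoneinfo/Etc/GMT0', { ino: 1234, nlink: 2 }, opts),
  // A single file reached twice (e.g. via a symlinked directory) has nlink 1.
  shouldReport(seen, '/data/real/a.txt', { ino: 77, nlink: 1 }, opts),
  shouldReport(seen, '/data/link/a.txt', { ino: 77, nlink: 1 }, opts),
];
console.log(results); // [ true, true, true, false ]
```

One of the remaining corner cases is visible in the sketch: a file that has more than one link and is also reached a second time through a symlinked directory would be reported twice.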

I suspect a more complete solution would require keeping track of symlinks encountered during the scan and then resolving and reporting any dangling symlink targets after all other files are reported.

soldair (Owner) commented Sep 12, 2016

I'm not sure I understand how to use nlink the way you propose, but I'll check it out. Thanks for the examples.

The reason we keep the hash is to prevent infinite recursion when following symlinks to directories. The current behavior is too broad, in that we never report a file we have seen before.

We should probably change the behavior to never list a directory we have listed before and bump the major version.

The basename check seems like a bug disaster waiting to happen on Windows, where ino is empty.
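
A rough sketch of the "never list a directory twice" behavior, under some loudly stated assumptions: `walk` is a hypothetical synchronous walker written for this example, not walkdir's API, and it identifies directories by `fs.realpathSync` rather than by inode so that it does not depend on `ino` being meaningful.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical walker, not walkdir's implementation.
function walk(root) {
  const listedDirs = new Set();
  const files = [];
  (function visit(dir) {
    const real = fs.realpathSync(dir);
    if (listedDirs.has(real)) return; // already listed: stop to avoid symlink loops
    listedDirs.add(real);
    for (const name of fs.readdirSync(dir).sort()) {
      const full = path.join(dir, name);
      const stat = fs.statSync(full); // follows symlinks
      if (stat.isDirectory()) visit(full);
      else files.push(full); // files are always reported, never deduplicated
    }
  })(root);
  return files;
}

// Demo tree: two directories with a same-named file, plus a symlink loop.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'walk-'));
fs.mkdirSync(path.join(root, 'a'));
fs.mkdirSync(path.join(root, 'b'));
fs.writeFileSync(path.join(root, 'a', 'index.js'), '');
fs.writeFileSync(path.join(root, 'b', 'index.js'), '');
try {
  fs.symlinkSync(root, path.join(root, 'b', 'loop'), 'dir'); // points back at root
} catch (err) {
  // Creating symlinks can fail without privileges; the demo still runs without the loop.
}

const found = walk(root).map((f) => path.relative(root, f));
console.log(found); // both index.js files, and the walk terminates despite the loop
```

The trade-off in this sketch is that a file reachable through two different directory paths is only listed under the first path visited, since the second visit to its directory is skipped.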

MarkDuckworth commented

For what it's worth, I'm seeing similar behavior walking node_modules on a Windows 8.1 machine. I'm not sure if hard links are at play in my case, but we see files with the same name occasionally skipped. As a simple example, consider a node_modules folder that includes node_modules/a_module/index.js and node_modules/b_module/index.js: one of these files may be skipped.

lbmiller (Author) commented Jan 5, 2017

@MarkDuckworth I'm curious whether either of my proposed fixes would work in your situation. Are you able to try them? Also, in your case, are the contents of the two files the same?
