detect-duplicates

Detects and reports duplicate files in a file system.

This program runs in linear time with respect to the number of files being checked and loads (in expectation) only two files into memory at a time, and can thus be used to efficiently check huge amounts of very large files for duplicates.

As a benchmark, it can check the entire linux kernel repository (roughly 5GB, over 70,000 files) in about 5 seconds.

Installation:

You will need to install Rust in order to install this program. After installing Rust, installation is simply:

cargo install detect-duplicates

Usage:

This program takes a path to a directory and outputs any files which are copies of another file within that directory or any of its children. Each group of equal files will be output as a newline separated list. Groups of files are separated by two newlines.

For example, given the following directory structure:

files
├── a.txt
├── b.txt
└── more_files
    ├── c.txt
    ├── d.txt
    └── even_more_files
        ├── e.txt
        └── f.txt

If a.txt and e.txt have the same contents and b.txt, c.txt, and d.txt have the same contents, then this command:

detect-duplicates files

will produce the following output:

files/a.txt
files/more_files/even_more_files/e.txt

files/b.txt
files/more_files/c.txt
files/more_files/d.txt

And this command:

detect-duplicates files/more_files

would output:

files/more_files/c.txt
files/more_files/d.txt

Technical details:

Comparing large files is expensive, so we want to minimize the amount of comparisons done. Reading files is also expensive, but not as expensive as buying more RAM. We can achieve a linear number of comparisons while reading each file at most twice by doing the following:

We compute a hash of the contents of each file, and store a mapping from the hashes to lists of files with that same hash.
For each list of files with at least two elements with the same hash, we store a mapping from the contents to the lists of files with the same contents.
We report as duplicates all files which have the same key in the second map.

Because we use high quality 64 bit hashes, the probability of more than one file having the same content hash but different contents is roughly¹ #files with same content hash/2^64, which should be extremely small. Thus, we expect there to only be one key in the second map at a time and one file loaded, which we need to compare to the value in the map, meaning that, at most, we expect to have two files loaded simultaneously into RAM. Each file is loaded at most twice in total; once to compute the key for the first map, and once to compute the key in the second map, if applicable.

The exact value is 1 - ((k - 1) / k) ** F, where F is the number of files with the same content hash and k is 2^64, which is about the same as F/k for practical values (F < 1,000,000,000). ↩

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE.txt		LICENSE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

detect-duplicates

Installation:

Usage:

Technical details:

About

Releases

Packages

Languages

License

kpberry/detect-duplicates

Folders and files

Latest commit

History

Repository files navigation

detect-duplicates

Installation:

Usage:

Technical details:

Footnotes

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages