Utility to list duplicate files in one or more directories based on the file contents (rather than the name).
FileDedupe is a utility that checks one or more directories for duplicate files. Just run it with a list of directories on the command line. By default, it checks all subdirectories; this can be controlled (see below). The output, written to stdout, is a text listing of the files that have duplicates: each file is given, followed by its duplicates.
An article on this utility and how it was designed and written appears in Oracle's Java Magazine.
Version 1.0 used a brute-force approach of running checksums on every file in the user-specified directories and then comparing the checksums to identify duplicates. This worked well, but was slow.
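The brute-force idea can be sketched as follows: checksum every file, bucket the files by checksum, and report any bucket with more than one member. This is a minimal sketch, not FileDedupe's actual code; the class and method names are hypothetical, and CRC-32 is used here only as a stand-in for whatever checksum the utility actually computes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class BruteForceDedupe {

    // Compute a CRC-32 checksum over the file's entire contents.
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                crc.update(buf, 0, n);
            }
        }
        return crc.getValue();
    }

    // Map checksum -> files; any bucket with 2+ entries is a duplicate set.
    static Map<Long, List<Path>> findDuplicates(List<Path> files) throws IOException {
        Map<Long, List<Path>> byChecksum = new HashMap<>();
        for (Path f : files) {
            byChecksum.computeIfAbsent(checksum(f), k -> new ArrayList<>()).add(f);
        }
        byChecksum.values().removeIf(group -> group.size() < 2);
        return byChecksum;
    }

    public static void main(String[] args) throws IOException {
        // Demo on three temp files: two identical, one different.
        Path dir = Files.createTempDirectory("dedupe");
        Path a = Files.write(dir.resolve("a.txt"), "same bytes".getBytes());
        Path b = Files.write(dir.resolve("b.txt"), "same bytes".getBytes());
        Path c = Files.write(dir.resolve("c.txt"), "different".getBytes());
        for (List<Path> group : findDuplicates(Arrays.asList(a, b, c)).values()) {
            System.out.println(group);   // prints the one duplicate group [a, b]
        }
    }
}
```

Note that this approach reads every file in full, even files that could not possibly have a duplicate, which is what made version 1.0 slow.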
Version 2.0 compares file sizes first to greatly reduce the number of files that require checksums. It runs 9x-11x faster on the test directories; use this version for your own needs. The optimization that delivered this benefit is described in this article in Oracle's Java Magazine.
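The size-comparison optimization rests on a simple observation: two files can only be duplicates if they are the same size, and file sizes come from directory metadata, so no file contents need to be read. The sketch below (hypothetical names, not FileDedupe's actual code) buckets files by size and keeps only the files that share a size with at least one other file; only those candidates would go on to the expensive checksum step.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizePrefilter {

    // Return only the files that share their size with another file --
    // the candidates that still need a (comparatively expensive) checksum.
    // Files with a unique size cannot have a duplicate and are skipped.
    static List<Path> checksumCandidates(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        List<Path> candidates = new ArrayList<>();
        for (List<Path> group : bySize.values()) {
            if (group.size() > 1) {
                candidates.addAll(group);   // same size: possible duplicates
            }
        }
        return candidates;
    }

    public static void main(String[] args) throws IOException {
        // Demo: two 10-byte files (candidates) and one 18-byte file (skipped).
        Path dir = Files.createTempDirectory("prefilter");
        List<Path> files = new ArrayList<>();
        files.add(Files.write(dir.resolve("a.txt"), "ten bytes!".getBytes()));
        files.add(Files.write(dir.resolve("b.txt"), "other 10bb".getBytes()));
        files.add(Files.write(dir.resolve("c.txt"), "a much longer file".getBytes()));
        System.out.println(checksumCandidates(files).size() + " candidates");
    }
}
```

Because most files in a typical directory tree have a unique size, the checksum step runs on only a small fraction of the files, which is where the 9x-11x speedup comes from.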
FileDedupe is written in Java 8. To run it, run the JAR file with the directory or directories to scan for duplicates. Note that the current directory (.) is supported.
Options:
-nosubdirs : prevents FileDedupe from checking subdirectories for duplicates
-help, -h, or --h : shows this usage information
So, to run the utility on the current directory:
java -jar filededupe-2.0.jar .
The tests included here achieve 80% code coverage, and FileDedupe has been tested repeatedly on directories of more than 600,000 files.
David V. Saraiva forked the code presented here and added the ability to generate an HTML report of duplicates. His repository is here.
Thanks to Oracle's Java Magazine for publishing the articles on this utility.
Thanks to JetBrains for supporting open source by providing a license to IntelliJ IDEA, which is an IDE that I have used since version 3.5.