Utility to list duplicate files in one or more directories based on the file contents

platypusguy/FileDedupe

FileDedupe

Utility to list duplicate files in one or more directories based on the file contents (rather than the name).

What it is

FileDedupe is a utility that checks one or more directories for duplicate files. Just run it with a list of directories on the command line. By default, all subdirectories are checked as well; this can be controlled (see below). The output is plain text, written to stdout, that lists the names of files that have duplicates. Each file is given followed by its duplicates.

An article on this utility and how it was designed and written appears in Oracle's Java Magazine.

Version 1.0 used a brute-force approach of running checksums on every file in the user-specified directories and then comparing the checksums to identify duplicates. This worked well, but was slow.
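The brute-force idea can be sketched roughly as follows. This is an illustration only, not FileDedupe's actual source: the class name `ChecksumDuplicates`, its method names, and the choice of CRC32 as the checksum are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

// Hypothetical sketch of the version 1.0 approach: checksum every file.
final class ChecksumDuplicates {

    // Streams the file through CRC32 (an assumed checksum) in 8 KB chunks,
    // so large files are never loaded into memory all at once.
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    // Checksums every regular file under root and returns each group of
    // two or more files that share a checksum.
    static List<List<Path>> findDuplicates(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<Long, List<Path>> byChecksum = new HashMap<>();
        for (Path file : files) {
            byChecksum.computeIfAbsent(checksum(file), k -> new ArrayList<>()).add(file);
        }
        List<List<Path>> groups = new ArrayList<>();
        for (List<Path> group : byChecksum.values()) {
            if (group.size() > 1) {
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (List<Path> group : findDuplicates(root)) {
            System.out.println(group);
        }
    }
}
```

Every file is read in full, which is why this approach is slow on large directory trees. Note also that a 32-bit checksum such as CRC32 can collide, so a sketch like this may occasionally report files as duplicates when they are not.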

Version 2.0 compares file sizes first, which greatly reduces the number of files that require checksums: a file whose size matches no other file cannot have a duplicate. It runs 9x-11x faster on the test directories. Use this version for your own needs. The optimization that delivered this benefit is described in this article in Oracle's Java Magazine.
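A minimal sketch of the size-first optimization is shown below. It is not taken from FileDedupe's source: the class name `SizeThenChecksum`, the method names, and the use of CRC32 are assumptions for illustration. Files are grouped by size using metadata only, and checksums are computed only within groups that contain more than one file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch of the version 2.0 approach: sizes first, checksums second.
final class SizeThenChecksum {

    // Streams the file through CRC32 (an assumed checksum) in 8 KB chunks.
    static long checksum(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    static List<List<Path>> findDuplicates(List<Path> files) throws IOException {
        // Pass 1: group by size. This reads file metadata only, not contents.
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path file : files) {
            bySize.computeIfAbsent(Files.size(file), k -> new ArrayList<>()).add(file);
        }

        // Pass 2: checksum only the files that share a size with another file.
        List<List<Path>> groups = new ArrayList<>();
        for (List<Path> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue; // a unique size cannot have a duplicate
            }
            Map<Long, List<Path>> byChecksum = new HashMap<>();
            for (Path file : sameSize) {
                byChecksum.computeIfAbsent(checksum(file), k -> new ArrayList<>()).add(file);
            }
            for (List<Path> group : byChecksum.values()) {
                if (group.size() > 1) {
                    groups.add(group);
                }
            }
        }
        return groups;
    }
}
```

The saving comes from pass 1: in a typical directory tree most files have a size that no other file shares, so their contents are never read.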

How to run

FileDedupe is written in Java 8. To run it, run the JAR file with the directory or directories to scan for duplicates. Note that a directory of . (the current directory) is supported. Options:

-nosubdirs: prevents FileDedupe from checking subdirectories for duplicates.

-help or -h or --h: shows this usage information.

So, to run the utility on the current directory:

java -jar filededupe-2.0.jar .
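As an illustration of what the -nosubdirs option implies, the sketch below lists the files in a directory either recursively or at the top level only. How FileDedupe itself implements the flag is not stated here; the class name `FileLister` and the method `listFiles` are hypothetical, and the depth limit on `Files.walk` is just one way to get this behavior.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects the regular files to be checked for duplicates.
final class FileLister {

    // When includeSubdirs is false, the walk is limited to depth 1,
    // i.e. the directory itself and its direct children only.
    static List<Path> listFiles(Path dir, boolean includeSubdirs) throws IOException {
        int maxDepth = includeSubdirs ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(dir, maxDepth)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }
}
```

With a helper like this, the default behavior corresponds to `listFiles(dir, true)` and -nosubdirs to `listFiles(dir, false)`.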

Testing

The tests included here provide code coverage of 80%. In addition, FileDedupe has been tested repeatedly on directories of more than 600,000 files.

Extension: HTML Report

David V. Saraiva forked the code presented here and added the ability to generate an HTML report of duplicates. His repository is here.

Thanks

Thanks to Oracle's Java Magazine for publishing the articles on this utility.

Thanks to JetBrains for supporting open source by providing a license to IntelliJ IDEA, which is an IDE that I have used since version 3.5.