.TH JDUPES 1
.\" NAME should be all caps, SECTION should be 1-8, maybe w/ subsection
.\" other parms are allowed: see man(7), man(1)
.SH NAME
jdupes \- finds and performs actions upon duplicate files
.SH SYNOPSIS
.B jdupes
[
.I options
]
.I DIRECTORIES
\|.\|.\|.
.SH "DESCRIPTION"
Searches the given path(s) for duplicate files. Such files are found by
comparing file sizes, then partial and full file hashes, followed by a
byte-by-byte comparison. The default behavior with no other "action
options" specified (delete, summarize, link, dedupe, etc.) is to print
sets of matching files.
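.PP
The staged comparison described above can be illustrated with a minimal
Python sketch. It is not the jdupes implementation: the function names,
the use of SHA-256 as a stand-in hash, and the 4096-byte partial block
are assumptions made for illustration only. Zero-length files are
skipped, as they are unless
.B \-z
is given.
.PP
.nf
```python
import filecmp
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 4096  # assumed size of the partial-hash block


def digest(path, limit=None):
    """Hash the whole file, or only its first `limit` bytes."""
    h = hashlib.sha256()  # stand-in hash, not the one jdupes uses
    remaining = limit
    with open(path, "rb") as f:
        while True:
            want = 65536 if remaining is None else min(65536, remaining)
            if want == 0:
                break
            chunk = f.read(want)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return h.hexdigest()


def partial_digest(path):
    """Hash only the first block of the file."""
    return digest(path, PARTIAL_BYTES)


def split_by(paths, key):
    """Group paths by key(path) and drop groups with a single member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    candidates = [[p for p in paths if os.path.getsize(p) > 0]]
    # Cheap checks first: size, then partial hash, then full hash.
    for key in (os.path.getsize, partial_digest, digest):
        candidates = [g for group in candidates for g in split_by(group, key)]
    results = []
    for group in candidates:
        # Final failsafe: confirm every hash match byte by byte.
        remaining = sorted(group)
        while remaining:
            first, rest = remaining[0], remaining[1:]
            same = [first] + [
                p for p in rest if filecmp.cmp(first, p, shallow=False)
            ]
            if len(same) > 1:
                results.append(same)
            remaining = [p for p in rest if p not in same]
    return results
```
.fi
.PP
Each stage only examines files that survived the previous one, so the
expensive full reads are reserved for files that already agree in size
and starting data.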
.SH OPTIONS
.TP
.B -@ --loud
output annoying low-level debug info while running
.TP
.B -0 --print-null
when printing matches, use null bytes instead of CR/LF bytes, just
like 'find -print0' does. This has no effect with any action mode other
than the default "print matches" (delete, link, etc. will still print
normal line endings in the output.)
.TP
.B -1 --one-file-system
do not match files that are on different filesystems or devices
.TP
.B -A --no-hidden
exclude hidden files from consideration
.TP
.B -B --dedupe
call same-extents ioctl or clonefile() to trigger a filesystem-level
data deduplication on disk (known as copy-on-write, CoW, cloning, or
reflink); only a few filesystems support this (BTRFS; XFS when mkfs.xfs
was used with -m crc=1,reflink=1; Apple APFS)
.TP
.B -C --chunk-size=\fInumber-of-KiB\fR
set the I/O chunk size manually; larger values may improve performance
on rotating media by reducing the number of head seeks required, but
also increases memory usage and can reduce performance in some cases
.TP
.B -D --debug
if this feature is compiled in, show debugging statistics and info
at the end of program execution
.TP
.B -d --delete
prompt user for files to preserve, deleting all others (see
.B CAVEATS
below)
.TP
.B -e --error-on-dupe
exit on any duplicate found with status code 255
.TP
.B -f --omit-first
omit the first file in each set of matches
.TP
.B -H --hard-links
normally, when two or more files point to the same disk area they are
treated as non-duplicates; this option treats them as duplicates instead
.TP
.B -h --help
displays help
.TP
.B -i --reverse
reverse (invert) the sort order of matches
.TP
.B -I --isolate
isolate each command-line parameter from one another; only match if the
files are under different parameter specifications
.TP
.B -j --json
produce JSON (machine-readable) output
.TP
.B -L --link-hard
replace all duplicate files with hardlinks to the first file in each set
of duplicates
.TP
.B -m --summarize
summarize duplicate file information
.TP
.B -M --print-summarize
print matches and summarize the duplicate file information at the end
.TP
.B -N --no-prompt
when used together with \-\-delete, preserve the first file in each set of
duplicates and delete the others without prompting the user
.TP
.B -O --param-order
parameter order preservation is more important than the chosen sort; this
is particularly useful with the \fB\-N\fP option to ensure that automatic
deletion behaves in a controllable way
.TP
.B -o --order\fR=\fIWORD\fR
order files according to WORD:
.br
time - sort by modification time
.br
name - sort by filename (default)
.TP
.B -p --permissions
don't consider files with different owner/group or permission bits as
duplicates
.TP
.B -P --print=type
print extra information to stdout; valid options are:
.br
early - matches that pass early size/permission/link/etc. checks
.br
partial - files whose partial hashes match
.br
fullhash - files whose full hashes match
.TP
.B -Q --quick
.B [WARNING: RISK OF DATA LOSS, SEE CAVEATS]
skip byte-for-byte verification of duplicate pairs (use hashes only)
.TP
.B -q --quiet
hide progress indicator
.TP
.B -R --recurse:
for each directory given after this option follow subdirectories
encountered within (note the ':' at the end of option; see the
Examples section below for further explanation)
.TP
.B -r --recurse
for every directory given follow subdirectories encountered within
.TP
.B -l --link-soft
replace all duplicate files with symlinks to the first file in each set
of duplicates
.TP
.B -S --size
show size of duplicate files
.TP
.B -s --symlinks
follow symlinked directories
.TP
.B -T --partial-only
.B [WARNING: EXTREME RISK OF DATA LOSS, SEE CAVEATS]
match based on hash of first block of file data, ignoring the rest
.TP
.B -U --no-trav-check
disable double-traversal safety check (BE VERY CAREFUL)
.TP
.B -u --print-unique
print only a list of unique (non-duplicate, unmatched) files
.TP
.B -v --version
display jdupes version and compilation feature flags
.TP
.B -y --hash-db=file
create/use a hash database text file to speed up future runs by
caching file hash data
.TP
.B -X --ext-filter=spec:info
exclude/filter files based on specified criteria; general format:
.B jdupes -X filter[:value][size_suffix]
Some filters take no value or multiple values. Filters that can take
a numeric option generally support the size multipliers K/M/G/T/P/E
with or without an added iB or B. Multipliers are binary-style unless
the B suffix (without the i) is used, which selects decimal multipliers. For example,
16k or 16kib = 16384; 16kb = 16000. Multipliers are case-insensitive.
Filters have cumulative effects: jdupes -X size+:99 -X size-:101 will
cause only files of exactly 100 bytes in size to be included.
Extension matching is case-insensitive.
Path substring matching is case-sensitive.
Supported filters are:
.RS
.IP `size[+-=]:number[suffix]'
match only if size is greater than (+), less than (-), or equal to (=) the
specified number. The +/- and = specifiers can be combined, e.g.
"size+=:4K" will only consider files with a size greater than or equal
to 4 KiB (4096 bytes).
.IP `noext:ext1[,ext2,...]'
exclude files with certain extension(s), specified as a comma-separated
list. Do not use a leading dot.
.IP `onlyext:ext1[,ext2,...]'
only include files with certain extension(s), specified as a comma-separated
list. Do not use a leading dot.
.IP `nostr:text_string'
exclude all paths containing the substring text_string. This scans the full
file path, so it can be used to match directories: -X nostr:dir_name/
.IP `onlystr:text_string'
require all paths to contain the substring text_string. This scans the full
file path, so it can be used to match directories: -X onlystr:dir_name/
.IP `newer:datetime'
only include files newer than specified date.
Date/time format: "YYYY-MM-DD HH:MM:SS" (time is optional).
.IP `older:datetime'
only include files older than specified date.
Date/time format: "YYYY-MM-DD HH:MM:SS" (time is optional).
.RE
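.PP
The size suffix and cumulative filter rules described above can be
sketched in Python as follows. This is an illustration of the documented
semantics only, not the jdupes parser; the names parse_size,
size_matches, and keep are hypothetical.
.PP
.nf
```python
import operator
import re

UNITS = "KMGTPE"
OPS = {"+": operator.gt, "-": operator.lt, "=": operator.eq}


def parse_size(text):
    """Convert a string such as "16k", "16KiB" or "16kb" to bytes."""
    m = re.fullmatch("([0-9]+)([kmgtpe]?)(ib|b)?", text.strip().lower())
    if m is None:
        raise ValueError("unrecognized size: " + text)
    number, unit, tail = m.groups()
    if not unit:
        if tail == "ib":
            raise ValueError("unrecognized size: " + text)
        return int(number)
    # A bare B suffix selects decimal multipliers; otherwise binary.
    base = 1000 if tail == "b" else 1024
    return int(number) * base ** (UNITS.index(unit.upper()) + 1)


def size_matches(size, spec):
    """Test one filter such as "size+=:4K"; combined specifiers are OR-ed."""
    name, _, value = spec.partition(":")
    limit = parse_size(value)
    flags = name[len("size"):]
    return any(OPS[flag](size, limit) for flag in flags)


def keep(size, specs):
    """Filters are cumulative: a file must pass every one of them."""
    return all(size_matches(size, spec) for spec in specs)
```
.fi
.PP
Under these assumptions, keep(100, ["size+:99", "size-:101"]) is true,
while sizes of 99 or 101 bytes are rejected.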
.TP
.B -z --zero-match
consider zero-length files to be duplicates; this replaces the old
default behavior when \fB\-n\fP was not specified
.TP
.B -Z --soft-abort
if the user aborts the program (as with CTRL-C), act on the matches that
were found before the abort was received. For example, if -L and -Z are
specified, all matches found prior to the abort will be hard linked. The
default behavior without -Z is to abort without taking any actions.
.SH NOTES
A set of arrows is used when linking to show what action was taken on
each link candidate. These arrows are as follows:
.TP
.B ---->
This file was successfully hard linked to the first file in the duplicate
chain
.TP
.B -@@->
This file was successfully symlinked to the first file in the chain
.TP
.B -##->
This file was successfully cloned from the first file in the chain
.TP
.B -==->
This file was already a hard link to the first file in the chain
.TP
.B -//->
Linking this file failed due to an error during the linking process
.PP
Duplicate files are listed together in groups with each file displayed on a
separate line. The groups are then separated from each other by blank lines.
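.PP
A script that consumes the default output can rebuild the groups by
splitting on blank lines. The following Python sketch assumes that no
file name contains a newline; the function name parse_match_sets is
hypothetical.
.PP
.nf
```python
def parse_match_sets(text):
    """Split default match output into one list of paths per group."""
    sets = []
    current = []
    for line in text.splitlines():
        if line:
            current.append(line)
        elif current:
            # A blank line closes the current group.
            sets.append(current)
            current = []
    if current:
        sets.append(current)
    return sets
```
.fi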
.SH EXAMPLES
.TP
.B jdupes a --recurse: b
will follow subdirectories under b, but not those under a.
.TP
.B jdupes a --recurse b
will follow subdirectories under both a and b.
.TP
.B jdupes -O dir1 dir3 dir2
will always place 'dir1' results first in any match set (where relevant)
.SH CAVEATS
Using
.B \-1
or
.BR \-\-one\-file\-system
prevents matches that cross filesystems, but a more relaxed form of this
option may be added that allows cross-matching for all filesystems that
each parameter is present on.
.PP
When using
.B \-d
or
.BR \-\-delete ,
care should be taken to guard against accidental data loss.
.PP
.B \-Z
or
.BR \-\-soft\-abort
used to be --hardabort in jdupes prior to v1.5 and had the opposite behavior.
Defaulting to taking action on abort is probably not what most users would
expect. The decision to invert rather than reassign to a different option
was made because this feature was still fairly new at the time of the change.
.PP
The
.B \-O
or
.BR \-\-param\-order
option allows the user greater control over what appears in the first
position of a match set, specifically for keeping the \fB\-N\fP option
from deleting all but one file in a set in a seemingly random way. All
directories specified on the command line will be used as the sorting
order of result sets first, followed by the sorting algorithm set by
the \fB\-o\fP or \fB\-\-order\fP option. This means that the order of
all match pairs for a single directory specification will retain the
old sorting behavior even if this option is specified.
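.PP
The two-level ordering described above can be sketched in Python. This
is an illustration only: it assumes each file is tagged with the index
of the command-line parameter it was found under, and it uses the full
path as a stand-in for the name sort. The function name sort_match_set
is hypothetical.
.PP
.nf
```python
def sort_match_set(files, param_order=False):
    """Sort one match set given as (param_index, path) pairs."""
    if param_order:
        # Parameter position first, then the ordinary sort within it.
        return sorted(files, key=lambda f: (f[0], f[1]))
    return sorted(files, key=lambda f: f[1])
```
.fi
.PP
With the parameters dir1 dir3 dir2, a file found under dir3 sorts ahead
of one found under dir2 only when param_order is true.
.PP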
.PP
When
.B \-\-delete
is used together with
.B \-s
or
.BR \-\-symlinks ,
a user could accidentally preserve a symlink while deleting the file it
points to.
.PP
The
.B \-Q
or
.BR \-\-quick
option only reads each file once, hashes it, and performs comparisons
based solely on the hashes. There is a small but significant risk of a
hash collision; guarding against that risk is the purpose of the failsafe
byte-for-byte comparison that this option explicitly bypasses. Do not use it on ANY data
set for which any amount of data loss is unacceptable. This option is not
included in the help text for the program due to its risky nature.
.B You have been warned!
.PP
The
.B \-T
or
.BR \-\-partial\-only
option produces results based on a hash of the first block of file data
in each file, ignoring everything else in the file. Partial hash checks
have always been an important exclusion step in the jdupes algorithm,
usually hashing the first 4096 bytes of data and allowing files that are
different at the start to be rejected early. In certain scenarios it may
be a useful heuristic for a user to see that a set of files has the same
size and the same starting data, even if the remaining data does not
match; one example of this would be comparing files with data blocks that
are damaged or missing such as an incomplete file transfer or checking a
data recovery against known-good copies to see what damaged data can be
deleted in favor of restoring the known-good copy. This option is meant
to be used with informational actions and
.B can result in EXTREME DATA LOSS
if used with options that delete files, create hard links, or perform
other destructive actions on data based on the matching output. Because
of the potential for massive data destruction,
.B this option MUST BE SPECIFIED TWICE
to take effect and will error out if it is only specified once.
.PP
Using the
.B \-C
or
.BR \-\-chunk\-size
option to override I/O chunk size can increase performance on rotating
storage media by reducing "head thrashing," reading larger amounts of
data sequentially from each file. This tunable size can have bad side
effects; the default size maximizes algorithmic performance without
regard to the I/O characteristics of any given device and uses a modest
amount of memory, but other values may greatly increase memory usage or
incur a lot more system call overhead. Try several different values to
see how they affect performance for your hardware and data set. This
option does not affect match results in any way, so even if it slows
down the file matching process it will not hurt anything.
.PP
The
.B \-y
or
.BR \-\-hash\-db
feature creates and maintains a text file with a list of
file paths, hashes, and other metadata that enables jdupes to "remember" file
data across runs. Specifying a period '.' as the database file name will use a
name of "jdupes_hashdb.txt" instead; this alias makes it easy to use the hash
database feature without typing a descriptive name each time. THIS FEATURE IS
CURRENTLY UNDER DEVELOPMENT AND HAS MANY QUIRKS. USE IT AT YOUR OWN RISK. In
particular, one of the biggest problems with this feature is that it stores
every path exactly as specified on the command line; if any paths are passed
into jdupes on a subsequent run with a different prefix then they will not be
recognized and they will be treated as totally different files. For example,
running \fBjdupes \-y . foo/\fP is not the same as \fBjdupes \-y . ./foo\fP nor the same
as (from a sibling directory) \fBjdupes \-y ../foo\fP. You must run jdupes from the
same working directory and with the same path specifications to take advantage
of the hash database feature. When used correctly, a fully populated hash
database can reduce subsequent runs with hundreds of thousands of files that
normally take a very long time to run down to the directory scanning time plus
a couple of seconds. If the directory data is already in the OS disk cache,
this can make subsequent runs with over 100K files finish in under one second.
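.PP
The path prefix problem described above comes down to exact string
matching. The following Python sketch uses a hypothetical in-memory
cache keyed by the path as typed; it does not reflect the actual hash
database format.
.PP
.nf
```python
import os

# Hypothetical cache keyed by the path exactly as it was typed.
cache = {"foo/a.txt": "0123abcd"}


def cached_hash(path):
    """Return the stored hash for path, or None if the string is unknown."""
    return cache.get(path)


same_spelling = cached_hash("foo/a.txt")      # found
other_spelling = cached_hash("./foo/a.txt")   # None: a different string
# Both spellings name the same file once normalized (POSIX separators).
normalized = os.path.normpath("./foo/a.txt")
```
.fi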
.SH REPORTING BUGS
Send bug reports and feature requests to [email protected], or for general
information and help, visit www.jdupes.com
.SH SUPPORTING DEVELOPMENT
If you find this software useful, please consider financially supporting
its development through the author's home page:
https://www.jodybruchon.com/
.SH AUTHOR
jdupes is created and maintained by Jody Bruchon <[email protected]>
and was forked from fdupes 1.51 by Adrian Lopez <[email protected]>
.SH LICENSE
MIT License
.PP
Copyright (c) 2015-2023 Jody Lee Bruchon <[email protected]>
.PP
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
.PP
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
.PP
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.