version 1.1
@totalColumns 18
/*-----------------------------------------------------------------------------\
| This schema is designed to run on a CSV exported from The National Archives' |
| DROID file format identification tool profile. It will report an error for |
| any duplicate files within the report, allowing you to deduplicate material |
| held within the area of interest. |
| |
| When exporting the CSV from DROID, choose the option to have one row per |
| format. |
| |
| The error report will give the data line number of the error, and also the |
| data line number where the CSV Validator first encountered that checksum. |
| These line numbers do not include the header row, so if viewing your CSV file|
| in Excel or OpenOffice Calc you will need to add one to the reported line |
| number to find the relevant entry. |
| |
| As folders do not have a checksum, we exclude them from the uniqueness check |
| by examining the URI field: if the URI ends with a slash, we check only that |
| the SHA256_HASH is empty. If the RESOURCE column from DROID were included in |
| the output, the condition $RESOURCE/is("folder") could be used instead. |
| |
| To handle the case of files identifying with multiple possible formats, we |
| check the checksum for uniqueness only for the first identification. |
| |
| Authors: |
| Rachel MacGregor, Modern Records Centre, University of Warwick |
| David Underdown, The National Archives |
\-----------------------------------------------------------------------------*/
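/*
 A sketch of the alternative folder test mentioned in the header comment. It is
 not part of the schema below. It assumes the export contains an extra column
 named RESOURCE whose value is exactly "folder" for folders; both the name and
 the value are taken from the note above and have not been checked against a
 real DROID export. The column would also need its own definition, and
 @totalColumns would have to rise to 19.

 RESOURCE:
 SHA256_HASH: if($RESOURCE/is("folder"),empty,if($FORMAT_COUNT/is("1"),unique))

 is() compares case-sensitively, so the quoted value must match the exported
 text exactly.
*/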
ID:
PARENT_ID:
URI:
FILE_PATH:
NAME:
METHOD:
STATUS:
SIZE:
TYPE:
EXT:
LAST_MODIFIED:
EXTENSION_MISMATCH:
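// Folders (URI ends with "/") do not have a checksum, so the hash must be empty.
// Otherwise check uniqueness only where FORMAT_COUNT is 1, ignoring additional format IDs.
SHA256_HASH: if($URI/ends("/"),empty,if($FORMAT_COUNT/is("1"),unique))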
FORMAT_COUNT:
PUID:
MIME_TYPE:
FORMAT_NAME:
FORMAT_VERSION: