Tools for working with MARC data in Catalogue Bridge.
Borrows heavily from PyMarc (https://pypi.org/project/pymarc/).
Requires the regex module from https://bitbucket.org/mrabarnett/mrab-regex. The built-in re module is not sufficient.
From GitHub:
python -m pip install git+https://github.com/victoriamorris/CatBridge.git@main
To create stand-alone executable (.exe) files for individual scripts from downloaded source code:
python -m PyInstaller bin/<script_name>.py -F
Executable files will be created in the folder \dist, and should be copied to an executable path.
Both of the above commands can be carried out by running the shell script:
compile_catbridge_tools.sh
The scripts listed below can be run from anywhere, once the package is installed and the .exe files have been copied to an executable path.
Original Catalogue Bridge tool | New tool | Original syntax | Corresponding new syntax |
---|---|---|---|
cn-find | cn_find | CN-FIND <infile> <outfile> <configfile> | cn_find -i <input_file> [<input_file> ...] -o <output_file> -c <config_file> |
cn-tidy | cn_find | CN-TIDY <infile> | cn_find -i <input_file> [<input_file> ...] -o <output_file> -c <config_file> --tidy |
del-fld | keep_fld | DEL-FLD <infile> <configfile> | keep_fld -i <input_file> [<input_file> ...] -c <config_file> --delete |
del-fld2 | keep_fld | DEL-FLD2 <infile> <configfile> | keep_fld -i <input_file> [<input_file> ...] -c <config_file> --delete |
fix-fmt | fix_fmt | FIX-FMT <marcfile> | fix_fmt -i <input_file> [<input_file> ...] |
keep-fld | keep_fld | KEEP-FLD <infile> <configfile> | keep_fld -i <input_file> [<input_file> ...] -c <config_file> |
keep-fld2 | keep_fld | KEEP-FLD2 <infile> <configfile> | keep_fld -i <input_file> [<input_file> ...] -c <config_file> |
marc-chk | marc_check | MARC-CHK <infile> | marc_check -i <input_file> [<input_file> ...] |
marccount | marc_count | MARCCOUNT <infile> [<infile>] | marc_count -i <input_file> [<input_file> ...] |
Unless otherwise specified, MARC files are in MARC 21 format, with .lex file extensions. Unless otherwise specified, text files are UTF-8-encoded, with .txt, .csv or .tsv file extensions. Config files are also text files, but may have the file extension .cfg for convenience.
For any script, use the option --help, or run the script without arguments/options, to display help text.
Logs will be written to catbridge.log within the working directory. This is a UTF-8-encoded text file and can be read in any text editor. The default logging level is INFO; if option --debug is set, the logging level is changed to DEBUG. See https://docs.python.org/3/library/logging.html#levels for information about logging levels.
Command line arguments may be provided in any order.
For the purposes of these scripts, a field tag is interpreted as a control field tag if and only if it (a) takes a numerical value starting with two zeros, or (b) is either of the Aleph control fields "DB " or "SYS".
- Missing indicators are recorded as blank spaces (data fields only)
- Extra indicators are ignored (data fields only)
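The control field rule and the indicator handling described above can be sketched as follows. This is a minimal illustration rather than the package's own code; the function names `is_control_field_tag` and `normalise_indicators` are hypothetical.

```python
def is_control_field_tag(tag: str) -> bool:
    """Return True if the tag counts as a control field tag under the rule above."""
    # (b) The two Aleph control fields; note the trailing space in "DB ".
    if tag in ("DB ", "SYS"):
        return True
    # (a) A three-digit numeric tag starting with two zeros, e.g. "001" or "008".
    return (
        len(tag) == 3
        and all(char in "0123456789" for char in tag)
        and tag.startswith("00")
    )


def normalise_indicators(indicators: str) -> str:
    """Pad missing indicators with blanks and drop any beyond the second."""
    return (indicators + "  ")[:2]
```

Under this sketch, FMT is not treated as a control field tag, and a data field supplied with a single indicator "1" is stored with indicators "1" and blank.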
cn_find is a utility which extracts control numbers from specified fields and subfields within a file of MARC records.
The fields and subfields to be extracted are specified in a config file.
Usage: cn_find -i <input_file> [<input_file> ...] -o <output_file> -c <config_file> [options]
Options:
--conv Convert 10-digit ISBNs to 13-digit form where possible
--rid Include record ID as the first column of the output file
--tidy Sort and de-duplicate list
--debug Debug mode
--help Show help message and exit
<input_file> is the name of the input file, which must be a file of MARC 21 records.
Multiple input files may be listed. E.g.
cn_find -i file1.lex file2.lex file3.lex -o output.txt -c config.cfg
Wildcard characters may be used. E.g.
cn_find -i file*.lex -o output.txt -c config.cfg
<output_file> is the name of the file to which the control numbers will be written. This should be a text file.
<config_file> is the name of the file containing the configuration directives.
The format of the configuration file is as follows, with one entry per line:
=CONTROL_FIELD_TAG [double space] control_number_specification
or
=DATA_FIELD_TAG [double space] indicator_values $subfield_code control_number_specification
Each line must match one of the two regular expressions below. The first is for control fields, the second is for data fields.
^=(00[0-9]|[A-Z]{3})\s*\t(.*?)\s*$
^=[0-9A-Z]{3}  [0-9*#][0-9*#]\$[a-z0-9*]\t(.*?)\s*$
Note that for data fields there are two spaces between the field tag and the indicators.
The field tag is specified using three digits or UPPERCASE letters.
For data fields, indicators are each specified using a single digit, or the character # to specify a blank indicator. The wildcard character * may be used to mean any indicator value.
For data fields, the subfield code is specified using a single digit or lowercase letter. Alternatively, $* may be used to mean all subfield codes.
The control number specification tells the script what kind of control number to search for within the subfield. This can either take a value from a pre-defined list, or a regular expression can be used to search for control numbers with any other structure. The dollar sign ($) may thus function as either a regular expression metacharacter or a subfield delimiter, depending on context. Regular expressions are case-sensitive.
Control number specification | Description | Regular expression |
---|---|---|
ISBN | Any structurally plausible ISBN* | \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b|\b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b |
ISBN10 | Any structurally plausible 10-digit ISBN* | \b(?=(?:[0-9]+[- ]?){10})[0-9]{9}[0-9Xx]\b|\b(?=(?:[0-9]+[- ]?){13})[0-9]{1,5}[- ][0-9]+[- ][0-9]+[- ][0-9Xx]\b |
ISBN13 | Any structurally plausible 13-digit ISBN* | \b97[89][0-9]{10}\b|\b(?=(?:[0-9]+[- ]){4})97[89][- 0-9]{13}[0-9]\b |
ISSN | 8 characters in two groups of 4, optionally separated by a hyphen or space, where the last character may be an X | \b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b |
BL001 | 9 digits | \b[0-9]{9}\b |
BNB | See https://www.bl.uk/collection-metadata/metadata-services/structure-of-the-bnb-number | \bGB([0-9]{7}|[A-Z][0-9][A-Z0-9][0-9]{4})\b |
LCCN | See https://www.loc.gov/marc/bibliographic/bd010.html | \b[a-z][a-z ][a-z ]?[0-9]{2}[0-9]{6} ?\b |
OCLC | "(OCoLC)" followed by digits | \(OCoLC\)[0-9]+\b |
ISNI | 16 characters beginning with 0000, in groups of 4 optionally separated by spaces or hyphens, where the last character may be an X | \b[0]{4}[ -]?[0-9]{4}[ -]?[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b |
FAST | "fst" followed by 8 digits | \bfst[0-9]{8}\b |
*Note: The ISBN check digit is not validated.
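To illustrate how a control number specification is applied to subfield text, the sketch below copies three of the patterns from the table and falls back to treating any other specification as a regular expression. It uses the built-in re module, which happens to be enough for these three patterns; the name `find_control_numbers` is hypothetical and the tool's real matching logic may differ.

```python
import re

# Three of the pre-defined patterns, copied from the table above.
PATTERNS = {
    "ISSN": re.compile(r"\b[0-9]{4}[ -]?[0-9]{3}[0-9Xx]\b"),
    "BNB": re.compile(r"\bGB([0-9]{7}|[A-Z][0-9][A-Z0-9][0-9]{4})\b"),
    "FAST": re.compile(r"\bfst[0-9]{8}\b"),
}


def find_control_numbers(spec, text):
    """Return every match of the named or literal pattern within text."""
    # Anything not in the pre-defined list is treated as a regular expression.
    pattern = PATTERNS.get(spec) or re.compile(spec)
    # group(0) is the whole match, regardless of any groups inside the pattern.
    return [match.group(0) for match in pattern.finditer(text)]


print(find_control_numbers("ISSN", "ISSN 1234-567X (online)"))
```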
Multiple fields and subfields may be specified. Fields may be repeated with different subfields.
Example:
=001 BL001
=015 **$a BNB
=020 **$a ISBN
=020 ##$z ISBN10
=500 ##$a \b[a-z]{7}\b
=650 *7$0 FAST
In the example above, field 500 subfield $a is being searched for words of exactly seven lowercase letters.
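A config line can be checked and split into its parts with the two regular expressions given earlier. The sketch below is an illustration only: it assumes two spaces after a data field tag and a tab before the specification, as the regular expressions describe, and it adds capture groups that are not in the original patterns. The name `parse_config_line` and the dictionary keys are hypothetical.

```python
import re

# Control field line: tag, tab, specification.
CONTROL_LINE = re.compile(r"^=(00[0-9]|[A-Z]{3})\s*\t(.*?)\s*$")
# Data field line: tag, two spaces, two indicators, $code, tab, specification.
DATA_LINE = re.compile(
    r"^=([0-9A-Z]{3})  ([0-9*#])([0-9*#])\$([a-z0-9*])\t(.*?)\s*$"
)


def parse_config_line(line):
    """Return the parts of one config line, or None if it matches neither form."""
    line = line.rstrip("\r\n")
    match = CONTROL_LINE.match(line)
    if match:
        return {"tag": match.group(1), "spec": match.group(2)}
    match = DATA_LINE.match(line)
    if match:
        tag, ind1, ind2, code, spec = match.groups()
        return {"tag": tag, "ind1": ind1, "ind2": ind2, "code": code, "spec": spec}
    return None


print(parse_config_line("=020  ##$z\tISBN10"))
```

Lines that match neither expression return None, so a caller can report them rather than silently skipping them.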
If option --conv is used, 10-digit ISBNs will be converted to 13-digit form whenever possible (i.e. whenever they are valid ISBNs).
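The conversion follows the standard ISBN arithmetic: check the ISBN-10 check digit (weights 10 down to 1, modulo 11), prefix the first nine digits with 978, then compute the ISBN-13 check digit (alternating weights 1 and 3, modulo 10). The sketch below is a stand-alone illustration with a hypothetical function name, not the code behind --conv.

```python
def isbn10_to_isbn13(isbn10):
    """Return the 13-digit form of a valid ISBN-10, or None if it is invalid."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    body10 = chars[:9]
    if len(chars) != 10 or not (body10.isascii() and body10.isdigit()):
        return None
    if chars[9] not in "0123456789X":
        return None
    # Validate the ISBN-10 check digit; X stands for the value 10.
    values = [int(c) for c in body10] + [10 if chars[9] == "X" else int(chars[9])]
    if sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 != 0:
        return None
    # Build the ISBN-13 body and compute its check digit.
    body13 = "978" + body10
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(body13))
    return body13 + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
```

A string that looks like an ISBN-10 but has the wrong check digit returns None, which corresponds to the "whenever they are valid ISBNs" condition above.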
By default, the output file consists of a single column of strings. If option --rid is used, the output file will consist of two columns: the first column will be the record control number from field 001 and the second column will be as per the default output.
If option --tidy is used, the list of control numbers in the output file will be sorted and de-duplicated. Any duplicate control numbers will be written to an additional output file named with the prefix "dp-".
Note: option --tidy cannot be used at the same time as option --rid
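The sort and de-duplicate step can be pictured as below. This sketch assumes that each duplicated control number is reported once in the duplicates list; the README does not say how the "dp-" file is laid out, and the function name `tidy` is hypothetical.

```python
from collections import Counter


def tidy(control_numbers):
    """Return (sorted unique values, sorted values that occurred more than once)."""
    counts = Counter(control_numbers)
    unique = sorted(counts)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return unique, duplicates


unique, duplicates = tidy(["b", "a", "b", "c", "a", "a"])
print(unique, duplicates)
```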
All records exported from Aleph contain a FMT field. This is exported from Aleph as a control field, with no indicator values or subfield codes. However, to be compliant with ISO 2709, control field tags must commence with 00.
fix_fmt is a utility which makes records exported from Aleph ISO 2709-compliant by turning the FMT control field into a data field with two blank indicators and subfield 'a'. Any FMT data fields present in the input record will be copied to the output file without modification.
Usage: fix_fmt -i <input_file> [<input_file> ...] [options]
Options:
--debug Debug mode
--help Show help message and exit
<input_file> is the name of the input file, which must be a file of MARC 21 records as exported from Aleph.
Multiple input files may be listed. E.g.
fix_fmt -i file1.lex file2.lex file3.lex
Wildcard characters may be used. E.g.
fix_fmt -i file*.lex
The utility writes output for each input file to a file with the same name, but prefixed with "fix-".
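The prefix naming convention (used here with "fix-", and by keep_fld with "k-" and "d-") can be sketched with pathlib. The sketch assumes the output file is written alongside the input file, which the README does not state; the function name is hypothetical.

```python
from pathlib import Path


def prefixed_output_path(input_path, prefix):
    """Return the input path with the prefix added to the file name only."""
    path = Path(input_path)
    # with_name keeps the parent directory and swaps in the new file name.
    return path.with_name(prefix + path.name)


print(prefixed_output_path("data/file1.lex", "fix-").name)  # fix-file1.lex
```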
keep_fld is a utility to keep or delete specified fields within one or more file(s) of MARC records.
Usage: keep_fld -i <input_file> [<input_file> ...] -c <config_file> [options]
Options:
--delete *Delete* fields specified in <config_file>
--debug Debug mode
--help Show help message and exit
<input_file> is the name of the input file, which must be a file of MARC 21 records.
Multiple input files may be listed. E.g.
keep_fld -i file1.lex file2.lex file3.lex -c config.cfg
Wildcard characters may be used. E.g.
keep_fld -i file*.lex -c config.cfg
If option --delete is not used, the utility writes output for each input file to a file with the same name, but prefixed with "k-".
If option --delete is used, the utility writes output for each input file to a file with the same name, but prefixed with "d-".
The format of the configuration file is as follows, with one entry per line. This format is chosen to resemble MARCbreaker (MARC screen dump) format.
=CONTROL_FIELD_TAG [double space] content_specification
or
=DATA_FIELD_TAG [double space] indicator_values $subfield_code content_specification
Each line must match one of the two regular expressions below. The first is for control fields, the second is for data fields.
^=(00[0-9]|[A-Z]{3})(  )?(.*?)$
^=[0-9A-Z]{3}  [0-9*#][0-9*#](\$[a-z0-9*] ?.*?)*$
Note that there are two spaces between the field tag and the indicators/control field value.
The field tag is specified using three digits or UPPERCASE letters.
For data fields, indicators are each specified using a single digit, or the character # to specify a blank indicator. The wildcard character * may be used to mean any indicator value.
For data fields, the subfield code is specified using a single digit or lowercase letter. Alternatively, $* may be used to mean all subfield codes.
For data fields, the section $subfield_code content_specification may be omitted in order to keep/delete an entire field, or repeated multiple times in order to provide different instructions for several subfields.
The content specification tells the script what to search for within the subfield. This will be interpreted by the utility as a case-sensitive regular expression. The dollar sign ($) may thus function as either a regular expression metacharacter or a subfield delimiter, depending on context. As a subfield delimiter, the dollar sign will always be followed by a subfield code; as a regular expression metacharacter, it will either be followed by another dollar sign (this one being a subfield delimiter) or will appear at the end of the string.
Note: Field 001 will always be included in the output file.
To keep/delete field 041:
=041 **
To keep/delete field 041 subfield $a:
=041 **$a
To keep/delete field 041 whenever the second indicator is blank:
=041 *#
To keep/delete field 041 subfield $a whenever it contains "eng":
=041 **$aeng
To keep/delete field 041 subfield $a whenever it contains a digit (remember that the content specification is interpreted as a regular expression):
=041 **$a.*[0-9].*
To keep/delete field 041 subfields $a and $h whenever they are equal to "fre" (here the dollar sign is being used as a regular expression metacharacter as well as a subfield delimiter):
=041 **$a^fre$$h^fre$
To keep/delete any subfields of field 041 which contain an uppercase letter:
=041 **$*.*[A-Z].*
By default, only the fields or subfields specified in the config file will be retained in the output.
If option --delete is used, the specified fields and subfields will instead be removed, and all other fields and subfields will be retained in the output. The input file itself is not modified.
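The rule for telling the two roles of the dollar sign apart can be turned into a small parser for data field lines. The sketch below is one possible reading of that rule, not the utility's own parser: a dollar sign is kept as part of the content only when it is followed by another dollar sign or ends the line. It assumes two spaces after the field tag; the names are hypothetical.

```python
import re

# Tag, two spaces, two indicators, then whatever follows.
LINE = re.compile(r"^=([0-9A-Z]{3})  ([0-9*#])([0-9*#])(.*)$")
# A subfield code, then content in which "$" may appear only as a regex
# metacharacter, i.e. directly before another "$" or at the very end.
SUBFIELD = re.compile(r"\$([a-z0-9*]) ?((?:[^$]|\$(?=\$|\Z))*)")


def parse_data_field_line(line):
    """Return (tag, ind1, ind2, [(code, content), ...]) or None if malformed."""
    match = LINE.match(line)
    if match is None:
        return None
    tag, ind1, ind2, rest = match.groups()
    if rest and not rest.startswith("$"):
        return None
    return tag, ind1, ind2, SUBFIELD.findall(rest)


print(parse_data_field_line("=041  **$a^fre$$h^fre$"))
```

For the last-but-one example above, this yields two subfield instructions, `('a', '^fre$')` and `('h', '^fre$')`; a line with indicators only yields an empty list, meaning the whole field is kept or deleted.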
marc_check is a utility which checks the structural validity of records present within one or more file(s) of MARC records, and isolates those found to be flawed.
Usage: marc_check -i <input_file> [<input_file> ...] [options]
Options:
--debug Debug mode
--help Show help message and exit
The utility performs the following checks on each MARC record:
- The record contains data.
- The record Leader has the correct length, namely 24 ASCII characters.
- The record length specified in the Leader, positions 00-04, matches the observed length of the record.
- This number should consist of five digits, and be equal to the length of the entire record, including itself and the record terminator.
- The number is right justified and unused positions contain zeros.
- The record must end with an end-of-record character (hex 1D).
- The record must not contain an end-of-record character (hex 1D) in any other position.
- The base address specified in the Leader, positions 12-16, does not exceed the size of the record.
- This number should consist of five digits, and be equal to the first character position of the first variable field in the record.
- The number is right justified and unused positions contain zeros.
- The length of the Directory is a multiple of 12.
- The Directory must be followed by an end-of-field character (hex 1E).
- The end-of-field character is not counted when calculating the length of the Directory.
- Each entry in the Directory corresponds to a single field.
- This field must end with an end-of-field character (hex 1E).
- This field must not contain an end-of-field character (hex 1E) in any other position.
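The checks listed above can be sketched against the raw bytes of a single record. This is an illustration written for this README under stated assumptions, not the marc_check implementation: the function names and error messages are hypothetical, and the record is assumed to have been separated from the file already.

```python
END_OF_FIELD = b"\x1e"
END_OF_RECORD = b"\x1d"
LEADER_LENGTH = 24
ENTRY_LENGTH = 12


def check_record(record: bytes) -> list:
    """Return a list of structural problems found in one raw MARC record."""
    if not record:
        return ["Record contains no data"]
    if len(record) < LEADER_LENGTH:
        return ["Leader is shorter than 24 bytes"]
    errors = []
    leader = record[:LEADER_LENGTH]
    if any(byte > 127 for byte in leader):
        errors.append("Leader contains non-ASCII bytes")
    # Leader positions 00-04: five-digit record length.
    if not leader[0:5].isdigit() or int(leader[0:5]) != len(record):
        errors.append("Record length does not match Leader positions 00-04")
    if not record.endswith(END_OF_RECORD):
        errors.append("Record does not end with end-of-record character")
    if END_OF_RECORD in record[:-1]:
        errors.append("Record contains unexpected end-of-record character")
    # Leader positions 12-16: five-digit base address of data.
    if not leader[12:17].isdigit():
        return errors + ["Base address is not a five-digit number"]
    base = int(leader[12:17])
    if base <= LEADER_LENGTH or base > len(record):
        return errors + ["Base address is outside the record"]
    # The Directory sits between the Leader and its own end-of-field character.
    directory = record[LEADER_LENGTH:base - 1]
    if record[base - 1:base] != END_OF_FIELD:
        errors.append("Directory does not end with end-of-field character")
    if len(directory) % ENTRY_LENGTH != 0:
        return errors + ["Directory length is not a multiple of 12"]
    for offset in range(0, len(directory), ENTRY_LENGTH):
        entry = directory[offset:offset + ENTRY_LENGTH]
        tag = entry[:3].decode("ascii", "replace")
        if not entry[3:].isdigit():
            errors.append(f"Directory entry for tag {tag} is malformed")
            continue
        # Each entry is tag (3), field length (4), starting position (5).
        length, start = int(entry[3:7]), int(entry[7:12])
        field = record[base + start:base + start + length]
        if not field.endswith(END_OF_FIELD):
            errors.append(f"Field {tag} does not end with end-of-field character")
        elif END_OF_FIELD in field[:-1]:
            errors.append(f"Field {tag} contains unexpected end-of-field character")
    return errors


def build_minimal_record() -> bytes:
    """Build a tiny well-formed record containing only field 001, for testing."""
    field = b"12345" + END_OF_FIELD
    directory = b"001" + b"0006" + b"00000" + END_OF_FIELD
    base = LEADER_LENGTH + len(directory)
    total = base + len(field) + len(END_OF_RECORD)
    leader = b"%05d" % total + b"nam a22" + b"%05d" % base + b"   4500"
    return leader + directory + field + END_OF_RECORD


print(check_record(build_minimal_record()))  # []
```

Returning a list of messages rather than stopping at the first problem lets a caller log every flaw in a record; the sketch stops early only when a later check would be meaningless, for example when the base address cannot be read.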
<input_file> is the name of the input file, which must be a file of MARC 21 records.
Multiple input files may be listed. E.g.
marc_check -i file1.lex file2.lex file3.lex
Wildcard characters may be used. E.g.
marc_check -i file*.lex
This will check all the files with a .lex suffix in the current directory.
For each input file, the utility writes flawed records to a file named <input_file>_f.lex. Note that the structural problems with these records mean that it will not normally be possible to view this file in MarcView, MarcEdit, etc.
All records which are found to be structurally valid are written to a file named <input_file>_ok.lex.
A summary of errors found will be written to the standard log file.
Test data for marc_check is provided in the folder test_data/marc_check. This folder contains two .lex files:
- test_clean.lex
- test_with_errors.lex
test_clean.lex contains 556 records, none of which have structural flaws. Output when marc_check is run on this file should look as follows:
Checking file test_clean.lex
File test_clean.lex contains 0 flawed records
100% [556 records] processed
test_with_errors.lex contains the same 556 records, but with deliberate structural flaws. Output when marc_check is run on this file should look as follows:
Checking file test_with_errors.lex
Error at record 2: Record length does not match length specified in first 5 bytes of record: specified length 1880; observed 1877
Error at record 3: Record length does not match length specified in first 5 bytes of record: specified length 1740; observed 1744
Error at record 4: Directory is invalid: length 966 is not a multiple of 12
Error at record 9: Base address exceeds size of record: base address 99589; observed record length 2137
Error at record 15: Directory does not end with end-of-field character (b'\x1e')
Error at record 22: Field 47 with tag CAT does not end with end-of-field character (b'\x1e')
Error at record 23: Field 1 with tag 001 does not end with end-of-field character (b'\x1e')
Error at record 29: Field 9 with tag 245 does not end with end-of-field character (b'\x1e')
Error at record 30: Field 16 with tag 504 contains unexpected end-of-field character (b'\x1e')
File test_with_errors.lex contains 9 flawed records
100% [556 records] processed
marc_count is a utility which counts the number of records present within one or more file(s) of MARC records.
Usage: marc_count -i <input_file> [<input_file> ...] [options]
Options:
--debug Debug mode
--help Show help message and exit
<input_file> is the name of the input file, which must be a file of MARC 21 records.
Multiple input files may be listed. E.g.
marc_count -i file1.lex file2.lex file3.lex
Wildcard characters may be used. E.g.
marc_count -i file*.lex
This will count the records in all files with a .lex suffix in the current directory, and output the number of records per file as well as a total for all files.
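One simple way to count records is to count end-of-record characters (hex 1D), reading the file in chunks so that large files are not loaded whole. The README does not say how marc_count counts, so the sketch below is an assumption; the function name and chunk size are hypothetical.

```python
import io

END_OF_RECORD = b"\x1d"


def count_records(stream, chunk_size=65536):
    """Count end-of-record characters in a binary stream."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += chunk.count(END_OF_RECORD)


# Two dummy "records" in an in-memory stream stand in for a .lex file.
print(count_records(io.BytesIO(b"abc\x1dxyz\x1d")))  # 2
```

This approach trusts the terminators, so it is best applied to files that have already passed marc_check.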