- LibreOffice 24.8 is used by default if available to fix false negatives with some MS Office files
- LxmlScraper will now consider XML files with US-ASCII encoding declaration valid if --charset=UTF-8 parameter was used
- Improve scraper scrape-file help text
- Add GHOSTSCRIPT_PATH configuration field to run externally packaged Ghostscript by default
- RPM package now supports Ghostscript 10.03.1, which fixes some PDF files being erroneously detected as invalid
- Fix verapdf in $PATH always overriding VERAPDF_PATH configuration value
- Filter out XML-incompatible characters from scraper output.
- Check that av streams inside containers are supported with the specific container.
- Improve processing for very large XML files
- Update Epub version support from 3.2 to 3.3
- Identify some CSV files as text/csv instead of application/csv
- Fix crash due to incorrectly detecting a ZIP file and attempting to parse it
- Process certain text based Windows configuration files, identified by Magic, as plain text
- Support ImageMagick versions newer than v6.9.12.88
- Loosen SEG-Y detection requirements when SEG-Y version declaration is missing
- SEG-Y headers with left-padded card numbers are allowed
- SEG-Y headers with card markers without numbers are allowed
- SEG-Y header with "C40 EOF." as the header EOF marker is allowed
- Replace VerapdfDetector with ExifToolDetector for detecting PDF/A files.
- Add Ghostscript's stdout to errors for invalid PDF files.
- Remove missing system path warning when importing file-magic/libmagic.
- Detect format version of ODF files correctly
- Loosen SEG-Y detection requirements: empty SEG-Y header is now allowed.
- Add installation instructions for AlmaLinux 9 using RPM packages
- Add support for h265 (HEVC) video streams.
- Update the following mimetypes:
  - audio/mp4 to audio/aac for AAC streams
  - video/mp4 to video/h264 for AVC streams.
- Fix a bug causing PDF files with warnings (but not severe errors) to be detected as not well-formed.
- Add support for JP2 files.
- Support Apple M4A AAC files
- The RPM package conflicts with ffmpeg-free, because the ffmpeg-free package does not have all the codecs file-scraper needs
- Modernise Python source code with pyupgrade, along with some manual cleanups
- Fix a bug related to ffmpeg that caused validation to fail with some video files.
- Remove some Python 2 remainders from the code.
- Add config file for executable paths
- JSON files are now detected as plain text
- Add note to dummy_scraper.py on formatVersion not being supported with mimetype text/plain
- Change well-formedness results of the following scrapers, because they do not validate:
  - ExifTool Scraper
  - Magic Scraper
  - Textfile Scraper: TextfileScraper and TextEncodingMetaScraper
- Change well-formedness result of Wand Scraper, because it does not validate.
- Increase stack size for Schematron compilation.
- File magic version fix for CentOS7 installation.
- Add RHEL9 compatibility.
- Change well-formedness result of PIL Scraper, because it does not validate.
- Update info message regarding PDF files.
- Fix python2 warc-tools requirement in python3 spec file.
- Add grade for DPX version 1.0.
- Differentiate MPEG-1 PS and MPEG-2 PS containers.
- Add support for multi-frame TIFF/PNG images.
- Add SEG-Y file format detection and grade it as bit-level file format.
- Python 2.7 support officially removed.
- Fix WMA and WMV file data rate detection.
- Changed grading according to version 1.11.0 of DPS File Formats specifications.
- Fix wrong script paths.
- Add missing return code handling to multiple scrapers.
- Fix color detection for specific WMV files.
- Add support for SIARD file format.
- Add support for WMA and WMV file formats.
- Fix issue where FFmpeg was run even though file format well-formed check was skipped.
- Add support for AIFF file format.
- Add support for DNG file format versions 1.1 and 1.2.
- Pin file-magic version 0.4.0 or less since newer version requires a newer libmagic than CentOS 7 ships by default.
- Make scraper functional with veraPDF older than 1.18. In older versions, the `.pdf` file extension is required for PDF files.
- Fix veraPDF command similar to JHOVE command.
- Handle possible errors found in file format detection properly.
- Allow wand to deliver EXIF version as ASCII codes or plain text.
- Add test case for file-5.30 recursion bug
- Improve LxmlScraper's error handling.
- Fix scraper not being able to scrape PDF files that do not have the `.pdf` file extension. This requires veraPDF 1.18 or newer.
- Update installation guide for Python 3.6 in README.rst.
- Add DNG file format support.
- Fix DV file format detection.
- Update requirements in setup file.
- Add MPEG-4 version 2 (ISO/IEC 14496-14) video container support.
- Add support for JHove 1.24.1.
- Fix bug in quicktime identification.
- Add EPUB support to file scraper.
- Fix bug caused by wand trying to UTF-8 decode latin-1 Exif field values. WandScraper will not try to handle Exif field values that it does not use.
- Changed grading according to version 1.10.0 of DPS File Formats specifications
- Changed the name `ContainerGrader` to a more precise `ContainerStreamsGrader`
- Added quote character support for CSV files.
- Update version number in file_scraper/__init__.py
- Fix bug in detecting missing files when mimetype option was given
- Use LibreOffice 7.2 to scrape MS Office formats. This fixes stuck processes with certain MS Excel files.
- Minor fix in e2e tests.
- Changes in PDF scraping:
  - Both JHove and Ghostscript are now run for all PDF files, but the scraping results are ignored if the file is not supported by the tool.
  - Added PDF root version reporting to JHove scraper output
- Select Python 2/3 version of dpx-validator depending on the current environment.
- Added grades for files into the scraper output. The grade defines whether a file is recommended or suitable for digital preservation.
- Well-formed result is unknown for non-supported file or stream formats.
- MIME type is (usually) given even if there is no scraper implementation.
- Added ProRes grading as bit-level format with recommended format.
- Added video/avi support.
- Unknown text encodings are processed without failing
- Forbidden characters set is expanded for ISO-8859-15 charsets
- Better handling of local XML schema file paths
- Fix PDF version detection
- Remove ARC file format support
- Update PRONOM codes for file formats
- Handle conflicts between scraper results in a new scraper
- Update MS Office version handling
- Build el7 python3 rpms
- Fix scraper CLI in python3
- Filter out unicode normalization warnings
- Fix illegal control characters being printed in scraper error messages
- Minor fixes related to schema cleanup
- Fix accidental set-type value
- Build el8 rpms
- Fix Fido caching bug
- Support for JPEG/EXIF files with older file magic library, tested with 5.11
- Support validation of XML files with relative path to local schemas
- Increase maximum CSV field size
- Fix colorspace value handling and add support for ICC profile name
- Remove JPEG2000 from AVI and AVC/AAC from MPEG-1/2 PS to meet the current specifications
- Support newer version of veraPDF
- FLAC stream support for Matroska videos added
- MIME type update for LPCM streams
- Wand memory leaking issues fixed
- Filter unnecessary v.Nu warnings related to HTML5 validation
- Distinguish JP2 and JPX files
- Add command-line interface
- Add key to info dict to contain used tools in scraping
- Minor bugfix related to unavailable file format version
- Raise maximum image size for PIL
- Add support for images with grayscale+alpha channels
- Tests have been updated to match the changed Wand and ImageMagick error messages.
- Exif version is extracted from JPEG metadata using the Python Wand module. JFIF version is extracted with file-scraper's magiclib module. The Exif version for a JPEG file consists of four bytes of ASCII values representing e.g. '0221', which is interpreted as 2.2.1, conforming to the Finnish national digital preservation service specification for file formats.
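
The Exif version interpretation described in the last entry can be sketched roughly as follows. This is a minimal illustration, not file-scraper's actual implementation: the function name `format_exif_version` is hypothetical, and the split into a two-digit major part followed by two single-digit parts is an assumption inferred from the single '0221' → 2.2.1 example above.

```python
def format_exif_version(raw):
    """Convert a four-character Exif version field to dotted notation.

    Hypothetical helper: assumes the first two characters form the major
    version and the last two are single-digit minor and patch parts.
    """
    # Accept the field either as raw ASCII bytes or as already decoded text.
    if isinstance(raw, bytes):
        raw = raw.decode("ascii")

    # The field must be exactly four ASCII digits.
    if len(raw) != 4 or any(char not in "0123456789" for char in raw):
        raise ValueError(f"Unexpected Exif version field: {raw!r}")

    major = int(raw[:2])  # "02" -> 2
    minor = int(raw[2])   # "2"  -> 2
    patch = int(raw[3])   # "1"  -> 1
    return f"{major}.{minor}.{patch}"


if __name__ == "__main__":
    print(format_exif_version("0221"))
```

Under these assumptions the helper accepts both `bytes` and `str` input and raises `ValueError` for anything that is not four ASCII digits, rather than guessing at a version.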