A comprehensive collection of scripts and tools for digital archiving, web crawling, and file management tasks.
Contains scripts for managing and organizing digital files:
empty_folder_finder.py
: Recursively scans directories to identify completely empty folders, with support for exclusion patterns and detailed logging
delete_files_and_empty_folders_from_list.py
: Utility for removing files and empty folders based on a provided list
checksum_duplicate_finder.py
: Identifies duplicate files using checksum comparison, supporting multiple algorithms (SHA1, MD5, SHA256) and parallel processing
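As an illustration of the checksum-comparison idea, here is a minimal sketch, not the actual code in checksum_duplicate_finder.py. The function names `file_checksum` and `find_duplicates` are hypothetical, and the sketch runs serially rather than in parallel.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)  # accepts names such as "sha1", "md5", "sha256"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, algorithm="sha256"):
    """Group files under root by checksum and keep groups with more than one file.

    Hypothetical helper for illustration; the real script may differ.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_checksum(path, algorithm)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("/path/to/folder", "sha1")` would return a dictionary mapping each shared checksum to the list of files that have it. Grouping by file size before hashing is a common optimization that this sketch leaves out.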
Contains tools for web archiving and crawling:
web_archive_validator.py
: Validates web archive files and checks their integrity
extract_qa.py
: Extracts and processes quality assurance data from web archives
wget_log_reader.py
: Analyzes and processes wget download logs
crt-scraper.py
: Web scraping utility for certificate transparency logs
browsertrix-crawler files and scripts/
: Configuration and scripts for browsertrix-crawler
Contains utilities for working with sitemaps:
sitemap_monitor.py
: Monitors sitemap changes and updates
python_emailer.py
: Email notification system for sitemap changes
sitemap_xml_to_txt_or_html.py
: Converts XML sitemaps to plain text or HTML format
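The XML-to-text conversion can be sketched with the standard library alone. This is an assumption about the general approach, not the contents of sitemap_xml_to_txt_or_html.py. The names `sitemap_urls` and `sitemap_to_text` are hypothetical, and the sketch assumes the sitemap uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def sitemap_to_text(xml_text):
    """Render the sitemap as plain text, one URL per line (hypothetical helper)."""
    return "\n".join(sitemap_urls(xml_text))
```

Because `root.iter` walks the whole tree, the same function also collects the `<loc>` entries of a sitemap index file, which point to child sitemaps rather than pages.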
Contains scripts for interacting with Preservica's API:
a_get_metadata.py
: Retrieves and processes metadata from Preservica assets (including fixity values)
b_delete_metadata.py
: Removes metadata from Preservica assets
c_add_metadata_from_csv.py
: Bulk metadata addition from CSV files
d_update_xip_from_csv.py
: Updates XIP metadata from CSV data
download_preservica_assets.py
: Downloads assets from Preservica
Contains tools for handling video platform exports and processing
Contains scripts for downloading and processing content from the Internet Archive
Contains older versions and deprecated scripts