A collection of scripts to help with various digital archiving tasks.
Contains various scripts for ad-hoc tasks that may or may not be repeated in the future.
Contains scripts relating to browsertrix-crawler.
Contains a script that reformats the JSON response from the Internet Archive's CDX API and removes duplicates more thoroughly. Outputs to a .txt file.
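As a rough illustration of the kind of reformatting and duplicate removal described above, here is a minimal sketch. It assumes the CDX response was requested with `output=json`, which returns a list of rows whose first row is the header. The function names and the normalisation rule (ignore the scheme, a leading `www.`, and a trailing slash) are hypothetical and are not taken from the script in this repository.

```python
def cdx_rows_to_urls(rows):
    """Return unique original URLs from a parsed CDX JSON response.

    `rows` is the decoded JSON: a list of lists with the header row first.
    """
    if not rows:
        return []
    header, *records = rows
    url_index = header.index("original")

    seen = set()
    urls = []
    for record in records:
        url = record[url_index]
        # Assumed normalisation: treat http/https, "www." and a trailing
        # slash as insignificant when deciding whether two URLs match.
        key = url.split("://", 1)[-1]
        if key.startswith("www."):
            key = key[4:]
        key = key.rstrip("/")
        if key not in seen:
            seen.add(key)
            urls.append(url)  # keep the first spelling encountered
    return urls


def write_urls(urls, path):
    """Write one URL per line to a plain text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

Calling `write_urls(cdx_rows_to_urls(rows), "urls.txt")` would then produce the plain-text list. A different normalisation rule may suit other collections better, for example one that also ignores query strings.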
Contains scripts to partially automate the production of OPEX XML files for use with Preservica.
Contains various scripts that utilise Preservica's API using pyPreservica.
Uses Semaphore's CLSClient to auto-classify documents and sort them by topic score.
Contains a script to produce a plain list of URLs from an XML sitemap (outputs to .txt, .html, or terminal).
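The sitemap-to-URL-list step described above could be sketched roughly as follows, using only the standard library. The function name `sitemap_urls` is hypothetical, and the sketch assumes every `<loc>` element in the document holds a URL of interest, whatever its namespace. It is not the script from this repository.

```python
import xml.etree.ElementTree as ET


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

Printing the result with `print("\n".join(urls))` gives terminal output, and writing the same string to a file gives the .txt variant. Note that a sitemap index file also uses `<loc>` for links to child sitemaps, so those would appear in the list as well.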
A script that reads a folder of WARC files and cross-references their content against a list of URLs. It also uses Beautiful Soup (BS4) to search the HTML content for specific elements.