CHANGES

v0.5.0

  • Scope: Support for scopeType: domain, which includes all subdomains and ignores a leading 'www.' if specified in the seed.
  • Profiles: Support loading a remote profile from a URL as well as from a local file
  • Non-HTML Pages: Load non-200 responses in the browser, even if non-HTML; fix waiting issues with non-HTML pages (e.g. PDFs)
  • Config options: Fix setting user-agent
  • Page behavior: Update to latest browsertrix-behaviors; add experimental Cloudflare interstitial wait.
  • Error handling: Better handling of Redis errors
  • State: Support loading of crawl state from config.yaml
  • State: Support serialization of crawl state to crawls subdirectory, both while running (keeping last N states) and on exit.
  • State: Graceful saving of crawl state on ctrl+c interrupt
  • State: Memory or Redis based crawl state
  • Config: Support additional options via CRAWL_ARGS environment variable
  • WACZ Upload: Support for S3 upload of WACZ upon crawl completion
  • WACZ Upload: HTTP/Redis webhook to notify of upload completion
  • Crawl Scope: Support for extraHops to optionally crawl an extra hop beyond scope
  • Signing: Support for optional signing of WACZ
  • Dependencies: update to latest pywb, wacz and browsertrix-behaviors packages
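
The domain scope rule described above can be illustrated with a small sketch. This is not the crawler's own code; the function names `strip_www` and `in_domain_scope` are hypothetical, and the matching logic is only one plausible reading of "include all subdomains and ignore 'www.' in the seed".

```python
from urllib.parse import urlsplit


def strip_www(host: str) -> str:
    """Drop a single leading 'www.' label, if present (hypothetical helper)."""
    return host[4:] if host.startswith("www.") else host


def in_domain_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the seed's domain or one of its subdomains.

    Assumption: the seed's 'www.' prefix is ignored, so a seed of
    www.example.com also covers example.com and blog.example.com.
    """
    seed_host = strip_www(urlsplit(seed_url).hostname or "")
    cand_host = urlsplit(candidate_url).hostname or ""
    if not seed_host or not cand_host:
        return False
    # The leading dot keeps 'notexample.com' from matching 'example.com'.
    return cand_host == seed_host or cand_host.endswith("." + seed_host)
```

Comparing on a dot-prefixed suffix rather than a plain substring is the important design choice here: it keeps unrelated hosts that merely end in the same characters out of scope.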

v0.4.4

  • Page Block Rules Fix: Avoid 'request already handled' errors by not adding duplicate handlers to the same page.
  • Page Block Rules Fix: Await all continue()/abort() calls and catch errors.
  • Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
  • Setup: Attempt to create the crawl working directory (cwd) specified via --cwd if it doesn't exist.
  • Scope Types: Rename 'none' -> 'page' (single page only) and 'page' -> 'page-spa' (page with hashtags).
  • README: Add more scope rule examples, clarify distinction between scope rules and block rules.
  • README: Update old type -> scopeType, list new scope types.
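
Because the old name 'page' is reused as a new name in the scope type rename above, configs written for earlier versions need their values translated in a single step rather than by successive find-and-replace. A minimal sketch follows; the helper `migrate_scope_type` is hypothetical and not part of the crawler.

```python
# Old scope type name -> new name, per the v0.4.4 rename.
SCOPE_TYPE_RENAMES = {"none": "page", "page": "page-spa"}


def migrate_scope_type(old_value: str) -> str:
    """Translate a pre-v0.4.4 scope type to its v0.4.4 name.

    A single dictionary lookup avoids the chained-rename trap:
    replacing 'none' -> 'page' and then 'page' -> 'page-spa' in sequence
    would wrongly turn 'none' into 'page-spa'.
    """
    return SCOPE_TYPE_RENAMES.get(old_value, old_value)
```

Values that were not renamed, such as 'prefix' or 'host', pass through unchanged.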

v0.4.3

  • BlockRules Fixes: When considering the 'inFrameUrl' for a navigation request for an iframe, use URL of parent frame.
  • BlockRules Fixes: Always allow pywb proxy scripts.
  • Logging: Improved debug logging for block rules (log blocked requests and conditional iframe requests) when 'debug' set in 'logging'

v0.4.2

  • Compose/docs: Build latest image by default, update README to refer to latest image
  • Fix typo in crawler.capturePrefix that resulted in directFetchCapture() always failing
  • Tests: Update all tests to use test-crawls directory
  • extractLinks() only extracts links from the default selectors, allowing a custom driver to filter the results
  • loadPage() accepts a list of selector options with selector, extract, and isAttribute settings for further customization of link extraction

v0.4.1

  • BlockRules Optimizations: don't intercept requests if no blockRules
  • Profile Creation: Support extending existing profile by passing a --profile param to load on startup
  • Profile Creation: Set default window size to 1600x900, add --windowSize param for setting custom size
  • Behavior Timeouts: Add --behaviorTimeout to specify custom timeout for behaviors, in seconds (defaulting to 90 seconds)
  • Load Wait Default: Switch to 'load,networkidle2' to speed up waiting for initial load
  • Multi-platform build: Support building for amd64 and Arm using oldwebtoday/chrome:91 images (check for google-chrome and chromium-browser automatically)
  • CI: Build a multi-platform (amd64 and arm64) image on each release

v0.4.0

  • YAML-based config, specifiable via --config property or via stdin (with '--config stdin')
  • Support for different scope types ('page', 'prefix', 'host', 'any', 'none') + crawl depth at crawl level
  • Per-seed scoping: scope type, depth, and include/exclude rules are configurable per seed in the 'seeds' list via YAML config
  • Support for 'blockRules' for blocking certain URLs from being stored in WARCs, with conditional blocking of iframes based on their contents and on iframe URLs (see README for more details)
  • Interactive profile creation: create profiles by interacting with an embedded browser displayed in the user's browser (see README for more details).
  • Screencasting: streaming the output of each window via websocket-based streaming, configurable with --screencastPort option
  • New 'window' based parallelization: Open each worker in new window in same session
  • Simplified custom driver config, default calls 'loadPage'
  • Refactor arg parsing, other auxiliary functions into separate utils files
  • Image customization: support for customizing browser image, eg. building with Chromium instead of Chrome, support for ARM architecture builds (see README for more details).
  • Update to latest pywb (2.5.0b4), browsertrix-behaviors (0.2.3), py-wacz (0.3.1)

v0.3.2

  • Added a --urlFile option: Allows users to specify a .txt file list of exact URLs to crawl (one URL per line).
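
The file format described above is simply one exact URL per line. As a rough sketch of how such a list can be read, the hypothetical helper below trims whitespace and skips blank lines; both behaviors are assumptions and are not confirmed by this changelog.

```python
def read_url_list(text: str) -> list[str]:
    """Return the URLs in a one-URL-per-line listing.

    Assumptions (not taken from the crawler): surrounding whitespace is
    trimmed and blank lines are ignored.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, `read_url_list(open("urls.txt").read())` would return the list of URLs in a local file named urls.txt (a placeholder name).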

v0.3.1

  • Improved shutdown wait: Instead of waiting for 5 secs, wait until all pending requests are written to WARCs
  • Bug fix: Use async APIs for combining WARCs to avoid spurious issues with multiple crawls
  • Behaviors: Update to Behaviors 0.2.1, with support for Facebook pages

v0.3.0

  • WARC Combining: --combineWARC and --rolloverSize flags for generating combined WARCs at the end of a crawl, each WARC up to the specified rolloverSize
  • Profiles: Support for creating reusable browser profiles, stored as tarballs, and running crawl with a login profile (see README for more info)
  • Behaviors: Switch to Browsertrix Behaviors v0.1.1 for in-page behaviors
  • Logging: Customizable logging options via --logging, including behavior log, behavior debug log, pywb log and crawl stats (default)
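
The rollover idea behind WARC combining can be sketched as a simple grouping pass: start a new combined file whenever adding the next WARC would push the current one past the limit. This illustrates the concept only, not the crawler's implementation; `plan_combined_warcs` is a hypothetical name, and the handling of a single oversized WARC is an assumption.

```python
def plan_combined_warcs(sizes: list[int], rollover_size: int) -> list[list[int]]:
    """Group per-page WARC sizes (in bytes) into combined files.

    Assumption: a new combined file is started when the next WARC would
    exceed rollover_size; a single WARC larger than the limit gets a
    combined file of its own.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sizes:
        # Roll over before adding, but never emit an empty group.
        if current and current_size + size > rollover_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```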