This repo (and the dependency pywb) has been forked to add support for uploading archives and indexes to S3. You will need to set up AWS credentials locally and mount them into the container; note that the AWS_PROFILE and AWS_S3_BUCKET environment variables must be set, as in the example below (a sample credentials file is sketched after the command).
docker run \
--volume $(pwd)/crawls:/crawls \
--volume '/path/to/.aws:/root/.aws:ro' \
--env AWS_PROFILE=myprofile \
--env AWS_S3_BUCKET=mybucket \
--cap-add=SYS_ADMIN \
--cap-add=NET_ADMIN \
--shm-size=1g \
--rm \
--interactive \
--tty \
browsertrix-crawler \
crawl \
--url https://webrecorder.net/ \
--generateCDX \
--collection $(uuid -v4) \
--limit 1
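The mounted .aws directory should contain standard AWS config files. As a minimal sketch, a credentials file along these lines would match the AWS_PROFILE used above (the profile name and key values are placeholders to be replaced with your own):
# ~/.aws/credentials (placeholder values shown)
[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY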
Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container. It is designed as part of a more streamlined replacement of the original Browsertrix.
The original Browsertrix requires managing multiple containers and may be too complex for situations where only a single crawl is needed.
This is an attempt to refactor Browsertrix into a core crawling system, driven by puppeteer-cluster and puppeteer.
Thus far, Browsertrix Crawler supports:
- Single-container, browser-based crawling with multiple headless/headful browsers
- Support for some behaviors: autoplay to capture video/audio, scrolling
- Support for direct capture for non-HTML resources
- Extensible driver script for customizing behavior per crawl or page via Puppeteer (see the sketch after this list)
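For example, a crawl that enables autoscroll and supplies a custom driver might look like the following sketch. The URL is a placeholder, and myDriver.js is a hypothetical script you would place in ./crawls on the host so it is visible at /crawls/myDriver.js inside the container:
# illustrative only: URL and driver path are placeholders
docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it \
  webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --scroll --driver /crawls/myDriver.js \
  --generateCDX --collection behaviors-demo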
The Docker container provided here packages up several components used in Browsertrix.
The system uses:
- oldwebtoday/chrome - to install a recent version of Chrome (currently chrome:84)
- puppeteer-cluster - for running Chrome browsers in parallel
- pywb - in recording mode for capturing the content
The crawl produces a single pywb collection, at /crawls/collections/<collection name> in the Docker container. To access the contents of the crawl, the /crawls directory in the container should be mounted to a volume (this is the default in the Docker Compose setup).
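After a crawl, the collection can then be inspected from the host through the mounted volume. A quick check might look like this (assuming the usual pywb collection layout of archive/ for WARC files and indexes/ for CDXJ indexes; the collection name is a placeholder):
# list captured WARCs and generated indexes for a collection named wr-net
ls ./crawls/collections/wr-net/archive
ls ./crawls/collections/wr-net/indexes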
The image currently accepts the following parameters:
browsertrix-crawler [options]
Options:
--help Show help [boolean]
--version Show version number [boolean]
-u, --url The URL to start crawling from [string] [required]
-w, --workers The number of workers to run in parallel
[number] [default: 1]
--newContext The context for each new capture, can be a new: page,
session or browser. [string] [default: "page"]
--waitUntil Puppeteer page.goto() condition to wait for before
continuing [default: "load"]
--limit Limit crawl to this number of pages [number] [default: 0]
--timeout Timeout for each page to load (in seconds)
[number] [default: 90]
--scope Regex of page URLs that should be included in the crawl
(defaults to the immediate directory of URL)
--exclude Regex of page URLs that should be excluded from the crawl.
--scroll If set, will autoscroll to bottom of the page
[boolean] [default: false]
-c, --collection Collection name to crawl to (replay will be accessible
under this name in pywb preview)
[string] [default: "capture"]
--headless Run in headless mode, otherwise start xvfb
[boolean] [default: false]
--driver JS driver for the crawler
[string] [default: "/app/defaultDriver.js"]
--generateCDX If set, generate index (CDXJ) for use with pywb after crawl
is done [boolean] [default: false]
--cwd Crawl working directory for captures (pywb root). If not
set, defaults to process.cwd [string] [default: "/crawls"]
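For example, the --scope and --exclude regexes can be combined to keep a crawl within one section of a site while skipping unwanted URLs. The site and patterns below are illustrative only:
# illustrative only: adjust the URL, scope, and exclude patterns for your site
docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it \
  webrecorder/browsertrix-crawler crawl \
  --url https://example.com/docs/ \
  --scope 'https://example.com/docs/.*' \
  --exclude '.*\.pdf$' \
  --generateCDX --collection docs-crawl --workers 2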
For the --waitUntil flag, see the Puppeteer page.goto() waitUntil options. The default is load, but for static sites, --waitUntil domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load), while --waitUntil networkidle0 may make sense for dynamic sites.
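For example, a crawl of a largely static site might switch to domcontentloaded (the URL and collection name below are placeholders):
# illustrative only: domcontentloaded avoids waiting for late-loading resources
docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it \
  webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --waitUntil domcontentloaded --generateCDX --collection static-site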
The Docker Compose file can simplify building and running a crawl, and provides the settings otherwise required for docker run, such as mounting a volume.
For example, the following commands demonstrate building the image and running a simple crawl with 2 workers:
docker-compose build
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2
In this example, the crawl data is written to ./crawls/collections/wr-net by default.
While the crawl is running, its status (provided by puppeteer-cluster monitoring) is printed to the Docker log.
When done, you can use the same browsertrix-crawler image to start a local pywb instance and preview the crawl:
docker run -it -v $(pwd)/crawls:/crawls -p 8080:8080 webrecorder/browsertrix-crawler pywb
Then, loading http://localhost:8080/wr-net/https://webrecorder.net/ should show the recently crawled copy of the https://webrecorder.net/ site.
Browsertrix Crawler can also be run directly with docker run, but this requires a few more options. In particular, the --cap-add and --shm-size flags are needed to run Chrome in Docker.
docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --workers 2
Initial support for development of Browsertrix Crawler was provided by Kiwix.
Initial functionality for Browsertrix Crawler was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.