Run a high-fidelity browser-based crawler in a single Docker container


The Data Shed usage

This repo (and its dependency pywb) has been forked to add support for uploading archives and indexes to S3. You will need to set up AWS credentials locally and mount them into the container. Note that the AWS_PROFILE and AWS_S3_BUCKET environment variables must be set.

docker run \
    --volume $(pwd)/crawls:/crawls \
    --volume '/path/to/.aws:/root/.aws:ro' \
    --env AWS_PROFILE=myprofile \
    --env AWS_S3_BUCKET=mybucket \
    --cap-add=SYS_ADMIN \
    --cap-add=NET_ADMIN \
    --shm-size=1g \
    --rm \
    --interactive \
    --tty \
    browsertrix-crawler \
    crawl \
    --url https://webrecorder.net/ \
    --generateCDX \
    --collection $(uuid -v4) \
    --limit 1
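
The AWS_PROFILE value must name a profile that exists in the .aws directory mounted into the container (typically ~/.aws on the host, mounted read-only as /root/.aws in the example above). A minimal sketch of the standard AWS credentials file, assuming a profile called myprofile (the keys shown are placeholders):

[myprofile]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>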

Browsertrix Crawler

Browsertrix Crawler is a simplified, browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container. It is intended as part of a more streamlined replacement for the original Browsertrix.

The original Browsertrix may be too complex for situations where a single crawl is needed, and requires managing multiple containers.

This is an attempt to refactor Browsertrix into a core crawling system, driven by puppeteer-cluster and puppeteer.

Features

Thus far, Browsertrix Crawler supports:

  • Single-container, browser-based crawling with multiple headless/headful browsers
  • Support for some behaviors: autoplay to capture video/audio, scrolling
  • Support for direct capture for non-HTML resources
  • Extensible driver script for customizing behavior per crawl or page via Puppeteer

Architecture

The Docker container provided here packages up several components used in Browsertrix.

The system uses:

  • oldwebtoday/chrome - to install a recent version of Chrome (currently chrome:84)
  • puppeteer-cluster - for running Chrome browsers in parallel
  • pywb - in recording mode for capturing the content

The crawl produces a single pywb collection, at /crawls/collections/<collection name> in the Docker container.

To access the contents of the crawl, the /crawls directory in the container should be mounted to a volume (default in the Docker Compose setup).
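
For example, for a crawl run with --collection wr-net and --generateCDX, the mounted directory on the host should contain a standard pywb collection. A rough sketch of the expected layout (exact WARC file names will vary per crawl):

crawls/collections/wr-net/archive/   # WARC files written by pywb during capture
crawls/collections/wr-net/indexes/   # CDXJ index files generated by --generateCDX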

Crawling Parameters

The image currently accepts the following parameters:

browsertrix-crawler [options]

Options:
      --help         Show help                                         [boolean]
      --version      Show version number                               [boolean]
  -u, --url          The URL to start crawling from          [string] [required]
  -w, --workers      The number of workers to run in parallel
                                                           [number] [default: 1]
      --newContext   The context for each new capture, can be a new: page,
                     session or browser.              [string] [default: "page"]
      --waitUntil    Puppeteer page.goto() condition to wait for before
                     continuing                                [default: "load"]
      --limit        Limit crawl to this number of pages   [number] [default: 0]
      --timeout      Timeout for each page to load (in seconds)
                                                          [number] [default: 90]
      --scope        Regex of page URLs that should be included in the crawl
                     (defaults to the immediate directory of URL)
      --exclude      Regex of page URLs that should be excluded from the crawl.
      --scroll       If set, will autoscroll to bottom of the page
                                                      [boolean] [default: false]
  -c, --collection   Collection name to crawl to (replay will be accessible
                     under this name in pywb preview)
                                                   [string] [default: "capture"]
      --headless     Run in headless mode, otherwise start xvfb
                                                      [boolean] [default: false]
      --driver       JS driver for the crawler
                                     [string] [default: "/app/defaultDriver.js"]
      --generateCDX  If set, generate index (CDXJ) for use with pywb after crawl
                     is done                          [boolean] [default: false]
      --cwd          Crawl working directory for captures (pywb root). If not
                     set, defaults to process.cwd  [string] [default: "/crawls"]
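
For example, the --scope and --exclude options take regular expressions that are matched against discovered page URLs. The following sketch (using the Docker Compose setup described under Example Usage below, with hypothetical regexes and URLs) restricts a crawl to a site's blog section while skipping URLs containing a query string:

docker-compose run crawler crawl --url https://webrecorder.net/blog \
    --scope "https://webrecorder.net/blog.*" \
    --exclude ".*\?.*" \
    --generateCDX --collection wr-blog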

For the --waitUntil flag, see page.goto waitUntil options.

The default is load, but for static sites, --waitUntil domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load), while --waitUntil networkidle0 may make sense for dynamic sites.
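
For example, a crawl of a largely static site might use domcontentloaded (a sketch reusing the Docker Compose setup described below, with a placeholder URL):

docker-compose run crawler crawl --url https://example.com/ --waitUntil domcontentloaded --generateCDX --collection example-static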

Example Usage

With Docker-Compose

The Docker Compose file can simplify building and running a crawl, and includes some required settings for docker run, including mounting a volume.

For example, the following commands demonstrate building the image and running a simple crawl with 2 workers:

docker-compose build
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2

In this example, the crawl data is written to ./crawls/collections/wr-net by default.

While the crawl is running, the crawl status (provided by puppeteer-cluster monitoring) is printed to the Docker log.

When done, you can also use the browsertrix-crawler image to start a local pywb instance to preview the crawl:

docker run -it -v $(pwd)/crawls:/crawls -p 8080:8080 webrecorder/browsertrix-crawler pywb

Then, loading http://localhost:8080/wr-net/https://webrecorder.net/ should show the recently crawled copy of the https://webrecorder.net/ site.

With docker run

Browsertrix Crawler can of course also be run directly with docker run, but this requires a few more options.

In particular, the --cap-add and --shm-size flags are needed to run Chrome in Docker.

docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --workers 2

Support

Initial support for the development of Browsertrix Crawler was provided by Kiwix.

Initial functionality for Browsertrix Crawler was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.

License

AGPLv3 or later, see LICENSE for more details.
