Documentation and scripts to launch an OpenWPM crawl on a Kubernetes cluster on GCP GKE.
- Access to GCP and the ability to provision resources in a GCP project
- Google Cloud SDK installed locally
  - This will allow us to provision resources from the CLI
- Docker
  - We will use this to build the OpenWPM Docker container
- A GCP project set up, referred to below as $PROJECT
  - Visit the GCP Kubernetes Engine API page to enable the API
  - You may need to set the billing account
For the remainder of these instructions, you are assumed to be in the deployment/gcp/ folder, and you should have the following env vars set to the GCP project you're using as well as a prefix to identify your resources within that project (e.g., your username):
export PROJECT="foo-sandbox"
export CRAWL_PREFIX="foo"
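Since every later command depends on these two variables, it can be convenient to sanity-check them before provisioning anything. The check_env helper below is not part of the repository — it is a minimal sketch of such a guard:

```shell
# Hypothetical helper (not part of the repo): fail fast if the
# required environment variables are unset or empty.
check_env() {
  for v in PROJECT CRAWL_PREFIX; do
    eval "val=\${$v:-}"
    if [ -z "$val" ]; then
      echo "error: $v is not set" >&2
      return 1
    fi
  done
  echo "env ok"
}

export PROJECT="foo-sandbox"
export CRAWL_PREFIX="foo"
check_env
```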
Run:
gcloud auth login
to authenticate with GCP. Then run:
gcloud config set project $PROJECT
to set the default project to the one that was created, and:
gcloud config set compute/zone us-central1-f
to set the default zone in which you want resources to be provisioned.
- See GCP Regions for the current list of regions and zones.
Finally, install kubectl:
gcloud components install kubectl
The following command will create a zonal GKE cluster with n1-highcpu-16 nodes ($0.5672/node/h) with IP-Alias enabled (makes it a bit easier to connect to managed Redis instances from the cluster).
You may want to adjust fields within ./start_gke_cluster.sh
where appropriate such as:
- num-nodes, min-nodes, max-nodes (for a large crawl you may want up to 15 nodes; note that this is different from the number of pods, which is specified by the parallelism field in crawl.yaml - one node can host multiple pods)
- machine-type
- See the GKE Quickstart guide and cluster create documentation.
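To relate the node count to the pod parallelism, a back-of-envelope calculation can help. The vCPU figures below are illustrative assumptions, not measured requirements:

```shell
# Rough sizing sketch (assumed numbers): an n1-highcpu-16 node offers
# 16 vCPUs; if each browser pod is budgeted ~2 vCPUs, the node count
# needed for a given parallelism is ceil(pods / pods-per-node).
parallelism=60       # spec.parallelism in crawl.yaml
vcpus_per_node=16    # n1-highcpu-16
vcpus_per_pod=2      # assumed per-pod budget
pods_per_node=$(( vcpus_per_node / vcpus_per_pod ))
nodes=$(( (parallelism + pods_per_node - 1) / pods_per_node ))
echo "$nodes nodes for $parallelism pods"
```

In practice the Kubernetes scheduler packs pods onto nodes based on the resource requests in crawl.yaml, so this is only a starting point for choosing max-nodes.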
./start_gke_cluster.sh $CRAWL_PREFIX-cluster
Note: For testing, you can use preemptible nodes ($0.1200/node/h) instead:
./start_gke_cluster.sh $CRAWL_PREFIX-cluster --preemptible
gcloud container clusters get-credentials $CRAWL_PREFIX-cluster
This allows subsequent kubectl commands to interact with our cluster (using the context gke_{PROJECT}_{ZONE}_{CLUSTER_NAME}).
Set the Sentry DSN as a kubectl secret (change foo below):
kubectl create secret generic sentry-config \
--from-literal=sentry_dsn=foo
To run crawls without Sentry, remove the following from the crawl config after it has been generated below:
- name: SENTRY_DSN
valueFrom:
secretKeyRef:
name: sentry-config
key: sentry_dsn
If the pre-built OpenWPM Docker images are not sufficient, build and push your own:
cd path/to/OpenWPM
docker build -t gcr.io/$PROJECT/$CRAWL_PREFIX-openwpm .
cd -
gcloud auth configure-docker
docker push gcr.io/$PROJECT/$CRAWL_PREFIX-openwpm
Remember to change crawl.yaml to point to image: gcr.io/$PROJECT/$CRAWL_PREFIX-openwpm.
Launch a 1GB Basic tier Google Cloud Memorystore for Redis instance ($0.049/GB/hour):
gcloud redis instances create $CRAWL_PREFIX-redis --size=1 --region=us-central1 --redis-version=redis_4_0
Launch a temporary redis-box pod deployed to the cluster which we use to interact with the above Redis instance:
kubectl apply -f redis-box.yaml
Use the host IP from the output of:
gcloud redis instances describe $CRAWL_PREFIX-redis --region=us-central1
... to set the corresponding env var (adjust the IP to match your instance):
export REDIS_HOST=10.0.0.3
(See https://cloud.google.com/memorystore/docs/redis/connecting-redis-instance for more information.)
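Rather than copying the IP by hand, the host field can be extracted from the describe output (gcloud also supports a --format='value(host)' projection for this). The sketch below parses a simulated output, since the real command requires GCP credentials:

```shell
# Simulated `gcloud redis instances describe` output; with credentials,
# the equivalent one-liner would be:
#   export REDIS_HOST=$(gcloud redis instances describe $CRAWL_PREFIX-redis \
#     --region=us-central1 --format='value(host)')
describe_output='host: 10.0.0.3
name: projects/foo-sandbox/locations/us-central1/instances/foo-redis
port: 6379'
REDIS_HOST=$(printf '%s\n' "$describe_output" | awk '/^host:/ {print $2}')
export REDIS_HOST
echo "$REDIS_HOST"
```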
Create a comma-separated site list as per:
echo "1,http://www.example.com
2,http://www.example.org
3,http://www.princeton.edu
4,http://citp.princeton.edu/?foo='bar" > site_list.csv
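Equivalently, the rank,url format can be generated from a plain list of URLs, for example:

```shell
# Sketch: number a plain list of URLs into the rank,url CSV format
# expected by load_site_list_into_redis.sh (NR is the line number).
printf '%s\n' \
  'http://www.example.com' \
  'http://www.example.org' \
  'http://www.princeton.edu' \
  | awk '{ printf "%d,%s\n", NR, $0 }' > site_list.csv
cat site_list.csv
```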
../load_site_list_into_redis.sh crawl-queue site_list.csv
(Optional) To load Alexa Top 1M into redis:
cd ..; ./load_alexa_top_1m_site_list_into_redis.sh crawl-queue; cd -
You can also specify a max rank to load into the queue. For example, to add the top 1000 sites from the Alexa Top 1M list:
cd ..; ./load_alexa_top_1m_site_list_into_redis.sh crawl-queue 1000; cd -
(Optional) Use some of the ../../utilities/crawl_utils.py code. For instance, to fetch and store a sample of Alexa Top 1M to /tmp/sampled_sites.json:
source ../../venv/bin/activate
cd ../../; python -m utilities.get_sampled_sites; cd -
Since each crawl is unique, you need to configure your crawl.yaml deployment configuration. We have provided a template to start from:
envsubst < ./crawl.tmpl.yaml > crawl.yaml
The envsubst invocation above has already replaced $REDIS_HOST with the value of the env var set previously, but you may still want to adapt crawl.yaml:
- spec.parallelism (the number of pods that can run concurrently - many pods can fit on one node; Kubernetes manages this based on the resources used by the pods on each node)
- spec.containers.image
- spec.containers.env
- spec.containers.resources
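As an illustration, the adjustable parts of a generated crawl.yaml could look like the following. The container name, image, env value, and resource figures here are placeholders to adapt, not recommendations from the template:

```yaml
spec:
  parallelism: 60            # number of pods; Kubernetes schedules them across nodes
  template:
    spec:
      containers:
        - name: openwpm-crawl
          image: gcr.io/foo-sandbox/foo-openwpm   # placeholder image
          env:
            - name: CRAWL_DIRECTORY
              value: 2020-01-01_example_crawl     # placeholder
          resources:
            requests:
              cpu: "1"       # placeholder figures
              memory: 2Gi
```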
Note: A useful naming convention for CRAWL_DIRECTORY is YYYY-MM-DD_description_of_the_crawl.
Some nodes, including the master node, can become temporarily unavailable during cluster auto-scaling operations. When large new crawls are started, this can cause disruptions for a couple of minutes after the crawl has started.
To avoid this, set the number of nodes (to, say, 15) before starting the crawl:
gcloud container clusters resize $CRAWL_PREFIX-cluster --num-nodes=15
When you are ready, deploy the crawl:
kubectl create -f crawl.yaml
Note that for the remainder of these instructions, metadata.name is assumed to be set to openwpm-crawl.
Launch redis-cli:
kubectl exec -it redis-box -- sh -c "redis-cli -h $REDIS_HOST"
Current length of the queue:
llen crawl-queue
Number of queue items marked as processing:
llen crawl-queue:processing
Contents of the queue:
lrange crawl-queue 0 -1
Check out the GCP GKE Console
Also:
watch kubectl top nodes
watch kubectl top pods --selector=job-name=openwpm-crawl
watch kubectl get pods --selector=job-name=openwpm-crawl
(Optional) To see a more detailed summary of the job as it executes or after it has finished:
kubectl describe job openwpm-crawl
- Visit the GCP Logging Console
- Select GKE Container
(Optional) You can also spin up the Kubernetes Dashboard UI as per these instructions which will allow for easy access to status and logs related to running jobs/crawls.
The crawl data will end up in Parquet format in the S3 bucket that you configured.
If you can't remember which $CRAWL_PREFIX you specified to start the crawl, you can check the currently running clusters using:
gcloud container clusters list
You can check the currently running redis instances using:
gcloud redis instances list --region=us-central1
Be sure that you don't kill clusters or redis instances used by other users of your GCP project (if any).
kubectl delete -f crawl.yaml
gcloud redis instances delete $CRAWL_PREFIX-redis --region=us-central1
kubectl delete -f redis-box.yaml
While the cluster has auto-scaling enabled, and thus should scale down when not in use, it can sometimes be slow to do so or fail to scale down adequately. In such cases, it is a good idea to manually set the number of nodes to 0 or 1:
gcloud container clusters resize $CRAWL_PREFIX-cluster --num-nodes=1
It will still auto-scale up when the next crawl is executed.
If crawls are not going to be run and the cluster does not need to be accessed within the next few hours or days, it is safest to delete the cluster:
gcloud container clusters delete $CRAWL_PREFIX-cluster
In case of any unexpected issues, rinse (clean up) and repeat. If the problems remain, file an issue against https://github.com/openwpm/openwpm-crawler.