ScrapeBotR

Orchestrate Instances and Retrieve Data from a ScrapeBot Database

ScrapeBotR (with "R") allows you to easily retrieve (large amounts of) data from a ScrapeBot installation. The package provides easy-to-use functions to read and export instances, recipes, runs, log information, and data. It plugs neatly into the tidyverse, as it makes heavy use of tibbles.

The ScrapeBot (without "R") is a tool for so-called "agent-based testing" that automatically and regularly visits, modifies, and scrapes a defined set of webpages. It was built to automate various web-based tasks and keep track of them in a controllable way for academic research, primarily in the realm of computational social science.

Installation

Install the most recent development version using devtools:

devtools::install_github('MarHai/ScrapeBotR')
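
If devtools itself is not yet installed, get it from CRAN first:

install.packages('devtools')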

Usage

Import the installed version ...

library(ScrapeBotR)

... and start using it by defining your ScrapeBot database. Credentials to access your database need to be stored in an INI file somewhere in your computer's home directory (i.e., under ~/, which usually translates to /home/my_user under *nix or C:\Users\my_user\Documents under Windows). You can either create this file by hand or use ScrapeBotR's helper function to create it:

write_scrapebot_credentials(
  host = 'my_database_host',
  user = 'database_username',
  password = 'database_password'
)

Alternatively, you can create the INI file manually. Ideally, the file is located directly within your home directory and named .scrapebot.ini (the leading . hides it from most file browsers). The INI file is essentially just a raw-text file with a so-called section name and some key-value pairs, written without spaces around the equals sign. Any unnecessary settings can be omitted (e.g., the port number). Here is what the INI file could look like:

[a name for me to remember]
host=localhost
port=3307
user=my_personal_user
password=abcd3.45d:cba!
database=scrapebot

Once you have that out of the way, try connecting to your database, using the section name again (you can have multiple sections referring to multiple ScrapeBot installations):

connection <- connect_scrapebot('a name for me to remember')

If this does not yield an error, you are good to go. You could start, for example, by ... (a combined sketch follows this list)

  • listing the available recipes through get_recipes()
  • listing the available instances through get_instances()
  • getting information about specific runs through get_runs()
  • collecting data via get_run_data()
  • bulk-downloading and compressing screenshots from S3 via collect_screenshots_from_s3()
  • ...
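
Putting a few of these together, a typical retrieval workflow might look like the following sketch. It assumes that these functions take the connection object as their first argument and return tibbles; the column names recipe_uid and uid used for filtering are illustrative assumptions, not documented behavior, so consult the in-R help pages for the actual signatures.

library(ScrapeBotR)
library(dplyr)

connection <- connect_scrapebot('a name for me to remember')

# List what is available (assumed to return tibbles)
recipes <- get_recipes(connection)
runs <- get_runs(connection)

# Keep only the runs of the first recipe; `recipe_uid` and `uid` are
# assumed column names for illustration, not guaranteed by the package
recent_runs <- runs %>% filter(recipe_uid == recipes$uid[[1]])

# Fetch the data collected during those runs (assumed signature)
run_data <- get_run_data(connection, recent_runs$uid)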

Since version 0.5.0, you can also orchestrate servers on Amazon Web Services (AWS). For this, you first need an AWS account, to which any incurred costs will be charged. Next, generate an IAM user within your AWS account and create an API key for it. You also need an SSH key pair (in PEM format). Afterwards, analogous to the ScrapeBot database setup above, use the respective R functions to write your credentials into an INI file and connect to your AWS account:

write_aws_credentials(
  access_key_id = 'aws_access_key', 
  secret_access_key = 'aws_access_secret', 
  ssh_private_pem_file = 'path_to_ssh_private_pem_file', 
  ssh_public_pem_file = 'path_to_ssh_public_pem_file'
)
aws_connection <- connect_aws()

If this does not yield an error, you could ... (a combined sketch follows this list)

  • start an AWS RDS instance as a ScrapeBot database through aws_launch_database()
  • set up an AWS S3 bucket to store screenshots through aws_launch_storage()
  • run an EC2 instance as a ScrapeBot instance through aws_launch_instance()
  • store the connection object for later use via aws_save_connection()
  • restore (load) the connection object days, weeks, months, or studies later through aws_load_connection()
  • terminate all AWS instances through the respective aws_terminate_* functions
  • ...
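
A full setup might then be orchestrated as in the following minimal sketch. It assumes that the aws_launch_* functions take the AWS connection object as their first argument and return an updated connection, and that the save/load helpers accept a file path; both are assumptions for illustration, so check the in-R documentation for the actual signatures.

library(ScrapeBotR)

aws_connection <- connect_aws()

# Assumed pattern: each launch step returns an updated connection object
aws_connection <- aws_launch_database(aws_connection)
aws_connection <- aws_launch_storage(aws_connection)
aws_connection <- aws_launch_instance(aws_connection)

# Persist the connection object for a later session
# (file name and signature are assumptions for illustration)
aws_save_connection(aws_connection, 'my_study_aws.rds')

# ... days/weeks/months later ...
aws_connection <- aws_load_connection('my_study_aws.rds')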

Detailed documentation is available for every function from within R.
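
For example, from an R session:

help(package = 'ScrapeBotR')
?get_run_data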

Citation

Haim, Mario (2021). ScrapeBotR. An R package to orchestrate ScrapeBot for agent-based testing. Available at https://github.com/MarHai/ScrapeBotR/.