This scraper was created for a project whose goal is to compare when info on COVID-19 cases and deaths is updated on the official websites vs. the official social media of governments around the world.
This scraper does not support social media (esp. Facebook, which is difficult to scrape). As for the websites, they fall into 4 main categories (a minimal loading sketch follows the list):
- Websites with little to no JS
  - Loaded using `wget`
- Websites (esp. dashboards) with heavy JS
  - Loaded using `selenium` (a Firefox driver that loads the page as if it were being opened in a browser)
- Websites with the info updated in a feed (usually a mixed feed, i.e. not only the COVID-19 stats get published there)
  - The feed is loaded using `wget`, then the contents of the relevant posts are loaded with `selenium`
- CSV tables
  - Loaded with `wget`
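The two loading routes can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the URL, output path, and headless option are assumptions, and the real logic lives in the scraper scripts listed below.

```python
# Minimal sketch of the two loading routes (illustrative only; the real logic
# lives in scraper_wgets.py and scraper_selenium.py).
import subprocess
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def load_with_wget(url: str, out_path: str) -> None:
    """Static pages, feeds, and CSVs: a plain wget download is enough."""
    subprocess.run(["wget", "-q", "-O", out_path, url], check=True)

def load_with_selenium(url: str) -> str:
    """JS-heavy pages and dashboards: render the page in Firefox and
    return the resulting HTML."""
    options = Options()
    options.add_argument("-headless")  # assumption: the project may run non-headless
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```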
The general pipeline (whose aim is to identify the timestamps at which websites are updated with new COVID-19 related info) looks like this. First, the source is loaded using one of the methods above. Second, the contents are processed to identify the part of the HTML/CSV relevant to reporting COVID-19 cases (deaths). Third, the contents of the new download are compared to the previous download (both the outer relevant tag and the numbers, where applicable). Finally, if the contents match, the new download is deleted; if not, it is preserved, and the timestamp and numbers are written to `data/logs/numbers.json`.
For processing HTML, I use Beautiful Soup, and for the CSVs, pandas.
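To make the compare-and-log step concrete, here is a rough sketch under some assumptions: the CSS selector, file layout, and JSON structure are placeholders, and the real per-country extraction lives in the helper classes described below.

```python
# Sketch of the compare-and-log step (illustrative; the selector, paths, and
# JSON layout are placeholders, not the project's actual format).
import json
from datetime import datetime, timezone
from pathlib import Path

from bs4 import BeautifulSoup

def relevant_fragment(html: str) -> str:
    """Return the outer tag holding the case/death counts.
    The selector is country-specific; 'div#covid-stats' is a placeholder."""
    tag = BeautifulSoup(html, "html.parser").select_one("div#covid-stats")
    return str(tag) if tag is not None else ""

def compare_and_log(new_path: Path, prev_path: Path, numbers_path: Path, country: str) -> None:
    new_fragment = relevant_fragment(new_path.read_text())
    prev_fragment = relevant_fragment(prev_path.read_text())
    if new_fragment == prev_fragment:
        new_path.unlink()  # nothing changed: drop the fresh download
        return
    # Something changed: keep the download and append the timestamp (and,
    # where applicable, the parsed numbers) to the JSON log.
    log = json.loads(numbers_path.read_text()) if numbers_path.exists() else {}
    log.setdefault(country, []).append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # "cases" / "deaths" would come from a country-specific parser here
    })
    numbers_path.write_text(json.dumps(log, indent=2))
```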
The structure of the scripts is as follows.
- `constants.py`: constants needed around the other scripts
- `check_logs.py`: checks the logs for periods of inactivity of the scraper
- `generate_numbers.py`: generates `data/logs/numbers_generated.json` using the collected HTML files in `data`
- `helpers.py`: helper functions
- `log_watcher.py`: monitors the logs and sends emails in case of errors (uses AWS SES)
- `move_files.py`: helps move files into separate folders by country when they pile up (for storing)
- `scraper_fix.py`: helps manually fix processing exceptions
  - E.g. when a website changes the way it presents the data, the parser fails, and a processing exception occurs that needs to be fixed manually.
- Helper classes (a minimal sketch of the per-country pattern follows this list):
  - `soup_wgets.py`: `SoupWgets` class with static methods that process HTML for each country
  - `soup_selenium.py`: `SoupSelenium` class with static methods that process HTML for each country
  - `soup_posts.py`: `SoupPosts` class with class methods that process HTML for each country, click the relevant post, and process the post to get the info
  - `pandas_csvs.py`: `Csv` class with static methods that process CSV for each country
- Scraper scripts: `scraper_wgets.py`, `scraper_selenium.py`, `scraper_posts.py`, `scraper_csvs.py`
- `data`: saved sources
- `logs`: logs for each scraper
- `numbers_generated.json`: the latest output of `generate_numbers.py`
  - It's more reliable than `numbers.json` because this file is generated after manually dealing with all processing exceptions.
- `numbers.json`: the file that datetime info is automatically written to during scraping (lacks processing exceptions)
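As a rough illustration of the per-country pattern in the helper classes, a `SoupWgets`-style static method might look like the sketch below. The country name, selector, and number handling are assumptions for illustration only.

```python
# Illustrative sketch of one per-country static method in the SoupWgets style.
# The selector and the thousands-separator handling are hypothetical.
import re

from bs4 import BeautifulSoup

class SoupWgets:
    @staticmethod
    def switzerland(html: str) -> dict:
        """Extract the relevant outer tag and the case/death counts from a
        wget-downloaded page (placeholder selector)."""
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.select_one("div#covid-stats")  # placeholder selector
        text = tag.get_text().replace("'", "")    # e.g. strip 1'234-style separators
        numbers = [int(n) for n in re.findall(r"\d+", text)]
        return {"tag": str(tag), "cases": numbers[0], "deaths": numbers[1]}
```

The scraper scripts would then compare the returned tag and numbers against the previous download, as described in the pipeline above.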
```sh
git clone https://github.com/digitalepidemiologylab/covid-scraper.git
cd covid-scraper
pip install -r requirements.txt
```
I use a separate tmux session for each scraper, and another one for the log watcher.