These code snippets are the core of a scraping workshop for the Hacks/Hackers Buenos Aires Media Party. It is aimed at people who have already done some Python coding but want to explore scraping in more depth.
To recreate the examples from the workshop, set up a Python virtual environment like this:
```bash
# Create the virtualenv:
virtualenv scraping-env
# Activate it:
source scraping-env/bin/activate
# Finally, install the dependencies for this workshop:
pip install -r requirements.txt
```
The workshop covers the following topics:

- Getting started with scraping in Python using requests (sketch below)
- Exploring HTML documents and extracting the data with lxml (sketch below)
- Saving scraped data to a database with dataset (sketch below)
- Thinking about ETL (Extract, Transform, Load)
- Keeping your source data around (sketch below)
- Dealing with sessions (e.g. logins), forms and searches (sketch below)
- Running multiple requests in parallel to scrape faster (sketch below)
- Performing sanity checks on your data (sketch below), for example with:
  - Sunlight's validictory
  - Colander
  - Example: the UK Spend Reporting Tool
- Understanding HTTP cache controls to check if new content is available (sketch below)
- Hiding the fact that you're scraping a site (sketch below)
- Building your own ScraperWiki with Jenkins CI
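A minimal sketch of a first request with requests; the URL is a placeholder rather than a workshop-specific site:

```python
import requests

# Fetch a page; replace the placeholder URL with the site you want to scrape.
response = requests.get('http://example.com/')
response.raise_for_status()  # Raise an exception on 4xx/5xx responses

print(response.status_code)
print(response.text[:200])  # First 200 characters of the HTML
```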
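Once a page is fetched, lxml can parse it and XPath can pull the data out. The expressions below are generic examples, not tied to any real page:

```python
import requests
from lxml import html

response = requests.get('http://example.com/')
# Parse the raw HTML into an element tree.
doc = html.fromstring(response.content)

# Query the tree with XPath; these expressions are examples only.
title = doc.xpath('//title/text()')
links = doc.xpath('//a/@href')

print(title)
print(links)
```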
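dataset creates tables and columns on the fly, so scraped rows can be saved without writing a schema first. The database path, table name and columns below are made up for illustration:

```python
import dataset

# Connect to a local SQLite database (created if it doesn't exist yet).
db = dataset.connect('sqlite:///scraped.db')
table = db['articles']  # The table is created automatically on first write.

# upsert() inserts the row, or updates it if a row with this 'url' exists.
table.upsert({'url': 'http://example.com/', 'title': 'Example'}, ['url'])

for row in table:
    print(row['url'], row['title'])
```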
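Keeping the source data around can be as simple as storing the raw HTML next to the parsed fields: if the transform step of your ETL changes later, it can be re-run without hitting the site again. A sketch, with a hypothetical raw_pages table:

```python
import dataset
import requests

db = dataset.connect('sqlite:///scraped.db')
raw = db['raw_pages']  # Hypothetical table for untouched source HTML.

url = 'http://example.com/'
response = requests.get(url)

# Store the raw page; parsing can be redone from here at any time.
raw.upsert({'url': url, 'html': response.text}, ['url'])
```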
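For sessions and logins, requests.Session keeps cookies across requests. The login URL and form field names below are hypothetical stand-ins for whatever the target site's form actually uses:

```python
import requests

# A Session carries cookies from one request to the next.
session = requests.Session()

# Hypothetical login endpoint and form fields; inspect the real
# login form to find the actual names.
session.post('http://example.com/login',
             data={'username': 'me', 'password': 'secret'})

# This request reuses the session cookie set by the login above.
response = session.get('http://example.com/members-only')
print(response.status_code)
```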
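One way to run several requests in parallel is the standard library's concurrent.futures (available on Python 2 via the futures backport). The URL list is a placeholder:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs; in a real scraper these might come from a crawl queue.
urls = ['http://example.com/page/%d' % i for i in range(1, 6)]

def fetch(url):
    return url, requests.get(url).status_code

# Fetch up to 4 pages at a time; keep this small to stay polite.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)
```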
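A sanity check with validictory, which raises a ValueError when a record doesn't match the schema; the record and schema here are invented:

```python
import validictory

# A made-up record and schema for illustration.
record = {'name': 'Example Corp', 'amount': 1024}
schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'amount': {'type': 'integer'},
    },
}

try:
    validictory.validate(record, schema)
    print('Record looks sane.')
except ValueError as error:
    print('Bad record:', error)
```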
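HTTP cache controls let a scraper ask whether anything changed before re-downloading. If the server sent an ETag, a conditional GET with If-None-Match comes back as 304 Not Modified when the content is unchanged:

```python
import requests

url = 'http://example.com/'
first = requests.get(url)
etag = first.headers.get('ETag')

# Not every server sends an ETag; only do the conditional GET if we got one.
if etag:
    second = requests.get(url, headers={'If-None-Match': etag})
    if second.status_code == 304:
        print('Not modified; skip re-scraping.')
    else:
        print('Content changed; re-scrape.')
```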
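Two common ways to look less like a scraper are sending a browser-like User-Agent (the default is python-requests, which is easy to spot) and pausing randomly between requests. The header string and URLs are examples only:

```python
import random
import time

import requests

# A browser-like User-Agent; the exact string is just an example.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Firefox/24.0'}

for url in ['http://example.com/a', 'http://example.com/b']:
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    # Random pauses look less mechanical than a fixed interval.
    time.sleep(random.uniform(1, 3))
```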
There are plenty of existing resources on scraping. A few links:
- Paul Bradshaw's Scraping for Journalists, excellent for non-coders.
- School of Data Handbook Recipes
- ScraperWiki (Classic) Docs, moving to GitHub