Key: Green means the municipality's spider is working, yellow means it is incomplete, and red means its output differs from what was previously scraped.
This is a project to understand and keep track of the governments of the 130 municipalities in Allegheny County, Pennsylvania. Because there is no (known) centralized database nor predictable structure to how those governments are organized or who runs them, we have to do things the most efficient way I can figure: webscraping each municipality's website.
`alleco` uses the open-source scraping framework Scrapy and is written in Python 3. Currently I do not believe there are any other dependencies, but we will likely use other tools to do further NLP on this data in the future.
There are a couple of ways you can use this.
- You can use the data in `results` to see what offices we have scraped and where we scraped them from.
- You can compare it to the data from the Allegheny County elections site.
- You can write new spiders! There are a lot of these municipalities to scrape.
This repository is structured as follows:
- `alleco`: The bulk of the project. There are many Scrapy-related files in here, but notably `spiders` holds all of the spiders in the project, one for every municipal body. The naming convention for spiders is `municipality_name_*`, where `*` is `b` if it is a borough, `t` if it is a township, and `c` if it is a city.
- `results`: This folder holds the scraped data, in CSV format, with each file named after the municipality spider.
- `Allegheny_Municipalities.xlsx`: An Excel file listing the municipalities and showcasing some of the manual work I did prior to deciding to webscrape the data.
- `getER.py`: Searches the `contest name` column of the municipal election records from 2017 and 2019 (found in `alleco/supp_data`) for the arguments of the script. It is used to find what offices exist for a given municipality. Some extant offices may be missing (the 2015 records didn't come in CSV format, so a six-year-term position could be missed) and some extinct offices may be listed (if the position was abolished after 2017/2019), but generally it is extremely helpful.
  - Usage: `python getER.py bethel park` for Bethel Park. Note: this will not account for abbreviations (e.g. Mt. Oliver vs. Mount Oliver), but the search uses regex, so a pattern can cover both.
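A minimal sketch of the kind of lookup `getER.py` performs, using only the standard library. The `find_offices` helper, the toy records, and the `contest name` column layout are assumptions for illustration; the actual script's column names and file handling may differ:

```python
import csv
import io
import re

def find_offices(records_csv, pattern):
    """Return contest names whose text matches the given regex (case-insensitive)."""
    offices = []
    for row in csv.DictReader(io.StringIO(records_csv)):
        if re.search(pattern, row["contest name"], re.IGNORECASE):
            offices.append(row["contest name"])
    return offices

# Toy records standing in for the files in alleco/supp_data; real records have more columns.
records = """contest name
Mayor Bethel Park
Council Bethel Park
Mayor Mt. Oliver
"""

print(find_offices(records, r"bethel park"))
# ['Mayor Bethel Park', 'Council Bethel Park']
print(find_offices(records, r"m(t\.|ount) oliver"))
# ['Mayor Mt. Oliver']
```

The second call shows how a regex alternation can cover both the abbreviated and spelled-out forms of a municipality name in one search.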
- `run.py`: A program that runs all of the spiders and compares their output with the previously scraped data, to see if anything has changed or if the websites have broken.
  - Usage: `python run.py`
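The change-detection step in `run.py` boils down to diffing a fresh scrape against the stored CSV in `results`. Here is a hedged stdlib sketch of that idea; `diff_scrape` and the sample rows are illustrative only, since the real script drives Scrapy and its exact comparison logic may differ:

```python
import csv
import io

def rows(csv_text):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(csv_text)))

def diff_scrape(old_csv, new_csv):
    """Report rows added or removed since the previous scrape."""
    old, new = rows(old_csv), rows(new_csv)
    added = [r for r in new if r not in old]
    removed = [r for r in old if r not in new]
    return added, removed

# Toy data: one office changed hands between scrapes.
old = "office,name\nMayor,Jane Doe\n"
new = "office,name\nMayor,John Roe\n"

added, removed = diff_scrape(old, new)
print(added)    # [['Mayor', 'John Roe']]
print(removed)  # [['Mayor', 'Jane Doe']]
```

If both lists come back empty, the website's output matches the last scrape (green); a nonempty diff would flag the spider for review (red in the key above).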