Due: 29 March 2023
This assignment will work a bit differently from assignment #1. To complete this assignment you will need to do the following:
- Fork this repository to your own account.
- Clone your fork to your local machine.
- Complete the assignment according to the instructions below.
- Push your changes to your fork.
- Submit a pull request to the original repository. Opening your pull request will be equivalent to submitting your assignment. You will only need to open one pull request for this assignment. If you make additional changes to your fork, they will automatically show up in the pull request you already opened. Your pull request should have your name in the title (e.g. _Assignment 02 - Mjumbe Poe_).
Write a query to answer each of the questions below.
- Your queries should produce results in the format specified by each question.
- Write your query in a SQL file corresponding to the question number (e.g. a file named `query06.sql` for the answer to question #6).
- Each SQL file should contain a single query that retrieves data from the database (i.e. a `SELECT` query).
- Any SQL that does things other than retrieve data (e.g. SQL that creates indexes or updates columns) should be in the `db_structure.sql` file.
- Some questions include a request for you to discuss your methods. Update this README file with your answers in the appropriate place.
There are several datasets that are prescribed for you to use in this part. Your dataset tables should be named:
- `septa.bus_stops` (SEPTA GTFS -- Use the file for February 26, 2023)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE septa.bus_stops (
        stop_id TEXT,
        stop_name TEXT,
        stop_lat DOUBLE PRECISION,
        stop_lon DOUBLE PRECISION,
        location_type TEXT,
        parent_station TEXT,
        zone_id TEXT,
        wheelchair_boarding INTEGER
    );
    ```
- `septa.bus_routes` (SEPTA GTFS)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE septa.bus_routes (
        route_id TEXT,
        route_short_name TEXT,
        route_long_name TEXT,
        route_type TEXT,
        route_color TEXT,
        route_text_color TEXT,
        route_url TEXT
    );
    ```
- `septa.bus_trips` (SEPTA GTFS)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE septa.bus_trips (
        route_id TEXT,
        service_id TEXT,
        trip_id TEXT,
        trip_headsign TEXT,
        block_id TEXT,
        direction_id TEXT,
        shape_id TEXT
    );
    ```
- `septa.bus_shapes` (SEPTA GTFS)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE septa.bus_shapes (
        shape_id TEXT,
        shape_pt_lat DOUBLE PRECISION,
        shape_pt_lon DOUBLE PRECISION,
        shape_pt_sequence INTEGER
    );
    ```
- `septa.rail_stops` (SEPTA GTFS)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE septa.rail_stops (
        stop_id TEXT,
        stop_name TEXT,
        stop_desc TEXT,
        stop_lat DOUBLE PRECISION,
        stop_lon DOUBLE PRECISION,
        zone_id TEXT,
        stop_url TEXT
    );
    ```
- `phl.pwd_parcels` (OpenDataPhilly)
  - In the tests, this data will be loaded in with a geography column named `geog`, and all field names will be lowercased. If you use `ogr2ogr` to load the file, I recommend you use the following options (remember to replace the variables with the appropriate values, and replace the backslashes (`\`) with backticks (`` ` ``) if you're using PowerShell):

    ```bash
    ogr2ogr \
      -f "PostgreSQL" \
      PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
      -nln phl.pwd_parcels \
      -nlt MULTIPOLYGON \
      -t_srs EPSG:4326 \
      -lco GEOMETRY_NAME=geog \
      -lco GEOM_TYPE=GEOGRAPHY \
      -overwrite \
      "${DATA_DIR}/phl_pwd_parcels/PWD_PARCELS.shp"
    ```
- `azavea.neighborhoods` (Azavea's GitHub)
  - In the tests, this data will be loaded in with a geography column named `geog`, and all field names will be lowercased. If you use `ogr2ogr` to load the file, I recommend you use the following options (remember to replace the variables with the appropriate values, and replace the backslashes (`\`) with backticks (`` ` ``) if you're using PowerShell):

    ```bash
    ogr2ogr \
      -f "PostgreSQL" \
      PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
      -nln azavea.neighborhoods \
      -nlt MULTIPOLYGON \
      -lco GEOMETRY_NAME=geog \
      -lco GEOM_TYPE=GEOGRAPHY \
      -overwrite \
      "${DATA_DIR}/Neighborhoods_Philadelphia.geojson"
    ```
- `census.blockgroups_2020` (Census TIGER FTP -- each state has its own file; use file number `42` for PA)
  - In the tests, this data will be loaded in with a geography column named `geog`, and all field names will be lowercased. If you use `ogr2ogr` to load the file, I recommend you use the following options (remember to replace the variables with the appropriate values, and replace the backslashes (`\`) with backticks (`` ` ``) if you're using PowerShell):

    ```bash
    ogr2ogr \
      -f "PostgreSQL" \
      PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
      -nln census.blockgroups_2020 \
      -nlt MULTIPOLYGON \
      -lco GEOMETRY_NAME=geog \
      -lco GEOM_TYPE=GEOGRAPHY \
      -overwrite \
      "$DATADIR/census_blockgroups_2020/tl_2020_42_bg.shp"
    ```
- `census.population_2020` (Census Explorer)
  - In the tests, the initial table will have the following structure:

    ```sql
    CREATE TABLE census.population_2020 (
        geoid TEXT,
        geoname TEXT,
        total INTEGER
    );
    ```

  - Note that the file from the Census Explorer will have more fields than those three. You may have to do some data preprocessing to get the data into the correct format.
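    One possible preprocessing path (a minimal sketch, not a requirement -- the staging column names are assumptions about the Census Explorer export, so match them to the actual CSV header):

    ```sql
    -- A hypothetical staging table for the raw export; adjust columns to match the actual file.
    CREATE TABLE census.population_2020_raw (
        geoid TEXT,
        geoname TEXT,
        total TEXT
        -- ...plus whatever extra columns the Census Explorer export contains
    );

    -- Keep only the three fields the tests expect, casting the population count to an integer.
    INSERT INTO census.population_2020 (geoid, geoname, total)
    SELECT geoid, geoname, total::integer
    FROM census.population_2020_raw;
    ```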
Note: when tests aren't passing, I do take your logic for solving the problems into account for partial credit when grading. When in doubt, write out your thinking for solving the problem even if you aren't able to produce a complete solution.
1. Which eight bus stops have the largest population within 800 meters? As a rough estimation, consider any block group that intersects the buffer as being part of the 800 meter buffer.
2. Which eight bus stops have the smallest population above 500 people inside of Philadelphia within 800 meters of the stop? (Philadelphia County block groups have a `geoid` prefix of `42101` -- that's `42` for the state of PA and `101` for Philadelphia County.)

    The queries for #1 & #2 should generate results with the following structure:

    ```sql
    (
        stop_name text,             -- The name of the station
        estimated_pop_800m integer, -- The population within 800 meters
        geog geography              -- The geography of the bus stop
    )
    ```
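    A minimal sketch of one way to approach #1, assuming you have added a `geog` geography column to `septa.bus_stops` (for example in `db_structure.sql`) and that the `geoid` values in `census.blockgroups_2020` and `census.population_2020` line up:

    ```sql
    -- Sketch for #1: any block group within 800 m of a stop counts fully toward that stop's population.
    SELECT
        stops.stop_name,
        SUM(pop.total)::integer AS estimated_pop_800m,
        stops.geog
    FROM septa.bus_stops AS stops
    JOIN census.blockgroups_2020 AS bg
        ON ST_DWithin(stops.geog, bg.geog, 800)  -- same as the block group intersecting an 800 m buffer
    JOIN census.population_2020 AS pop
        ON pop.geoid = bg.geoid                  -- assumes the two geoid formats match; preprocess if not
    GROUP BY stops.stop_id, stops.stop_name, stops.geog
    ORDER BY estimated_pop_800m DESC
    LIMIT 8;
    ```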
3. Using the Philadelphia Water Department Stormwater Billing Parcels dataset, pair each parcel with its closest bus stop. The final result should give the parcel address, bus stop name, and distance apart in meters. Order by distance (largest on top). Your query should run in under two minutes.

    HINT: This is a nearest neighbor problem.

    Structure:

    ```sql
    (
        parcel_address text,       -- The address of the parcel
        stop_name text,            -- The name of the bus stop
        distance double precision  -- The distance apart in meters
    )
    ```
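    One common pattern for nearest-neighbor problems (a sketch, not the required solution) is a `LATERAL` join ordered by the KNN operator `<->`, which can use a spatial index; the parcel column names below (`address`, `geog`) are assumptions about the loaded table:

    ```sql
    -- Sketch: for each parcel, find the single closest bus stop with a KNN lateral join.
    -- Assumes GIST indexes on the geog columns, e.g. CREATE INDEX ON septa.bus_stops USING GIST (geog);
    SELECT
        parcels.address AS parcel_address,
        nearest.stop_name,
        nearest.distance
    FROM phl.pwd_parcels AS parcels
    CROSS JOIN LATERAL (
        SELECT
            stops.stop_name,
            ST_Distance(parcels.geog, stops.geog) AS distance
        FROM septa.bus_stops AS stops
        ORDER BY parcels.geog <-> stops.geog     -- KNN operator: index-assisted ordering by distance
        LIMIT 1
    ) AS nearest
    ORDER BY nearest.distance DESC;
    ```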
4. Using the `bus_shapes`, `bus_routes`, and `bus_trips` tables from the GTFS bus feed, find the two routes with the longest trips. Your query should run in under two minutes.

    HINT: The `ST_MakeLine` function is useful here. You can see an example of how you could use it at this MobilityData walkthrough on using GTFS data. If you find other good examples, please share them in Slack.

    HINT: Use the query planner (`EXPLAIN`) to see if there might be opportunities to speed up your query with indexes. For reference, I got this query to run in about 15 seconds.

    HINT: The `row_number` window function could also be useful here. You can read more about window functions in the PostgreSQL documentation. That documentation page uses the `rank` function, which is very similar to `row_number`. For more info about window functions you can check out:
    - 📑 An Easy Guide to Advanced SQL Window Functions in Towards Data Science, by Julia Kho
    - 🎥 SQL Window Functions for Data Scientists (and a follow-up with examples) on YouTube, by Emma Ding

    Structure:

    ```sql
    (
        route_short_name text,         -- The short name of the route
        trip_headsign text,            -- Headsign of the trip
        shape_geog geography,          -- The shape of the trip
        shape_length double precision  -- Length of the trip in meters
    )
    ```
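    A sketch of how the hinted pieces might fit together (one approach among several; it keeps only the longest trip per route before ranking routes):

    ```sql
    -- Sketch for #4: assemble each trip's shape into a line, measure it, keep the two longest routes.
    WITH trip_shapes AS (
        SELECT
            shape_id,
            ST_MakeLine(
                ST_SetSRID(ST_MakePoint(shape_pt_lon, shape_pt_lat), 4326)
                ORDER BY shape_pt_sequence
            )::geography AS shape_geog
        FROM septa.bus_shapes
        GROUP BY shape_id
    ),
    ranked_trips AS (
        SELECT
            trips.route_id,
            trips.trip_headsign,
            shapes.shape_geog,
            ST_Length(shapes.shape_geog) AS shape_length,
            ROW_NUMBER() OVER (
                PARTITION BY trips.route_id
                ORDER BY ST_Length(shapes.shape_geog) DESC
            ) AS rn                                   -- 1 = the longest trip on each route
        FROM septa.bus_trips AS trips
        JOIN trip_shapes AS shapes ON shapes.shape_id = trips.shape_id
    )
    SELECT
        routes.route_short_name,
        ranked_trips.trip_headsign,
        ranked_trips.shape_geog,
        ranked_trips.shape_length
    FROM ranked_trips
    JOIN septa.bus_routes AS routes ON routes.route_id = ranked_trips.route_id
    WHERE ranked_trips.rn = 1
    ORDER BY ranked_trips.shape_length DESC
    LIMIT 2;
    ```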
5. Rate neighborhoods by their bus stop accessibility for wheelchairs. Use Azavea's neighborhood dataset from OpenDataPhilly along with an appropriate dataset from the SEPTA GTFS bus feed. Use the GTFS documentation for help. Use some creativity in the metric you devise in rating neighborhoods.

    NOTE: There is no automated test for this question, as there's no one right answer. With urban data analysis, this is frequently the case.

    Discuss your accessibility metric and how you arrived at it below:

    Description:
6. What are the top five neighborhoods according to your accessibility metric?

7. What are the bottom five neighborhoods according to your accessibility metric?

    Both #6 and #7 should have the structure:

    ```sql
    (
        neighborhood_name text,             -- The name of the neighborhood
        accessibility_metric ...,           -- Your accessibility metric value
        num_bus_stops_accessible integer,
        num_bus_stops_inaccessible integer
    )
    ```
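    As a starting point for the stop counts in #5-#7, here is a sketch of counting accessible and inaccessible stops per neighborhood (the metric itself is left to you; the neighborhood `name` column and the `geog` column on `septa.bus_stops` are assumptions):

    ```sql
    -- Sketch: count wheelchair-accessible vs. inaccessible stops inside each neighborhood.
    -- In GTFS, wheelchair_boarding = 1 means accessible boarding, 2 means not accessible, 0/blank is unknown.
    SELECT
        hoods.name AS neighborhood_name,      -- column name is an assumption; check the loaded table
        COUNT(*) FILTER (WHERE stops.wheelchair_boarding = 1) AS num_bus_stops_accessible,
        COUNT(*) FILTER (WHERE stops.wheelchair_boarding = 2) AS num_bus_stops_inaccessible
    FROM azavea.neighborhoods AS hoods
    JOIN septa.bus_stops AS stops
        ON ST_Covers(hoods.geog, stops.geog)  -- the stop falls inside the neighborhood
    GROUP BY hoods.name;
    ```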
8. With a query, find out how many census block groups Penn's main campus fully contains. Discuss which dataset you chose for defining Penn's campus.

    Structure (should be a single value):

    ```sql
    (
        count_block_groups integer
    )
    ```
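    A sketch of the containment check, assuming you load whichever campus boundary you choose into a hypothetical `penn.campus` table with a `geog` column (the table name and source dataset are yours to pick and document below):

    ```sql
    -- Sketch: count block groups whose geography is entirely covered by the campus boundary.
    -- penn.campus is a hypothetical table; load your chosen campus dataset into it first.
    SELECT COUNT(*)::integer AS count_block_groups
    FROM census.blockgroups_2020 AS bg
    JOIN penn.campus AS campus
        ON ST_Covers(campus.geog, bg.geog);  -- "fully contains": every point of the block group is inside
    ```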
    Discussion:
9. With a query involving PWD parcels and census block groups, find the `geo_id` of the block group that contains Meyerson Hall. `ST_MakePoint()` and functions like that are not allowed.

    Structure (should be a single value):

    ```sql
    (
        geo_id text
    )
    ```
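    One way to avoid constructing a point by hand (a sketch; the parcel `address` column and the exact address string are assumptions you will need to verify against the data):

    ```sql
    -- Sketch: locate Meyerson Hall through its PWD parcel, then find the block group covering it.
    SELECT bg.geoid AS geo_id                             -- TIGER's GEOID field, aliased to the expected name
    FROM phl.pwd_parcels AS parcels
    JOIN census.blockgroups_2020 AS bg
        ON ST_Covers(bg.geog, ST_Centroid(parcels.geog))  -- block group covering the parcel's centroid
    WHERE parcels.address ILIKE '%210 S 34TH%';           -- hypothetical match; confirm Meyerson Hall's parcel address
    ```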
10. You're tasked with giving more contextual information to rail stops to fill the `stop_desc` field in a GTFS feed. Using any of the datasets above, PostGIS functions (e.g., `ST_Distance`, `ST_Azimuth`, etc.), and PostgreSQL string functions, build a description (aliased as `stop_desc`) for each stop. Feel free to supplement with other datasets (you must provide a link to the data used so it's reproducible) and other methods of describing the relationships. SQL's `CASE` statements may be helpful for some operations.

    Structure:

    ```sql
    (
        stop_id integer,
        stop_name text,
        stop_desc text,
        stop_lon double precision,
        stop_lat double precision
    )
    ```

    As an example, your `stop_desc` for a station stop may be something like "37 meters NE of 1234 Market St" (that's only an example; feel free to be creative, silly, descriptive, etc.).
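    A sketch of one way to combine the hinted functions, describing each rail stop relative to its nearest parcel. It assumes a `geog` column on `septa.rail_stops`, an `address` column on the parcels, and that rail `stop_id` values are numeric; the compass bucketing is just one rough way to turn an azimuth into a direction label:

    ```sql
    -- Sketch: describe each rail stop by its distance and rough compass direction from the nearest parcel.
    SELECT
        stops.stop_id::integer AS stop_id,      -- the raw GTFS field is text; assumes the ids are numeric
        stops.stop_name,
        ROUND(nearest.distance) || ' meters ' ||
        CASE                                    -- bucket degrees clockwise from north into 8 directions
            WHEN nearest.bearing < 22.5 OR nearest.bearing >= 337.5 THEN 'N'
            WHEN nearest.bearing < 67.5  THEN 'NE'
            WHEN nearest.bearing < 112.5 THEN 'E'
            WHEN nearest.bearing < 157.5 THEN 'SE'
            WHEN nearest.bearing < 202.5 THEN 'S'
            WHEN nearest.bearing < 247.5 THEN 'SW'
            WHEN nearest.bearing < 292.5 THEN 'W'
            ELSE 'NW'
        END || ' of ' || nearest.address AS stop_desc,
        stops.stop_lon,
        stops.stop_lat
    FROM septa.rail_stops AS stops
    CROSS JOIN LATERAL (
        SELECT
            parcels.address,
            ST_Distance(stops.geog, parcels.geog) AS distance,
            degrees(ST_Azimuth(ST_Centroid(parcels.geog::geometry),
                               stops.geog::geometry)) AS bearing  -- planar bearing; close enough for a label
        FROM phl.pwd_parcels AS parcels
        ORDER BY stops.geog <-> parcels.geog
        LIMIT 1
    ) AS nearest;
    ```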
    Tip when experimenting: use subqueries to limit your query to just a few rows to keep query times faster. Once your query is giving you the answers you want, scale it up. E.g., instead of `FROM tablename`, use `FROM (SELECT * FROM tablename LIMIT 10) AS t`.