Skip to content

Latest commit

 

History

History
253 lines (215 loc) · 13 KB

README.md

File metadata and controls

253 lines (215 loc) · 13 KB

Assignment 02

Due: 29 March 2023

This assignment will work a bit differently than assignment #1. To complete this assigment you will need to do the following:

  1. Fork this repository to your own account.
  2. Clone your fork to your local machine.
  3. Complete the assignment according to the instructions below.
  4. Push your changes to your fork.
  5. Submit a pull request to the original repository. Opening your pull request will be equivalent to you submitting your assignment. You will only need to open one pull request for this assignment. If you make additional changes to your fork, they will automatically show up in the pull request you already opened. Your pull request should have your name in the title (e.g. Assignment 02 - Mjumbe Poe).

Instructions

Write a query to answer each of the questions below.

  • Your queries should produce results in the format specified by each question.
  • Write your query in a SQL file corresponding to the question number (e.g. a file named query06.sql for the answer to question #6).
  • Each SQL file should contain a single query that retrieves data from the database (i.e. a SELECT query).
  • Any SQL that does things other than retrieve data (e.g. SQL that creates indexes or update columns) should be in the db_structure.sql file.
  • Some questions include a request for you to discuss your methods. Update this README file with your answers in the appropriate place.

There are several datasets that are prescribed for you to use in this part. Your datasets tables be named:

  • septa.bus_stops (SEPTA GTFS -- Use the file for February 26, 2023)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE septa.bus_stops (
          stop_id TEXT,
          stop_name TEXT,
          stop_lat DOUBLE PRECISION,
          stop_lon DOUBLE PRECISION,
          location_type TEXT,
          parent_station TEXT,
          zone_id TEXT,
          wheelchair_boarding INTEGER
      );
  • septa.bus_routes (SEPTA GTFS)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE septa.bus_routes (
          route_id TEXT,
          route_short_name TEXT,
          route_long_name TEXT,
          route_type TEXT,
          route_color TEXT,
          route_text_color TEXT,
          route_url TEXT
      );
  • septa.bus_trips (SEPTA GTFS)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE septa.bus_trips (
          route_id TEXT,
          service_id TEXT,
          trip_id TEXT,
          trip_headsign TEXT,
          block_id TEXT,
          direction_id TEXT,
          shape_id TEXT
      );
  • septa.bus_shapes (SEPTA GTFS)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE septa.bus_shapes (
          shape_id TEXT,
          shape_pt_lat DOUBLE PRECISION,
          shape_pt_lon DOUBLE PRECISION,
          shape_pt_sequence INTEGER
      );
  • septa.rail_stops (SEPTA GTFS)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE septa.rail_stops (
          stop_id TEXT,
          stop_name TEXT,
          stop_desc TEXT,
          stop_lat DOUBLE PRECISION,
          stop_lon DOUBLE PRECISION,
          zone_id TEXT,
          stop_url TEXT
      );
  • phl.pwd_parcels (OpenDataPhilly)
    • In the tests, this data will be loaded in with a geography column named geog, and all field names will be lowercased. If you use ogr2ogr to load the file, I recommend you use the following options:
      ogr2ogr \
          -f "PostgreSQL" \
          PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
          -nln phl.pwd_parcels \
          -nlt MULTIPOLYGON \
          -t_srs EPSG:4326 \
          -lco GEOMETRY_NAME=geog \
          -lco GEOM_TYPE=GEOGRAPHY \
          -overwrite \
          "${DATA_DIR}/phl_pwd_parcels/PWD_PARCELS.shp"
      (remember to replace the variables with the appropriate values, and replace the backslashes (\) with backticks (`) if you're using PowerShell)
  • azavea.neighborhoods (Azavea's GitHub)
    • In the tests, this data will be loaded in with a geography column named geog, and all field names will be lowercased. If you use ogr2ogr to load the file, I recommend you use the following options:
      ogr2ogr \
          -f "PostgreSQL" \
          PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
          -nln azavea.neighborhoods \
          -nlt MULTIPOLYGON \
          -lco GEOMETRY_NAME=geog \
          -lco GEOM_TYPE=GEOGRAPHY \
          -overwrite \
          "${DATA_DIR}/Neighborhoods_Philadelphia.geojson"
      (remember to replace the variables with the appropriate values, and replace the backslashes (\) with backticks (`) if you're using PowerShell)
  • census.blockgroups_2020 (Census TIGER FTP -- Each state has it's own file; Use file number 42 for PA)
    • In the tests, this data will be loaded in with a geography column named geog, and all field names will be lowercased. If you use ogr2ogr to load the file, I recommend you use the following options:
      ogr2ogr \
          -f "PostgreSQL" \
          PG:"host=localhost port=$PGPORT dbname=$PGNAME user=$PGUSER password=$PGPASS" \
          -nln census.blockgroups_2020 \
          -nlt MULTIPOLYGON \
          -lco GEOMETRY_NAME=geog \
          -lco GEOM_TYPE=GEOGRAPHY \
          -overwrite \
          "$DATADIR/census_blockgroups_2020/tl_2020_42_bg.shp"
      (remember to replace the variables with the appropriate values, and replace the backslashes (\) with backticks (`) if you're using PowerShell)
  • census.population_2020 (Census Explorer)
    • In the tests, the initial table will have the following structure:
      CREATE TABLE census.population_2020 (
          geoid TEXT,
          geoname TEXT,
          total INTEGER
      );
    • Note that the file from the Census Explorer will have more fields than those three. You may have to do some data preprocessing to get the data into the correct format.

Note, when tests aren't passing, I do take logic for solving problems into account for partial credit when grading. When in doubt, write your thinking for solving the problem even if you aren't able to get a full response.

Questions

  1. Which eight bus stop have the largest population within 800 meters? As a rough estimation, consider any block group that intersects the buffer as being part of the 800 meter buffer.

  2. Which eight bus stops have the smallest population above 500 people inside of Philadelphia within 800 meters of the stop (Philadelphia county block groups have a geoid prefix of 42101 -- that's 42 for the state of PA, and 101 for Philadelphia county)?

    The queries to #1 & #2 should generate results with a single row, with the following structure:

    (
        stop_name text, -- The name of the station
        estimated_pop_800m integer, -- The population within 800 meters
        geog geography -- The geography of the bus stop
    )
  3. Using the Philadelphia Water Department Stormwater Billing Parcels dataset, pair each parcel with its closest bus stop. The final result should give the parcel address, bus stop name, and distance apart in meters. Order by distance (largest on top).

    Your query should run in under two minutes.

    _HINT: This is a nearest neighbor problem.

    Structure:

    (
        parcel_address text,  -- The address of the parcel
        stop_name text,  -- The name of the bus stop
        distance double precision  -- The distance apart in meters
    )
  4. Using the bus_shapes, bus_routes, and bus_trips tables from GTFS bus feed, find the two routes with the longest trips.

    Your query should run in under two minutes.

    HINT: The ST_MakeLine function is useful here. You can see an example of how you could use it at this MobilityData walkthrough on using GTFS data. If you find other good examples, please share them in Slack.

    HINT: Use the query planner (EXPLAIN) to see if there might be opportunities to speed up your query with indexes. For reference, I got this query to run in about 15 seconds.

    HINT: The row_number window function could also be useful here. You can read more about window functions in the PostgreSQL documentation. That documentation page uses the rank function, which is very similar to row_number. For more info about window functions you can check out:

    Structure:

    (
        route_short_name text,  -- The short name of the route
        trip_headsign text,  -- Headsign of the trip
        shape_geog geography,  -- The shape of the trip
        shape_length double precision  -- Length of the trip in meters
    )
  5. Rate neighborhoods by their bus stop accessibility for wheelchairs. Use Azavea's neighborhood dataset from OpenDataPhilly along with an appropriate dataset from the Septa GTFS bus feed. Use the GTFS documentation for help. Use some creativity in the metric you devise in rating neighborhoods.

    NOTE: There is no automated test for this question, as there's no one right answer. With urban data analysis, this is frequently the case.

    Discuss your accessibility metric and how you arrived at it below:

    Description:

  6. What are the top five neighborhoods according to your accessibility metric?

  7. What are the bottom five neighborhoods according to your accessibility metric?

    Both #6 and #7 should have the structure:

    (
      neighborhood_name text,  -- The name of the neighborhood
      accessibility_metric ...,  -- Your accessibility metric value
      num_bus_stops_accessible integer,
      num_bus_stops_inaccessible integer
    )
  8. With a query, find out how many census block groups Penn's main campus fully contains. Discuss which dataset you chose for defining Penn's campus.

    Structure (should be a single value):

    (
        count_block_groups integer
    )

    Discussion:

  9. With a query involving PWD parcels and census block groups, find the geo_id of the block group that contains Meyerson Hall. ST_MakePoint() and functions like that are not allowed.

    Structure (should be a single value):

    (
        geo_id text
    )
  10. You're tasked with giving more contextual information to rail stops to fill the stop_desc field in a GTFS feed. Using any of the data sets above, PostGIS functions (e.g., ST_Distance, ST_Azimuth, etc.), and PostgreSQL string functions, build a description (alias as stop_desc) for each stop. Feel free to supplement with other datasets (must provide link to data used so it's reproducible), and other methods of describing the relationships. SQL's CASE statements may be helpful for some operations.

    Structure:

    (
        stop_id integer,
        stop_name text,
        stop_desc text,
        stop_lon double precision,
        stop_lat double precision
    )

As an example, your stop_desc for a station stop may be something like "37 meters NE of 1234 Market St" (that's only an example, feel free to be creative, silly, descriptive, etc.)

Tip when experimenting: Use subqueries to limit your query to just a few rows to keep query times faster. Once your query is giving you answers you want, scale it up. E.g., instead of FROM tablename, use FROM (SELECT * FROM tablename limit 10) as t.