GTFS data segmentation using R #219

subha-nair · 2024-12-08T23:01:12Z

subha-nair
Dec 8, 2024

Hi, is there a way to cut the GTFS shapes file into segments where they overlap, and visualise the overlapped lines as a single line with frequency of bus routes on them (shown as line thickness). I am new to R so a step by step process would be helpful.

polettif · 2024-12-17T07:39:48Z

polettif
Dec 17, 2024
Maintainer

There is a way but unfortunately there is no easy step-by-step workflow to do this. The main issue is identifying overlaps. If your feed has precise shapes.txt coordinates you can use sf::st_segmentize() to create segments. You then compare all segments. I'd guess you'd like to summarise shapes that are "close-by but not equal", then you'd have to use st_distance(..., which="Hausdorff") or something like that.

So it's mainly a not-so-trivial geospatial issue. As you're new to R I'd suggest looking at Geocomputation with R, I don't think I can help you with that further.

Summarising the number of trips (not routes, as a route might have more than one trip with a corresponding shape) on a shape can be done like this.

library(tidytransit)
library(dplyr)

g = gtfs_duke |>
    gtfs_as_sf()

trip_counts = g$trips |>
    distinct(trip_id, shape_id) |>
    count(shape_id, name = "n_trips")

shapes = left_join(g$shapes, trip_counts, "shape_id")

library(ggplot2)
ggplot(shapes) +
    geom_sf(aes(linewidth = n_trips))

^{Created on 2024-12-17 with reprex v2.1.1}

Note that in this plot there are several overlapping shapes so the linewidth doesn't really show the right number of trips. So this is just a starting point and doesn't solve the issue. To fix overlaps you'd have to solve the geospatial issue.

2 replies

idshklein Dec 17, 2024

if you could give an example workflow to the hausdorf method mentioned it will be appriciated

polettif Dec 17, 2024
Maintainer

As I said, it's not a trivial example to do. If you want to do matching you need to link segments with shapes (which then can be linked to trips and routes). You can create segments like this:

library(dplyr)
library(tidytransit)

g = gtfs_as_sf(gtfs_duke, crs = 32119) # EPSG code for the feed area

shape_segments = g$shapes |>
	sf::st_segmentize(50) |> # create segments with at most 50 m distance 
	nngeo::st_segments() |> # split shape linestrings into segmented linestrings
	mutate(segment_id = paste0(shape_id, "_", row_number()), .by = shape_id)

After that it essentially becomes a map matching problem and thus heavily depends on how your data is structured. Do you want to match each direction independently? What's the maximum distance you consider two shape segments as "equal"? Do you want to match routes to existing path segments (if so, which segment should be used?), to newly created segments or another data source?

I'm sorry if this is not really helpful but in my experience gtfs feeds rarely provide clean enough data to use without quite some experimenting and fiddling. I did try it with the duke dataset for a bit but didn't get it to work. Also, my initial guess with Hausdorff distance probably doesn't work since two consecutive segments have a distance of 0 but you probably don't want to match those.

polettif · 2024-12-19T17:03:52Z

polettif
Dec 19, 2024
Maintainer

In the spirit of advent of code I couldn't help but tinker with this a bit more 🎄🎅 🎁

The following code matches nearby shape points and finds overlapping shapes along segments. It looks ok but I didn't inspect the plausibility of the example dataset further and there might be pitfalls with other dataset. So it's not a catch-all workflow but it should provide a starting point.

Snap nearby shape coordinates to allow matching

library(dplyr)
library(sf)
library(tidytransit)

# Add interpolated points along shape lines
add_shape_points = function(gtfs_shapes, crs, max_segment_dist) {
    gtfs_shapes |>
        shapes_as_sf(crs) |>
        st_segmentize(max_segment_dist) |>
        tidytransit:::sf_lines_to_df() |> # direct access fn in tidytransit::sf_as_tbl()
        as_tibble()
}

# Build hierarchical clusters with all gtfs-shape points and combine them
# within max_dist. The returned `shapes_snapped` has the same number of shape points
# but their lonlat coordinates have been change to their cluster's center.
snap_simplify_shapes = function(gtfs_shapes, crs, max_cluster_dist) {
    stopifnot(inherits(gtfs_shapes, "data.frame"))
    stopifnot(!inherits(gtfs_shapes, "sf"))

    # We don't want to convert shapes to actual lines (with gtfs_sf_sf() or shapes_as_sf()),
    # the point coordinates in shapes.txt are actually already fine for our use case.
    # However, we need to convert them from lon/lat to a projected coordinate system,
    # for the duke example feed it's EPSG:32119
    shapes = gtfs_shapes |>
        st_as_sf(coords = c("shape_pt_lon", "shape_pt_lat"), crs = 4326, remove = FALSE) |>
        st_transform(crs)

    # add point coordinates as columns, round coordinates
    shapes <- shapes |>
        bind_cols(round(st_coordinates(shapes), 1)) |>
        st_drop_geometry() |>
        mutate(shape_point_id = paste0(X, "-", Y))

    # use only unique point coordinates, reduce clustering runtime
    coords = shapes |>
        distinct(shape_point_id, X, Y)

    # hierarchical clustering
    htree = cluster::agnes(coords[,c("X", "Y")])

    cluster_coords = function(htree, coords, k) {
        coords$cluster <- cutree(htree, k = k)

        # Use mean coordinates as cluster center
        # Maybe this data is already available somewhere with the cluster package
        coords_clustered = coords |>
            group_by(cluster) |>
            mutate(cluster_X = mean(X), cluster_Y = mean(Y)) |>
            ungroup()

        # calculate euclidian distance to the cluster center
        cluster_stats = coords_clustered |>
            mutate(dist_to_mean = sqrt((X-cluster_X)^2+(Y-cluster_Y)^2)) |>
            summarise(min_dist = min(dist_to_mean), max_dist = max(dist_to_mean)) |>
            ungroup()

        attributes(coords_clustered)$cluster_stats <- cluster_stats
        attributes(coords_clustered)$max_dist <- max(cluster_stats$max_dist)
        coords_clustered
    }

    # find k value where distance to cluster center is less than max_dist (for all clusters)
    # first increase k in bigger intervals, afterwards finetune k back in 1-step-decrements
    for(k in seq(1, nrow(coords), 100)) {
        coords_clustered = cluster_coords(htree, coords, k)
        if(attributes(coords_clustered)$max_dist < max_cluster_dist) break
    }
    for(k in seq(k, 1, -1)) {
        coords_clustered = cluster_coords(htree, coords, k)
        if(attributes(coords_clustered)$max_dist > max_cluster_dist) break
    }

    # join shapes with new clustered coordinates and transform back to WGS 84
    shapes_snapped = shapes |>
        inner_join(coords_clustered, c("shape_point_id", "X", "Y")) |>
        select(-shape_pt_lat, -shape_pt_lon, -X, -Y) |>
        st_as_sf(coords = c("cluster_X", "cluster_Y"), crs = crs) |>
        st_transform(4326)

    # use new cluster lon/lat coordinates and trim columns to original columns
    shapes_snapped <- shapes_snapped |>
        bind_cols(round(st_coordinates(shapes_snapped), 6)) |> # up to 11 cm accuracy
        st_drop_geometry() |>
        rename(shape_pt_lon = X, shape_pt_lat = Y)

    shapes_snapped <- shapes_snapped[,colnames(gtfs_shapes)]

    return(shapes_snapped)
}

With the functions out of the way, let's simplify the shapes for the example gtfs_duke dataset:

# Option 1: snap shapes with existing shape point coordinates
# new_shapes = snap_simplify_shapes(gtfs_duke$shapes, 32119, 15)

# Option 2: add more points along shape line to better match lines
# between original shape points
new_shapes = gtfs_duke$shapes |>
    add_shape_points(32119, 20) |>
    snap_simplify_shapes(32119, 30)

Compare original shapes to snapped shapes:

mapview::mapview(shapes_as_sf(gtfs_duke$shapes), color = "blue", legend = F)@map |>
    leaflet::setView(-78.934, 36.005, 17)

mapview::mapview(shapes_as_sf(new_shapes), color = "red", legend = F)@map |>
    leaflet::setView(-78.934, 36.005, 17)

Find overlapping segments of shapes

# Create "segments" between individual shape points and create a table linking
# shape_ids and newly created segment_ids. Each shape has 1:n segments (probably
# way more than 1) and each segment has 1:m shapes. Segments are one-directional.
# You can build segments with the original shapes table (i.e. without snapping the
# shape coordinates in the previous step if data is already good enough
get_shape_segments = function(gtfs_shapes) {
    gtfs_shapes_offset = gtfs_shapes |>
        mutate(shape_pt_sequence = shape_pt_sequence-1) |>
        filter(shape_pt_sequence > 0) |>
        select(shape_id, shape_pt_lon, shape_pt_lat, shape_pt_sequence)

    # use linestring wkt as id
    # Note that there are overlapping segments with different directions (A->B and B->A) 
    # which are handled separately in the following workflow. To change this we'd need to
    # create the same id for both directions (for example by sorting the coordinates before hashing)
    create_segment_wkt = function(x1,y1,x2,y2) {
        sprintf("LINESTRING (%.6f %.6f, %.6f %.6f)", x1, y1, x2, y2)
    }

    gtfs_shape_segments = inner_join(gtfs_shapes, gtfs_shapes_offset, c("shape_id", "shape_pt_sequence")) |>
        mutate(segment_id = create_segment_wkt(shape_pt_lon.x, shape_pt_lat.x, shape_pt_lon.y, shape_pt_lat.y)) |>
        distinct(shape_id, segment_id)

    return(gtfs_shape_segments)
}

(shape_segments = get_shape_segments(new_shapes))
#> # A tibble: 8,355 × 2
#>    shape_id segment_id                                             
#>    <chr>    <chr>                                                  
#>  1 p_111287 LINESTRING (-78.914605 36.006000, -78.914605 36.006000)
#>  2 p_111287 LINESTRING (-78.914605 36.006000, -78.914368 36.005660)
#>  3 p_111287 LINESTRING (-78.914368 36.005660, -78.914368 36.005660)
#>  4 p_111287 LINESTRING (-78.914368 36.005660, -78.914435 36.005321)
#>  5 p_111287 LINESTRING (-78.914435 36.005321, -78.914435 36.005321)
#>  6 p_111287 LINESTRING (-78.914435 36.005321, -78.914761 36.005071)
#>  7 p_111287 LINESTRING (-78.914761 36.005071, -78.914761 36.005071)
#>  8 p_111287 LINESTRING (-78.914761 36.005071, -78.915102 36.005272)
#>  9 p_111287 LINESTRING (-78.915102 36.005272, -78.915102 36.005272)
#> 10 p_111287 LINESTRING (-78.915102 36.005272, -78.915571 36.005285)
#> # ℹ 8,345 more rows

segments = shape_segments |>
    distinct(segment_id) |>
    mutate(segment_wkt = segment_id) |>
    st_as_sf(wkt = "segment_wkt", crs = 4326)

Analysis workflow, find number of trips per segment

# trips per day
trips_per_day = gtfs_duke$trips |>
    count(service_id, name = "n_trips_on_service") |>
    inner_join(gtfs_duke$.$dates_services, "service_id") |>
    summarise(n_trips = sum(n_trips_on_service), .by = date)

# Example with how many trips run on a given date
n_trips_per_shape = gtfs_duke$.$dates_services |>
    filter(date == "2019-09-05") |>
    inner_join(gtfs_duke$trips, "service_id") |>
    count(shape_id, name = "n_trips")

n_trips_per_segment = shape_segments |>
    inner_join(n_trips_per_shape, "shape_id") |>
    summarise(n_trips = sum(n_trips), .by = segment_id) |>
    inner_join(segments, "segment_id") |>
    st_as_sf() |>
    arrange(n_trips)

# still a bit misleading because overlapping directional segments
library(ggplot2)
ggplot(n_trips_per_segment) +
    theme_bw() +
    geom_sf(aes(linewidth = n_trips, color = n_trips), lineend = "round")

^{Created on 2024-12-19 with reprex v2.1.1}

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GTFS data segmentation using R #219

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

GTFS data segmentation using R #219

subha-nair Dec 8, 2024

Replies: 2 comments · 2 replies

polettif Dec 17, 2024 Maintainer

idshklein Dec 17, 2024

polettif Dec 17, 2024 Maintainer

polettif Dec 19, 2024 Maintainer

Snap nearby shape coordinates to allow matching

Find overlapping segments of shapes

Analysis workflow, find number of trips per segment

subha-nair
Dec 8, 2024

Replies: 2 comments 2 replies

polettif
Dec 17, 2024
Maintainer

polettif Dec 17, 2024
Maintainer

polettif
Dec 19, 2024
Maintainer