In this blog post I will outline a methodology for calculating the straightness of a transit systems routes. The method outlined here will also allow for the ranking of routes by their straightness, as defined by a method that is compartmentalized and could be easily tweaked and updated with improvements later on.
Importance of Straightness
Transit Center blogged and tweeted last year the phrase “Straighter is Greater.” It was in response to the presence of convoluted (and thus ineffective and potentially wasteful routes) routes in transit system all across the US and internationally. Citylab did a nice piece on it in the Fall of 2016. The article references a 2013 Transportation Research Board (TRB) paper that noted that three-quarters of the studied transit agencies in the report tried adjusting their bus routes in response to scheduled and performance related issues.
In light of the “Straighter is Greater” push, I thought it might be helpful to make a tool that ranks routes by straightness. This same tool could also be used to compare across systems and cities, too.
The example shown in this blog post allows for the analysis of a single GTFS feed. This script could easily be employed to, say, create a ranking of straightness amongst all operators who publish GTFS feeds. That said, the intent here is to outline how to utilize open source tools to extract, clean, and load GTFS schedule data into a NetworkX network graph to enable per-route and per-route-shape level straightness analysis.
For the purposes of this exercise, you can use any GTFS zip feed available. TransitLand has a great API that allows you to easily discover the latest valid feed from operators all over the world. For this post, I’ll use San Diego’s transit feed since I grew up there. The below script will pull down San Diego’s GTFS feed and convert it into a series of Pandas dataframes.
Great, now that we have a Pandas representation of the schedule data, we can infer nodes and edges from that data and create a network graph. Because we are dealing purely with locational data (that is, the shape of the route) data such as schedule times will not be relevant for this exercise.
Creating the Network Graph
First, let’s generate a sorted list of unique route ids.
Just in case you do choose to use San Diego’s feed, your output should match the above if you print the contents of
sorted_unique. This information is valid as of July 30, 2017.
Now that we have each route, we will need to separately extract each shape related to that route. Routes in GTFS can have multiple shapes. For example, there could be a separate inbound and outbound path for a given line.
all_rts_shapes now has keys for each route id, each of which is a dictionary itself holding the keys for each shape id within each route. Each of these is associated with an array of points along the shape of the route path holding associated metadata. This information will be used to create both the nodes in the transit system, as well as the edges between these nodes for each route.
We can now create a directed graph representing the transit network. The resulting network is simply created, with nodes being applied by a per-shape basis. That is, if there are spatially identical nodes between shapes or routes, we do not account for that in this method. These operations could all absolutely be refactored for a more performant and contextually aware network. That said, for the purposes of generating transit straightness rankings, this should do for now.
Running Bearings Analysis
Now that we have a network graph in memory, we can plot it and sanity check our work so far. I decided to use OSMnx to do the plotting as it’s styles are nice and it makes handling reproductions and other styling issues we will be encountering downstream a breeze.
With just the above line, the below image will be generated. This is an unstyled plot of the transit network in San Diego. You can see how the city hugs the coast, with inland service fanning out widely the farther east into San Diego the routes go.
Once we have the entire network available, we also open up other analyses to be performed easily. Again, I recommend OSMnx for some really fast, effective exploration of the network representation of the GTFS feed. Although it was designed original to really with Open Street Map data, GTFS lends itself to an almost identical use case.
The below operation allows you to visualize the distributions of bearings for each route. I think this is particularly interesting because, in a more strictly grid environment (like, say, Midtown Manhattan), 4 sharp peaks would emerge as the road network tends to be quite consistent. On San Diego’s transit network, however, the north-south tendencies remain sharp (the two steep peaks) while east-west distributions are muddied and less discrete. The below script will generate the above histogram.
Just like we generated the histogram, we can add this metadata to each edge in the network and plot the data accordingly.
In the above image, we see a histogram of San Diego’s overall network bearings distribution. As you can see, the peaks are a bit more pronounced running east-west than when compared with the transit networks. Anecdotally, this is reflected in what I have observed over my life riding on San Diego bus and light rail system - effective east-west bound commutes are a rarity at best.
Above, a less-useful plot of the bearings for all edges in the road network for San Diego. At this scale, this plot loses much/any usefulness.
Generating Straightness Scores
Below is the output we are headed for with this section:
What does the above mean? Each section is a list of the most straight and least straight transit routes in San Diego’s GTFS feed, respectively. The second value next to each route id is a score that is calculated. How this is calculated will be explained in the below. The final item represents the average number of discrete “runs” that each route has for each of its shapes (each path it can take for a given complete trip for that route). Each discrete run represents a period of uninterrupted “straightness.” As you might expect, less straight routes have more segments because sharp turns in the route break up the straight segments into (more) smaller ones.
In order to begin assessing straightness we will first need to organize the edges outside the graph. There’s definitely better ways to do this, but this is again a prototype and performance was not a concern (nor was type safety, testing, error handling, or DRY code). That said, I still believe this does an acceptable job of demonstrating the proposed methodology encapsulated in a script that is actually functional.
Now that we have the relevant edges with shape ids nested, in a reference dictionary, we can set a global
TURN_DEGREES_THRESHOLD value. With this value, we can make an argument for what we define to be a turn that triggers a new segment, in terms of degrees. For today, we will set that threshold at 30 degrees. So,
TURN_DEGREES_THRESHOLD = 30.
With this information, we can now execute a loop through all routes and shapes to generate a straightness by route score. The score will be calculated by first capturing the bearings for each edge in the path, ordered by event sequence along the route.
Next, a boolean array is generated to reference whether a sharp turn has occurred between each segment.
This is probably the grossest part of this script, right below. Either way, the gist of this segment is that it breaks up the shape into discrete straight segments and counts up the number of turns.
At this point we can create some shape level summary values. These are then appended to the overall
segments_counts arrays for the route under review. Each shape gets a straightness score which is the reciprocal of the count of segments in that shape. Typically all shapes are roughly the same and will result in the same score, but this is intended to account for any variation that may occur on a route going, say, one direction or the other.
Once this is done, we can create some route level scores based on the data we extracted from its underlying shapes. The below method demonstrates the (pretty lame) algorithm being used. All we do is take the mean of all the shapes’ independent scores and allow for the squared proportion of the longest continuous straight run in the route to the length of the route, cumulatively. The resulting score is between 0 and 1 but is valuable only in its relative standing to other routes or systems.
Let’s take the top five elements and the bottom five and print them, like we did at the top of this section. By simply performing a list sort on the dictionary items, we can order the results from highest to lowest score for the routes in San Diego.
The rankings are cute and all, but we did a bit of work to get here. So, let’s plot these results and get something visual we can show off our work with!
The above image is generated using the “viridis” color scheme. This scheme shows brighter routes as those higher scoring and darker routes as those with lower scores.
The viridis color scheme runs purple to yellow and, as can be seen the plot, yellow correlates with more straight routes (those with the higher scores).
In order to create this image, a new graph must created. In the new graph, edges include an
rs_score attribute which is the route score that the edge is associated with. Using this attribute, OSMnx has a wrapper around Matplotlib that will generate the above plot nicely. Below is the script to do so.
The primary point of this blog post is to demonstrate a methodology. The scoring mechanics need to be refined but, at this point, that is easily achieved by just updating the scoring logic. By that point, we have much of the information we may need for any adjustments to how we evaluate the shape of the routes.
One quick way to review the results, beyond the scores list, is to plot the head and tail of the leaderboard for route straightness as I calculated it.
Top 5 routes: Routes 510, 530, 520, AIR, 5
Bottom 5 routes: 891, 892, 888, 894, 11
Results seem to make sense from just looking at the above. Longer routes that provide extended service rather than ones that slip around corners in downtown seem to fair better through these metrics. But, downtown routes do not fall too far behind. Rather, it is the far flung east San Diego rural service routes that result in the poorest scores as a result of these metrics.
Want a notebook profiling all these steps in order? Available here.
Updates from the Twitter-sphere
Thanks to all those who read this article and made comments online. In particular, I wanted to highlight this response which pointed to an academic journal on urban network circuity. I’d suggest clicking on this tweet to view the whole thread, which includes a direct link to the article.
The article dives far deeper into the observation I made in the above post with regards to how San Diego’s road network demonstrated greater consolidation of bearing directionality along east-west cooridors than did the transit network. In the paper, circuity is analyzed for both trip and transit routes and their relative disparity is then scored against other cities in the US. My favorite takeaway from the article was the observation that Houston has a relatively circuity relationship between autos and transit compared to other American cities.
Next, I’d like to set this up to run for all feeds in all cities in at least the US. This isn’t so easily done as there are tons of edge cases when it comes to consuming GTFS feeds and I’ll need to update the script accordingly. I spent some time to do so for New York. The results for Brooklyn are shown below:
Swapped out SD > Brooklyn. In BK, routes down the borough core fare better than inter-borough routes.— Kuan Butts (@buttsmeister) July 30, 2017
Next (time permitting): all US feeds! pic.twitter.com/0jYOVysmfG