Note: This is the second part in a two-part series. The first section is located here: Programmatic Geometry Manipulation to Auto-generate Route Splines.
This is part of a four-part series related to a deep dive into Flocktracker Bogota data. The four parts in the series are:
- Cleaning and Analysis of Bogota Flocktracker Data
- Programmatic Geometry Manipulation to Auto-generate Route Splines
- Tethering Schedules to Routes via Trace Data
- Synthesizing Multiple Route Trace Point Clouds
Introduction to Part 2
In this post, we will be working with the same data from the previous post on converting trip trace data to route splines and paired GTFS schedule data. The intent of this post is to sketch out the logic behind how I’ve used the outputs from the route spline generation to auto-generate speed zones along discrete routes and use those to infer time cost which can then be used to generate GTFS for a given route.
When referring to the “current algorithms,” I will be referring to the state of the ft_bogota GitHub repo I created to support these analyses. The state of the current build of the sketch utility repo when this was written is at commit 10d3a9f5e88e8be36911dd8622ed1a8391d7a3fd. Reviewing the repo after this commit may reveal significant differences. It is important to understand this post more as a proposal with a functional sketch system built out than a finalized implementation of such a utility package.
A notebook that provides something for readers to follow along with exists here as a Gist. I would just like to warn that it’s very much a working document so please excuse the dust. For this post, we will be starting from In [585]. The top of the cell should look like this:
Introducing the order of operations
The following is a series of Python operations that are run on the trip_pairings dictionary. Essentially, these operations can be wrapped in a function do_something() and executed once for each key in trip_pairings.keys(), with key passed in as the argument.
The trip pairings variable was named unique_trip_id_pairs in the previous post and represents all related trip IDs for a given route, as determined via the route spline generation method (again, from the last post, so please read it first).
Description of initial operations for each key
First, we pull out the list of valid trip IDs (t_list = trip_pairings[tkey] + [tkey]), where tkey is the key from the trip pairings object whose keys are being iterated over. The list is built by concatenation because list.append() modifies the list in place and returns None.
We subset the original cleaned Bogota dataset and extract just the related trip ID rows.
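The two steps above can be sketched roughly as follows. This is a minimal illustration rather than the repo's actual code: the column name trip_id, the toy data, and the helper subset_for_route are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned Bogota data and the pairings dict.
cleaned_df = pd.DataFrame({
    "trip_id": ["T1", "T1", "T2", "T3", "T9"],
    "lat": [4.60, 4.61, 4.60, 4.62, 4.70],
    "lon": [-74.08, -74.07, -74.08, -74.06, -74.00],
})
trip_pairings = {"T1": ["T2", "T3"]}


def subset_for_route(tkey, pairings, df):
    """Return only the rows belonging to the key trip or its paired trips."""
    # Paired trip IDs plus the key itself, built as a new list.
    t_list = pairings[tkey] + [tkey]
    return df[df["trip_id"].isin(t_list)].copy()


# Run once per key, as described in the order of operations.
subsets = {tkey: subset_for_route(tkey, trip_pairings, cleaned_df)
           for tkey in trip_pairings.keys()}
```

Here the unrelated trip T9 is left out of the subset for route T1.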
Now, we are going to create a reference GeoDataFrame. This will hold all trip shapes for this route. Each will have a number of metadata attributes included in their row as well. These will be used to create averages of speed over segments.
Description of internal trip-unique for loop
Just like we did in the last blog post, there’s a fairly long-winded segment of this process that lies in a for loop. What I am going to do is show the whole process and tag each step as a “Part.” Each part will then have a subsection below that outlines what was done there.
Part 1
Here, we take the subsetted data frame of only relevant trip IDs and iterate through each unique trip ID in it. For each, we first build a single GeoDataFrame of just that trip. We can then get the distance between consecutive points.
Part 2
In this step, we take all those Shapely Points and calculate the distance between each consecutive pair for the trip. Note that this data frame is already sorted by date, which is why we do not need to sort it again. Pairing points with zip works because each subsequent point comes “after” the preceding one in terms of the time it was logged.
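As a rough sketch of the pairing idea, the snippet below zips a list of time-ordered points against itself offset by one. It uses a haversine distance in meters on plain (lon, lat) tuples, which is an assumption made to keep the example self-contained; the actual code works with Shapely Points and may measure distance differently.

```python
import math


def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lon, lat) pairs."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))


# Hypothetical trace points, already sorted by the time they were logged.
points = [(-74.080, 4.600), (-74.080, 4.601), (-74.079, 4.601)]

# The first point has no predecessor, so its distance is set to zero.
distances = [0.0] + [haversine_m(a, b)
                     for a, b in zip(points[:-1], points[1:])]
```

Padding the first entry with zero keeps the list the same length as the point list, so it can be attached as a column.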
Part 3
Similarly, we can use the pandas .shift() method to get the time difference between consecutive points. In the case of the Bogota data, they should all be 5 seconds apart, but this allows for the possibility that there may have been some technical error that caused one point to be omitted.
Once we have the time difference data, we can also calculate the speed by using the distance data from Part 2. This, the distance data itself, and the time are all added to the parent GeoDataFrame.
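A minimal sketch of the shift-based calculation, assuming hypothetical column names (date, dist_m) and made-up timestamps. The third point is deliberately logged 10 seconds after the second to mimic a dropped ping.

```python
import pandas as pd

# Hypothetical single-trip frame, already sorted by time.
trip = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01 08:00:00",
                            "2017-01-01 08:00:05",
                            "2017-01-01 08:00:15"]),
    "dist_m": [0.0, 50.0, 80.0],  # distance from the previous point
})

# shift() moves each timestamp down one row, so the subtraction gives
# the elapsed time since the previous point (NaN for the first row).
trip["time_delta_s"] = (trip["date"] - trip["date"].shift()).dt.total_seconds()

# Speed over each hop; the first row stays NaN because it has no predecessor.
trip["speed_mps"] = trip["dist_m"] / trip["time_delta_s"]
```

Because the time difference is computed rather than assumed to be 5 seconds, the 80 meter hop is correctly treated as 8 m/s instead of 16 m/s.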
Part 4
With the resulting GeoDataFrame, we want to make sure we do not have ridiculous results. Thresholds have been set to prevent any points from exhibiting excessive speeds. That said, this should not occur because I have cleaned the Bogota data and removed all outliers.
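The threshold check might look something like the following. The ceiling value and the column name are assumptions for illustration; the post does not state the actual thresholds used.

```python
import pandas as pd

MAX_SPEED_MPS = 35.0  # hypothetical ceiling, roughly 126 km/h

# Hypothetical speeds for one trip; the first row has no predecessor.
trip = pd.DataFrame({"speed_mps": [float("nan"), 10.0, 8.0, 250.0]})

# Comparisons against NaN evaluate to False, so the first point is
# dropped along with anything above the ceiling.
valid = trip[trip["speed_mps"] <= MAX_SPEED_MPS]
```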
Part 5
With this cleaned and prepared GeoDataFrame, we can update the final_overall_gdf reference by appending these new rows of processed trace data.
Identifying speed zones
At this point we want to get the route LineString shape for the target route and simplify it. In the case of the degrees-projected Bogota dataset, we simplified with a tolerance of 0.0025 (in degrees, since those are the coordinate units):
With these results, we can get a bunch of discrete segments (as shown in the image above).
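A small sketch of simplifying a route and breaking it into buffered segments with Shapely. The toy coordinates and the buffer distance of 0.0005 are assumptions; only the 0.0025 tolerance comes from the text.

```python
from shapely.geometry import LineString

# Hypothetical route in degrees; the second vertex is a small wobble.
route = LineString([(0.0, 0.0), (0.01, 0.0005), (0.02, 0.0), (0.02, 0.02)])

# Vertices that deviate less than the tolerance are removed.
simplified = route.simplify(0.0025)

# Each consecutive pair of remaining vertices becomes one segment.
coords = list(simplified.coords)
segments = [LineString([a, b]) for a, b in zip(coords[:-1], coords[1:])]

# Buffer each segment into a polygon that acts as a speed zone.
zones = [seg.buffer(0.0005) for seg in segments]
```

The wobble at the second vertex is well under the tolerance, so the simplified line keeps only the two endpoints and the corner.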
With each of these buffered segments (as shown, buffered, in the rightmost plot above), we can then get descriptive stats for all speeds in that zone:
In the above code snippet, I iterate through all rows from the prepared trip traces and pull out those that are in the speed zone. I take the mean and median and use the average of the two as the speed for that segment. This could be changed to whatever you as a user feel would generate the most appropriate descriptive speed for this zone. Depending on the amount of data that you have on hand, you could also compute separate peak hour and off-peak hour speeds for that leg of the route journey. In my case, all times are assumed peak.
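The summary statistic described here is simple enough to show on its own. This sketch assumes the speeds falling inside a zone have already been collected into a list; zone_speed is a hypothetical helper name.

```python
from statistics import mean, median


def zone_speed(speeds):
    """Average of the mean and the median of the speeds in one zone."""
    return (mean(speeds) + median(speeds)) / 2


# Hypothetical speeds observed inside one zone, with one fast outlier.
speed = zone_speed([10.0, 20.0, 60.0])  # mean 30, median 20, result 25.0
```

Blending in the median pulls the result back toward the typical observation when a few fast or slow readings skew the mean.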
Generating final costs
Remember that final_costs_gdf reference from the very beginning? Let’s begin to populate it.
First, we need to create a number of reference lists.
Each of the above lists will become a column in the resulting GeoDataFrame. The first 4 are all related to the original route points (which we will use as stops), their coordinates, and the trip ID they are paired with.
In the next segment, we iterate through the LineString and update the route shape reference as a GeoDataFrame with the summary speed and time costs between each of those coordinate points (stops for the GTFS).
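The time cost for each leg is just its length divided by the zone speed. The sketch below assumes lengths in meters and speeds in meters per second, which is a choice made for the example; the function name and the numbers are hypothetical.

```python
from itertools import accumulate


def segment_time_costs(lengths_m, speeds_mps):
    """Seconds needed to traverse each leg at its zone speed."""
    return [length / speed for length, speed in zip(lengths_m, speeds_mps)]


# Hypothetical legs between consecutive route points (the future stops).
costs = segment_time_costs([500.0, 1200.0], [5.0, 8.0])

# Running total of seconds from the first stop to each subsequent stop.
elapsed = [0.0] + list(accumulate(costs))
```

The running total is what later turns into arrival offsets for each stop in the schedule.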
In an attempt at quality control, we only add to final_costs_gdf if the total length is of a significant distance and is not just someone turning the app on by accident and walking around at the start or finish of their route trip.
An improvement that could be made here is to have intermediary stops along the line segments if they exceed a certain distance. I could actually introduce this upstream by taking the simplified geometry and breaking up component lines that are over a certain length threshold.
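One way that improvement could be approached is sketched below: a straight segment is divided into equal pieces so that none exceeds a length threshold. This is only an illustration of the idea, not something in the current repo, and split_long_segment is a hypothetical helper.

```python
import math


def split_long_segment(start, end, max_len):
    """Return evenly spaced points from start to end, inclusive, so that
    no piece between neighboring points is longer than max_len."""
    length = math.dist(start, end)
    pieces = max(1, math.ceil(length / max_len))
    return [(start[0] + (end[0] - start[0]) * i / pieces,
             start[1] + (end[1] - start[1]) * i / pieces)
            for i in range(pieces + 1)]


# A segment of length 10 with a threshold of 4 is cut into 3 pieces.
pts = split_long_segment((0.0, 0.0), (10.0, 0.0), 4.0)
```

The interior points would then serve as intermediary stops along the long leg.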
Generating the GTFS
Now that we have the time cost calculated for each of the routes, converting this into GTFS is pretty straightforward and more a matter of just complying with the GTFS format. In the following sections, I’ll show how to generate each of the required files and include any relevant notes.
Stops table
Again, we use the trip ID as the unique ID for a given route and just strip the “T” from the name. “X” and “Y” coordinate values were already broken out from the LineString in the above section’s for loop.
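A rough sketch of building stops rows from route points. The field names follow the GTFS stops.txt format, but the stop ID and stop name schemes here are assumptions made for illustration.

```python
def build_stops_rows(trip_id, coords):
    """One stops.txt row per (x, y) route point."""
    route_id = trip_id.lstrip("T")  # "T123" becomes "123"
    return [
        {
            "stop_id": f"{route_id}_{seq}",  # hypothetical ID scheme
            "stop_name": f"Route {route_id} stop {seq}",
            "stop_lat": y,
            "stop_lon": x,
        }
        for seq, (x, y) in enumerate(coords)
    ]


stops = build_stops_rows("T123", [(-74.08, 4.60), (-74.07, 4.61)])
```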
Routes table
Generating routes tables is also straightforward and more a matter of making sure that naming conventions are consistent with the other tables. I’ve put all these informal routes under the same operator, hardcoded as “FT_BOGOTA”. There is no informal bus route type, so they all were designated the standard bus type 3 for GTFS.
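For illustration, routes rows could be assembled like this. The agency ID and route type come from the text; using the stripped trip ID as both route_id and route_short_name is an assumption.

```python
AGENCY_ID = "FT_BOGOTA"
BUS_ROUTE_TYPE = 3  # GTFS route_type for bus


def build_routes_rows(trip_ids):
    """One routes.txt row per route, all under the same operator."""
    rows = []
    for trip_id in trip_ids:
        route_id = trip_id.lstrip("T")
        rows.append({
            "route_id": route_id,
            "agency_id": AGENCY_ID,
            "route_short_name": route_id,  # hypothetical naming choice
            "route_type": BUS_ROUTE_TYPE,
        })
    return rows


routes = build_routes_rows(["T123", "T456"])
```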
Shapes table
This table is easier than it might seem. All that needs to be done is to add each LineString point as a row in the data frame, link it to the route, and add a sequence number.
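A minimal sketch of that idea, using the GTFS shapes.txt field names. Reusing the route ID as the shape_id is an assumption for the example.

```python
def build_shapes_rows(route_id, coords):
    """One shapes.txt row per LineString vertex, in drawing order."""
    return [
        {
            "shape_id": route_id,  # hypothetical: shape shares the route ID
            "shape_pt_lat": y,
            "shape_pt_lon": x,
            "shape_pt_sequence": seq,
        }
        for seq, (x, y) in enumerate(coords, start=1)
    ]


shapes = build_shapes_rows("123", [(-74.08, 4.60), (-74.07, 4.61), (-74.06, 4.61)])
```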
Calendar table
Because we do not have enough data about operation, the calendar data just assumes a “GENERIC” default of having service every day for all routes.
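A calendar with a single every-day service could be written out as below. The field names follow GTFS calendar.txt; the start and end dates are placeholders, since the post does not specify them.

```python
import csv
import io

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# A single service that runs every day; dates are placeholder values.
calendar_row = {
    "service_id": "GENERIC",
    **{day: 1 for day in DAYS},
    "start_date": "20170101",
    "end_date": "20171231",
}

# Write to an in-memory buffer; a real run would write calendar.txt.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(calendar_row))
writer.writeheader()
writer.writerow(calendar_row)
calendar_txt = buffer.getvalue()
```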
Trips and Stop Times table
These are the only complex tables to create. They need to be generated in tandem because each trip needs to be paired with a unique schedule of arrivals and departures from all stops. As a result, for each new trip entry in the trips table, we need to create all relevant stop arrivals in the stop times table under that trip ID.
The below is the entire workflow. The loop should be fairly straightforward. You will notice that I’ve only produced one hour (8:00 AM) of stop time data. This could be adjusted according to what window you want to generate data for. Naturally, these aspects could be parameterized and all wrapped in a cleaner generator function. Before that is built though, more consideration needs to be given to how to handle different schedules for different routes based on more robust observation data (e.g. data covering a single trip or route over multiple days and on weekdays and weekends).
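To make the tandem generation concrete, here is a compact sketch of one possible parameterized version. It is not the workflow from the notebook: the 15 minute headway, the ID schemes, and the function names are all assumptions, and arrival and departure times are set equal at every stop.

```python
def fmt_time(seconds):
    """Format seconds since midnight as a GTFS HH:MM:SS string."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_trips_and_stop_times(route_id, stop_ids, leg_costs_s,
                               start_s=8 * 3600, window_s=3600,
                               headway_s=900):
    """Create one trip per departure in the window, plus its stop times."""
    trips, stop_times = [], []
    departures = range(start_s, start_s + window_s, headway_s)
    for n, depart in enumerate(departures):
        trip_id = f"{route_id}_{n}"  # hypothetical trip ID scheme
        trips.append({"route_id": route_id, "service_id": "GENERIC",
                      "trip_id": trip_id, "shape_id": route_id})
        clock = depart
        for seq, stop_id in enumerate(stop_ids):
            if seq > 0:
                clock += leg_costs_s[seq - 1]  # time cost of the leg just run
            stamp = fmt_time(clock)
            stop_times.append({"trip_id": trip_id, "arrival_time": stamp,
                               "departure_time": stamp, "stop_id": stop_id,
                               "stop_sequence": seq})
    return trips, stop_times


trips, stop_times = build_trips_and_stop_times(
    "123", ["123_0", "123_1", "123_2"], [100, 150])
```

With these inputs the 8:00 AM hour yields four trips, and each trip reaches its last stop 250 seconds after departing.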
Closing thoughts
The GTFS generation is still rickety but, as I hope is clear from this post, is pretty straightforward once we have the route data. I’d say that it is the route shape data that is most critical. Once we know where the routes are, getting and refining a schedule for these shapes is more a matter of having enough data than of the technical challenge of generating discrete references from an assemblage of partial data.
If you made it this far, thanks for reading!