Introduction
The intent of this post is to demonstrate how one might use a hierarchical system (in this case, S2 cells) to summarize discrete agent trips in a summary fashion that attempts to address privacy concerns by agglomerating sparse trace patterns at higher levels of aggregation.
That is, by reducing the accuracy with more coarse approximations where a plurality of location data is lacking, trace attributes can be preserved and summarized over larger areas where data volume is deemed sufficient for representation.
Details
This post outlines first how to create some synthetic trace data to work with, then how to execute the anonymization outlined in the above. Note that the intent is to broadly capture how this might be done, not to present this as a de facto method that should be replicated in a real world use case. Most importantly, I want to highlight the opportunity for there to be a function “inserted” that would allow for the threshold that determines sufficiency of data density to be calculated dynamically (and to be managed as a user-defined input).
What might be an acceptable level of accuracy and aggregation for one use case may not be appropriate for another.
Creating synthetic data
To create a synthetic trace dataset, I used this trip simulator and followed the README
, generally.
First, I pulled down a target city’s PBF. In this case, I pulled down St. Louis, Missouri. I trimmed the extract to a subset of the whole metro area, so as to just simulate traffic around the downtown area.
I then npm installed OSRM, the OpenStreetMap routing engine and processed the road graph.
Finally, I ran a configured trip-simulator
run on this subset of the road network with 100 agents to create a series of traces represented in the output trips.json
file.
Aggregating trip traces to hierarchical summary format
At this point, I now have trace data simulated for a given period of time. I want to take this trace data and aggregate the trips so that they exist at a level of detail that avoids exposing personal information based off of a threshold that is customizable to the use case.
To start, let’s read in the data as a GeoDataFrame in Python.
We can observe the result of this by plotting the data. In blue is the individual point data and in red are the line strings that are created by converting each of these traces to complete line string-traced trips.
The darker red lines have more trips associated with those segments.
Pairing traces with S2 cell
For each trace point, we want the smallest S2 cell associated with it. From that cell, we then want all that S2 cells parent cells (so from the highest zoom all the way out to the maximum zoom).
To do this, we will use a library s2sphere
that facilitates working with the concept of the S2 cell hierarchy.
Now we can look at each zoom level and see how many values are in each cell. We can create a rule system here that, for each zoom level, if there are enough points, we can assert that this zoom level for a certain S2 cell is “ok” to pass through to the output.
Note that we could also make this logic dynamic where, perhaps, thresholds are lower at lower zooms. For example, even if there is just one trace, we allow it to be rendered at a very coarse level (say, a cell id that is the scale of 100 square miles or something).
Now, each trace point has associated with it a list of parent cells at progressively lower zooms. It also has a register that indicates what zooms it is ok to show this trace at. From this, we can determine what the highest acceptable zoom is, based on the parameter we came up with and injected in the previous section.
Now we know what the zoom is (and, as a result, the index of the parent cell id from the parent cells list).
Since we have all traces in this GeoDataFrame, we want to know create paths that are based off the highest accepted zoom level for each trace for each unique trip or agent. In this case, we can start by doing this for each agent by using the vehicle_id
parameter which identifies each agent’s trips.
Rendering the result
At this point we can render all those trips as their S2 cell representation by converting the S2 cell ids into shapes and plotting those shapes.
Now, this result can be a bit visually overwhelming. As a result, it may help to just look at how one trip has been converted. Shown is the road network, the trace for a single vehicle id, and the associated S2 cell conversion.
Next steps
This is just a simple aggregation concept. In the end, segments of multiple trips that are common could be agglomerated amongst multiple sources and, from these sources, typical runtimes for that set of S2 cells could be generated.
From this, one might imagine that typical ETAs could be generated from a given S2 cell to another at certain levels and typical traffic patterns associated without needing to sacrifice privacy.
Additional steps, such as anonymizing agent trips or further distilling travel patterns could be pursued. Again, this is just a very high level concept of how one could begin to consider whole trips without risking personally identifiable information.