Introduction
The purpose of this post is to document a pattern for improving the speed of performing spatial intersections in Spark. For these examples, I will use PySpark. When working in GeoPandas, generating an R-tree spatial index and using that to improve intersection speed is a pattern well documented by posts such as this one.
In this pattern, a spatial index is generated from a point array and that is intersected with the filter polygon. The subset from that then has a “full” intersection performed against it. This reduces the number of point-to-polygon full intersections that needs to be performed.
Above: This is the pattern suggested for generating an R-tree spatial index to improve intersection performance.
Working in Spark
If we were to approach the same problem with a big dataset of points in PySpark we would likely create a UDF that would be applied row-wise. Each row would have an intersection performed between the point and the polygon.
Above: An example of what a Py function that could be used as a UDF might look like that would be applied row-wise.
Setting up a PySpark test
For the purpose of this post, let’s use the following imports:
We will also create a reference function that helps us generate an example filtering polygon. This happens to be one around downtown Denver. We will want to filter to just points in this area.
Finally, we will have a helper method that creates a series of N (1 million) random points. We will use a smaller number just for the sake of demonstration here.
Simple version of intersection
The simple version of the intersection performed on a PySpark data frame could be performed like so:
The results from this run on my local machine were as follows:
There are 2 inefficiencies here:
- Each row has to have a Python instance spun up that creates 2 Shapely objects that are then intersected and the result returned.
- All rows are processed, none are skipped.
Quadkey as a poor-man’s spatial index
One quick way to reduce the number of intersections is to find ways to partition your data. Imagine if all these points were all the points in the US, or all the points in Colorado. In either case, we want to filter down to a more fine grained area.
Preparing the data frame with quadkey indexing
One way to do this is to find the quadkey/s that contain the polygon filter. In this case, the z12 (zoom 12 quadkey) for this filter polygon is 023101030121
. We can now pre-process our points data frame to have a column that states the quadkey that is associated with each point.
Above: Here’s how we can prep the data frame to have an index column that has the quadkeys associated with each point.
Using the quadkey column with intersections
We can now use the quadkey to filter down the number of points that need to be checked in detail (with a “full” intersection). These sorts of string checks are way cheaper than the UDF with the full intersection. We can vastly improve the runtime for larger datasets with this intervention.
Is there more we can do?
We can keep taking this further if we want. We can come up with a new UDF that handles a group of x/y coordinates after a filter is produced. The advantage here is to reduce the number of rows that a UDF needs to be applied on. Such a solution requires that the filtering polygon and the segmentation of points is small enough or controlled enough to produce rows have a group of point coordinates that is small enough to be held in memory by one of the workers doing this processing.
Now, we can update our logic to first filter down to the quadkey of interest, and then to further subdivide within that quadkey and break the sublist into groups. These groups can then be fed to workers and each can have a list of points intersected with a polygon. In fact, it may be possible to even bring back the original example of a spatial intersection being created and leveraged when performing intersections between a polygon and a cluster of points at this point.
Conclusion
When working with large/big datasets, find ways to filter data down to a more reasonable size as soon as possible. With spatial intersections, marking the quadkey of each point might help with a spatial intersection by allowing you to quickly remove immediate outlier points without having to construct a spatial index or otherwise track and index all the points that lie outside the target quadkey/s that encapsulates the polygon being used to filter the points.