Classic ARC/INFO Commands in PostGIS: k Nearest Neighbors

artlembo
3 minutes ago
3 min read

In previous posts we determined the nearest neighbor and replicated the NEAR command. Sometimes, the nearest neighbor isn't enough. Perhaps we want the three nearest neighbors. We'd call that k nearest neighbor, with k, being the number of nearest neighbors we want.

Looking back at the figure from my textbook An Introduction to Statistical Problem Solving in Geography, we have our 7 points:

Instead of the nearest neighbor, we might be interested in the 3 nearest neighbors. As far as I know, ArcGIS Pro does not have this function, but I thought it would be worthwhile to show how you can achieve the result with Postgresql/PostGIS. This query has a new wrinkle to throw in using the ROW_NUMBER() and PARTITION clause, and I'll explain it below. These functions are really important when you need to perform ranking, selection, or analysis on subsets of data, rather than on the entire dataset as a whole. This enables more granular and targeted data manipulation within your queries. In our case, we want subsets of the 3 closest points for every point. Here is the query, and we'll analyze it from the inside out:

Lines 4-7: are returning the point-to-point distance of each point to every other point. Line 8 has some confusing SQL statements you may have not seen before. Let's look at it from the inside out:

Lines 9,10: In PostgreSQL, PARTITION BY and ROW_NUMBER() are used together within what are called window functions to assign sequential numbers to rows within defined groups of data. Specifically, PARTITION BY is a clause is used within window functions to divide the result set of a query into logical groups called partitions. So, in this case, we are partitioning the a.point value into a sequence of results based on the shortest to farthest distance (due to the ORDER BY). Therefore, each block of points by the point ID (a.point) are treated as a partition. ROW_NUMBER() assigns a unique, sequential integer to each row within its respective partition. In our case, the partition for point = 1 has 6 rows representing the distances to each of the other 6 points. Then, point = 2 does the same thing, all the way to point = 7. So, if we are partitioning the data in order and calling it rn, we are returning those row numbers.

Line 14: we'll skip ahead to the WHERE clause where we obtain the partitioned group where the row number (in this case, rn) is less than 3. So, this returns the first 3 records for each point number.

Line 1: here we are getting both point numbers and the minimum distance for the returned window function for the top 3 records in each partition.

And here you can see the resulting table with the 3 closest points for every point in the data set.

If this seems like magic to you, don't worry. It was magic to me, too. In fact, when I had to write the query again for another set of data, it was hard because I wasn't comfortable with the ROW_NUMBER() function. It is such a powerful tool, with many useful applications, but you are going to have to use it under a number of circumstances for it to finally make sense.

if you want to learn more about spatial SQL in Postgres/PostGIS, you can order my book How do I do that in PostGIS: Illustrating classic GIS tasks.