k-Nearest Neighbor with PostGRES

artlembo
Apr 26, 2016
3 min read

Well, even though you didn’t ask for it, I decided to create a k-nearest neighbor analysis in PostGRES/PostGIS. This has lots of potential: find the 5 nearest ATMs to every fast food restuarant; find the 3 closest medical clinics to each school; or find the 10 closest bars to each college.

Now, the LIMIT clause does allow us to find the k closest features to a particular feature. But in this case, I’m not interested in a particular feature, I want the results for all features.

To do this in Postgres, I’m going to show you two things that I haven’t used until today: rank() and partition.

This entire query falls under a heading of a Window function in Postgres.

Modified from Postgres help:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.

Here is the code we’ll talk about:


SELECT * FROM
   (SELECT park_no1, park_no2, dist,
          rank() over (partition by park_no1 order by dist asc) as rank
    FROM
      (
       SELECT a.name AS park_no1, b.name AS park_no2,
              ST_Distance(a.geometry, b.geometry) AS dist
       FROM parks AS a , parks AS b
       WHERE a.name &amp;lt;&amp;gt; b.name
       ) AS T1
     ) AS T2
WHERE rank &amp;lt; 4

And our results are here:

Table 1. Each park is selected with it’s three closest parks along with the distance to the park.

Now let’s talk about it…

Getting the distances

Lines 6-10 is your basic SELECT statement where I get the distances from each park to every other park, and sort it in ASCENDING order (that way, we get the closest parks first). I add to that two little tricks I’ve shown you countless times:

renaming the parks table A and B to trick the SQL engine into thinking there are two tables
not performing the distance calculations when the names are the same (to avoid 0 distances).

Partitioning the data

This one was new to me. Partitioning is really cool and refers to splitting the resulting table returned by the query into smaller physical pieces. This is part of the Window capabilities in Postgres. In this case, I am partitioning the data into groups by the park_no1 value (the park name).

Rank the data

So, we’ve grouped out the data based on the park_no, so now we take those interesting grouped pieces and put a rank on them. The rank() function does just that. But Postgres wants to know what we want to rank. In this case, we need to use the OVER clause. The OVER clause determines exactly how the rows of the query are split up for processing by the Window function we are using. So, for each group of parks, we now have a rank from 1..N for the group like the following:

Table 2. Result of ranked query, showing each ranking for Auburn Park and their n nearest neighbors.

Table 3. Scrolling further down the table, you can see that Auburn Park is ranked from 1..N. When we get to Baker Park, we start again at 1.

So now we have this really big (nxn) X 4 table. But remember, we want the closest three parks. So, the thing to do is wrap the entire query in parenthesis, and then select all records with a rank column less than 4. The resulting table is shown below.

Table 4. Here we see each park, and their three closest parks.

Don’t worry if you don’t understand this. Like they say at the end of the TV show Dragnet: the names have been changed to protect the innocent. So, just copy this code in, and start changing names. If you are working with burglaries and not parks, then simply change the name of the parks table to burglaries. You can also change my unique name for the park with whatever your unique name is. The, when you have it working for your data, you’ll probably have a better shot at understanding it.

Want to learn more about writing SQL? Use a coupon code to take my course on Spatial SQL for under $30 right here.

Also, tell your friends. If I can get 100 people to take the course in the next month I’ll ramp up more posts like this 🙂

k-Nearest Neighbor with PostGRES

Recent Posts

Comments