Every geographer knows Tobler’s Law. Near things are more similar to things that are far away. From this, we often begin our discussions of spatial autocorrelation. I’m going to actually show you how to do join count and Moran’s I in SQL, but today I want to show how we can create a variogram using SQL. But, before you get too excited, I want to present a different take on the variogram, and that is the perspective of the geographer and not the geostatistician.
Here is the problem, the variogram is a really cool tool to assess the spatial relationship that features have with one another. But, usually only reserve the variogram for higher level geostatistics and most geographers run away screaming from it. What if instead we simply used the variogram as a descriptive measure for how a geographic feature correlations over space.
In my book Introduction to Statistical Problem Solving in Geography, we introduce the variogram in this simplified descriptive way. To do that, we illustrate the idea with the last spring frost dates in the Southeastern United States:
We show the students how a simple variogram illustrates how the last spring frost dates are spatially correlated from a descriptive standpoint. In our dataset, one can see that the data is spatially correlated up to around 400 miles, but after that, it becomes random.
Surprisingly (although if you have been reading this blog, not surprisingly), it is relatively easy to generate a variogram in SQL. The code is (assuming we have a set of point in a table called ‘p’ with an attribute field [lsf]):
SELECT dist,avg(abs(diff))/2 AS semivariance FROM (SELECT Floor(DistanceEarth(P.id,P3.id,”mi”)/50) AS dist, (p.[lsf] – p3.[lsf]) AS diff FROM p, p AS p3) GROUP BY dist
Once again, we will dive into the sub-query. We want to compare every point with every other point, so we need two tables. Issuing the FROM p, p AS p3 portion will treat the table [p] like it is another table called [p3] – sneaky. We are also computing for each point the distance between itself and every other point in the file in miles. In addition, we are looking at the differences between the last spring frost date and called that result diff.
As for the distances, we are determining the distance between each point, and dividing it by 50 and also using the Floor function. This is actually going to truncate the points at 50 miles (this is how we trick SQL into giving us lag distance.
Now, all of this is finally wrapped into the aggregate clause that shows the differences in distance grouped by the lag (multiples of 50) and their average differences. The query returns two columns which we use as a scatterplot to show our vairogram.