When More is Less…. lessons from processing large data files
My good friend Stuart Hamilton gave me a fun conundrum to try out. He has a file of province boundaries (400 areas) and lidar derived mangrove locations (37 million points – 2.2GB in size). He wants to find the number of mangroves that are contained in each area. He also want to know which country a mangrove location is in. An overview of the area is here:
but, as you zoom in, you can see that there are a tremendous number of points:
You would think that overlaying 37 million points with 400 polygons wouldn’t be too much trouble – but, it was. Big time trouble. In fact, after running for days in ArcGIS, Manifold GIS, PostGRES/PostGIS, and spatial Hadoop, it simply would not complete.
I created a spatial index, so, what was the problem? There are two main things going on:
1. The polygons had an enormous number of vertices in them – the selected polygon in t
2. Individual polygons covered a large geographic area – The imag
What was the solution? Well, knowing what I know about computational geometry, and how indexes work, I decided to decompose the polygons into convex parts – this took my 400 polygons and exploded them to over 700,000 polygons – see image on the right (Manifold GIS has a great tool that allows you to decompose geometries into Delaunay triangles or convex parts).
Dumb idea, right? Well, stay with me on this one…
By decomposing the polygons to convex parts, we achieved two things:
1. Polygons only had about 4 or 5 vertices – therefore, any polygon evaluated for a point-in-polygon overlay only required a couple of calculations.
2. The convex parts covered very small areas – with a smaller footprint, many, if not almost all the 37M points were no longer even considered, as they fell outside of the bounding box of an individual area.
So, going from 400 polygons to 700,000 polygons actually reduced the computation time, for reasons I explained above. I teach this concept in my GIS Programming class every fall, but now I have a really dramatic example to demonstrate it on. I’m probably going to create a video lecture on it, as I think it represents a “best practice” for dealing with large data.
Are you sitting down? I ran the analysis in the Radian engine, and it completed in 40 minutes. That’s days to under an hour, just by exploiting some of what we know about how point-in-polygon and spatial indexes work. I had similar results using spatial Hadoop. For the Radian engine, I issued the following SQL command:
-- $manifold$ SELECT [stubound Table].[ID_2],[A2012 Table].[pointid] INTO stucontain FROM [stubound Table],[A2012 Table] WHERE GeomContains([stubound Table].[Geom],[A2012 Table].[Geom (I)],0.001) THREADS 4;
so, the query ran using the 4 CPUs on my computer.
Oh, and did I mention I have two new books out about spatial SQL in Manifold and PostGIS?