In addition to SQL, I also am interested in processing large volumes of spatial data. One of the newest rages in “big data” is Hadoop. According to Wikipedia:
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
One way this is implemented is a programming model called MapReduce. Don’t get too excited, it doesn’t have anything to do with maps or GIS – but, it is very clever and powerful for certain types of problems. The concept is if you have a really large dataset, you divide and conquer that dataset in a number of steps. For example, say we wanted to know all the people with the name “John” in the phonebook, and say we had 26 computers in a cluster – we might solve this by:
1. Use each computer (1-26) to find all the “Johns” for the first letter in the last name (A-Z). That way, you have effectively broken the problem into 26 smaller units.
2. Once each computer has counted up the number of Johns, you have reduced the dataset (hence, MapReduce) to 26 variables.
3. Now, count up the total of the 26 variables.
That is an oversimplified version of course, but it helps to illustrate what we want to do. I understand that the University of Minnesota has created a set of functions called SpatialHadoop. I want to test this over the summer, but for now I decided to create my own poor man’s version using PostGRES.
I have 37 million points, and 240 polygons. For each point, I want to find out what polygon the point is in. Easy enough. However, the data size is prohibiting. After running the process for days in ArcGIS, Manifold, QGIS, and PostGRES, the process either did not finish, or just crashed the computer. Our traditional ways do doing this just wasn’t working.
I figured that if I could somehow replicate the MapReduce problem, we might be in business.
I decided to try and solve this problem with PostGRES and PostGIS. My initial test was to work with 1 million records. If it worked, I would try scaling it up to 37 million.
Now, PostGRES is not a multithreaded application. Meaning, if you have 8 CPUs on your computer, PostGRES is only going to run in one CPU. But, if you run 5 instances of PostGRES, each will run in its own thread on a CPU.
Therefore, I created an SQL script that performed the spatial containment query on a portion of the data and wrote the results to a table. For instance, this query will INSERT the point ID and polygon ID into a table for each point contained in the polygon:
INSERT INTO testtable SELECT points.pointid, polygons."ID_2" FROM points , polygons WHERE st_contains(polygons.geometry,points.geometry) AND points.pointid BETWEEN 15000000 and 15100000
The above query effectively performs the containment query on only 100,000 records. I then ran the query in 5 different threads, each evaluating a 100,000 chunk of points. So, the BETWEEN clause was changed to 15200001 AND 15300000, and so on. Therefore, the 5 separate threads processed all 500,000 records, 100,000 (1/5th of the data) records at a time.
I noticed a couple of interesting things: the change in speed and the utilization of the CPU.
It turns out that running multiple instances of the SQL statements completes the job faster than only running one instance. For example, running the above query on 500,000 records
INSERT INTO testtable SELECT points.pointid, polygons."ID_2" FROM points , polygons WHERE st_contains(polygons.geometry,points.geometry) AND points.pointid BETWEEN 15000000 and 15500000
completes in 365 seconds.
But, running 5 instances of the query takes 130 seconds!!!
As expected, the CPU usage when one instance of PostGRES was running showed that only a single CPU was firing at its maximum. However, when I ran the five instances, you can see that many more of the CPUs were firing.
This showed me that we can actually maximize the thread use with PostGRES for completing a computationally intensive spatial task by breaking the problem into different chunks.
The main bottleneck was inserting into the table. As you can see from the figure above, the execution time increased in each window, with a maximum of 358 seconds – I don’t believe that the data was running serially, but rather the INSERT portion of the query had to wait it’s turn to update the table.
Where to go next
I’m not ready to tackle the 37 million points just yet. I want to see if there are some ways to speed this up even more. In my next post, I will use an UPDATE statement instead of the INSERT statement to see if there are any differences in speed. After that, I want to try some other tasks, like determining how many points are in the polygons. This will allow us to run the query in five threads, obtaining 5 result tables, and then running a final query to add up the results of the five tables. The question on my mind is what scenarios give us the best opportunities to parallelize our work activities.
If this looks promising, I have 24 computers in my lab, and if I use 5 threads for each computer, that gives me 120 threads to run the query.