Just a quick follow-up to my big data test. If you remember, I was able to determine the number of taxi pickups and the sum of the fares for each zone using Postgres and PostGIS in 1m 40s. Some of the taxi zones are a little large, so the containment query might actually take a little longer when comparing the bounding boxes in the spatial index. To get around that, I used ST_SubDivide to break the larger taxi zones into smaller polygons:
this meant that my taxi zone polygons went from 263 to 4,666. Now, on the face level, what idiot would do an overlay with 4,666 polygons when 263 is smaller – this idiot! To understand this, you should read my blog post on When More is Less, you’ll see there is good logic behind the madness. Well, anyway, that’s what I did, and we went from 1m 40s down to 1m 3s.
For those of you interested, I broke the zones up as follows:
SELECT ST_Subdivide("Geom", 50) AS geom, zone
into taxisub FROM taxizones;
CLUSTER taxisub USING geom_idx;
and yes, CLUSTER once again made a big difference.
I guess I should explain the SQL this time around, as it enables us to do some clever things. Remember, taxisub has 4,666 polygons because it has subdivided the 263 polygons in taxizones.
SELECT taxizones."Geom" AS geom, count(id) AS numrides, sumfare, a.zone
INTO sumtable
FROM taxizones,
(SELECT taxisub.zone, sum(taxisandy.fare_amount) AS sumfare
FROM taxisub
JOIN taxisandy
ON ST_Contains(geom, pu_geom)
GROUP BY zone) AS a
WHERE taxizones.zone = a.zone
In the above query, the inner portion is the straight-up SQL to determine the total rides and sum of the fares in each polygon in taxisub. However, taxisub has over 4000 polygons – we don’t want to write that out. So, the outer portion of the query is joining the original taxizones (the one with only 263 polygons), and writing it out to a final table.
Again, if you want to learn how to do more spatial SQL like this, check out my courses at http://www.gisadvisor.com.
Comments