Wednesday, 30 August 2017

Simplify Spatial Release V1.0

I originally wrote the Simplify Spatial tool in 2012 after working on a large enterprise data warehouse project at a utilities provider. The company had a large number of assets with geospatial information; individual points, standard regions such as suburb/government area/city/state, and custom regions based on restrictions and distribution. There was nothing complicated about the setup, the individual assets (points) were to be tagged and grouped by the regions (polygons) to allow for simplier management and operational reporting.
I was shocked by the performance of the major RDBMS engines for such simple calculations over what were not that many rows of data. The combination of points and number of joins required to the different polygon columns resulted in a processing time >24hours for a daily refresh. Totally unacceptable but ‘as designed’ when queried with the provider of the Microsoft Fast Track architecture. The result on that site was to defer the computation to ArcGIS and simply consume the output CSV.
I spent a number of months investigating how the database engines process spatial queries and discovered a way to supercharge them by refactoring the spatial data into a format that is easier to consume by the database engine. I’ve successfully used this on a number of projects while waiting (and hoping) for the database providers to improve their solution so I can go back to using the features out of the box. Unfortunately this hasn’t happened, and when I recompiled the solution for SQL Server 2016 and tested it I found that while the base performance had improved, so had Simplify Spatial and it is is still many times faster. So I’ve decided to release it for all.
The suite of libraries is available from the Downloads tab.
What follows is an excerpt from a presentation from 2012


Results

Test CaseSQL Server (2012)PostGRESOracle
Original
Optimised
Original
Optimised
Original
Optimised
Alaska (34k points returned)
40 sec
1.2 sec
2.8 sec
0.4 sec
10.8 sec
2.2 sec
Colorado (49k points returned)
21 sec
1.8 sec
2.2 sec
0.8 sec
5 sec
2.1 sec
C% (190k points returned)
2:02 min
5.8 sec
12 sec
2.9 sec
1:10 min
6.5 sec
All (2.1m points returned)
1:14:58 hr
48 sec
5:30 min
34 sec
50:26 min
1:20 min
The performance benefits were across the board. Any situation that involved a large number of points mapping to a smaller number of very complex polygons showed performance benefits of x10-100 with no loss of accuracy. It wasn't limited to SQL Server either.
The following source data was used for the test cases:
Polygons - the 2012 TIGER/Line Shapefiles for the polygon data http://www.census.gov/cgi-bin/geo/shapefiles2012/main
Points - main geonames table for the US http://download.geonames.org/export/dump/US.zip
The test cases were performed on the following RDBMSes on a W7 x64 machine with a quad core CPU and 8gig of ram.
SQL Server 2012 (Microsoft SQL Server 2012 (SP1) - 11.0.3373.0 (X64))
PostGRES 9.2 (x64) with PostGIS 2.0 v2.0.3
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production

The data was imported into the above DBs as geometry and the relevant spatial indexes applied.
I ran the polygon splitting function across the US States table on the SQL Server instance and persisted it in a new physical table. This table was left as an unindexed heap. I then exported this table to the Oracle and PostGRES databases.
The tests involve running a set of simple intersection queries between the geonames US landmark point data and both the original indexed US States table and the simplified polygons in the heap.
All queries were of the form
SELECT count(*)
FROM <US_state_table> t1
INNER JOIN <geonames_landmark_table> t2
ON <Intersection between t2.geom and t1.geom is true>

This limited the test to the internal processing of the spatial join and removed any overhead from parsing/displaying large amounts of data in the query tool.

Test Cases

The geonames landmark data was joined against the following sets of polygon data.

Alaska

14033 vertices. Multi-polygon with 50 individual geometries.
Purpose: Test performance of MultiPolygon geometry with complex shape and large gaps between individual geometries.

Figure 4 Original

Figure 5 Simplified. 60 individual polygons.

Colorado

8110 vertices. Single polygon
Purpose: Test performance of single geometry rectangular polygon

Figure 6 Original

Figure 7 Simpified. 16 individual polygons

C% Filter

Includes Colorado, California, Connecticut and Commonwealth of Northern Mariana Islands (not shown, too far left).
Vertex counts range from 818 to 11131 for each of the geometry objects.
Purpose: Test performance of over a small set of simple and complex polygons.

Figure 8 Original

Figure 9 Simplified. 51 individual polygons

All States

56 geometries ranging from 400 to 44000 vertices. Combination of single and multi-polygon geometries.
Purpose: Large scale test

Figure 10 Original

Figure 11 Simplified. 2226 individual polygons.

Query Times

Test Case SQL Server (2012) PostGRES Oracle
Original
Optimised
Original
Optimised
Original
Optimised
Alaska (34k points returned)
40 sec
1.2 sec
2.8 sec
0.4 sec
10.8 sec
2.2 sec
Colorado (49k points returned)
21 sec
1.8 sec
2.2 sec
0.8 sec
5 sec
2.1 sec
C% (190k points returned)
2:02 min
5.8 sec
12 sec
2.9 sec
1:10 min
6.5 sec
All (2.1m points returned)
1:14:58 hr
48 sec
5:30 min
34 sec
50:26 min
1:20 min
This is not meant as a comparison between the different RDBMS spatial offerings, but a comparison of performance between the original polygons and the simplified versions on each RDBMS
Every query I ran against the simplified polygons was magnitudes faster than and just as accurate as the resultset against the original state geometry! The returns diminished when the number of individual polygon slices approached the number of geometry points.

SQL Server 2016 (updated)

The above test cases were re-run on SQL Server 2016 (SP1 CU3) running on a D12_V2 VM on azure (4 core 28GB RAM).
Test CaseSQL Server (2016)
Original(1st run)
Optimised (1st run)
Original (2nd+ run)
Optimised (2nd + run)
Alaska (34k points returned)
21 sec
3.2 sec
13.9 sec
1 sec
Colorado (49k points returned)
13.6 sec
9 sec
9.3 sec
0.8 sec
C% (190k points returned)
45.3 sec
7.6 sec
38.3 sec
2.9 sec
All (2.1m points returned)
8:27 min
27.1 sec
8:24 min
16 sec
The improvements in SQL Server 2016 spatial have led to performance increases in both the vanilla engine that are siginificant over the original timings, but those improvements have also led to the Simplify Spatial tools running faster as well!. The Simplify Spatial tools scale in the ‘good’ way, meaning the more complex (and slow) the query is on the vanilla engine, the greater % improvement when using the Simplify Spatial tools.