Continuing on with my work in querying spatial data in Azure SQL, I got stuck on the creation of a spatial index, which has been pointed out as being essential to the performance of an application making use of any kind of spatial data. This much is obvious, since the speed of most queries made on my database of local government areas falls in the ballpark of "slow enough to make you want to scream at your phone".
Spatial indexes in Azure SQL, for whatever reason, must be made on a *clustered* primary key. In other words, the primary key of your table MUST be the geometry or geography object on which you want to create an index on. This poses a challenge in the current Azure SQL runtime as it requires a clustered index to exist at all times, and that clustered index cannot change after initialisation. This means you need to actually pull the entire database down to a local instance of SQL Server, create the index on the local instance, and then re-deploy your (potentially huge) database table to the Azure SQL instance. Amazing.
Anyway, there are guides abound that teach you how to migrate a cloud based database to your local machine and back; it's a fairly easy (if slow and mundane) process. Just remember to import to a database running MSSQL 11.x, and not 12.x since Azure SQL instances will fail to migrate to later versions of MSSQL.
Tuning SQL Spatial Queries
So it now comes down to how to write and tune your spatial query, which is a somewhat involved process. I recommend watching this video before you start thinking about writing spatial queries, or you'll get scared quite quickly and run away (like I did).
Also of interest is this massive blog post from '11 which coincidentally discusses spatial indexing with case studies of Australian LGA datasets - the same datasets I'm using!
One of the important factors in increasing efficiency of my queries is the use of a bounding box, which filters out results that literally do not fit into a specified box. It was surprisingly difficult to find the bounding box of NSW, but a rough one can be calculated from this site:
- X-min: 141
- Y-min: -37.5
- X-max: 154
- Y-max: -28
Note: these are the values that I use for my bounding box, and are a rough approximation of the actual values.
I made a reasonably effective (if basic) spatial index that decreases query times by approximately ~50% on my datasets.