---
layout: post
title: Is GeoJSON a spatial data format?
date: 2023-11-11
categories: gis
---
# Is GeoJSON a good spatial data format?

A few days ago on Mastodon [Eli Pousson](https://fosstodon.org/@[email protected])
asked:

> Can anyone suggest examples of files that can contain location info but aren't often considered spatial data
> file formats?
>
He suggested EXIF, Iván Sánchez Ortega (@[email protected]) followed up with spreadsheets, and, being devilish, I said GeoJSON.

This led to more discussion, with people asking why I thought that, so instead of being flippant I thought
about it. This blog post is the result of those thoughts, which I assumed were fairly obvious but, from
things people have said since, maybe aren't.

I've been a developer for most of my career, so my main interests in a spatial data format are that:

1. it stores my spatial data as I want it to,
2. it's fast to read and, to a lesser extent, to write,
3. it's easy to manage.

One seems obvious: if I store a point and then ask for it back, I want to get that point back (to the limit
of the processor's floating point precision). If a format can't manage that, then please don't use it.
This is not common, but Excel comes to mind as a program that takes good data and trashes it. If it isn't
changing [gene names into
dates](https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates) then
it's [reordering the dbf file to destroy your
shapefile](https://gis.stackexchange.com/questions/132359/how-is-attribute-data-in-dbf-file-tied-to-shapefile-location-data-in-shp-file).
GeoJSON can also fail at this, as the standard says that I must store the data in WGS 84 (lon/lat). That is
fine if I already store my data that way. But suppose I have some high quality OSGB data, carefully surveyed
to fractions of a millimetre, and the underlying code converts it to WGS 84 in the background. Suppose
further that the developer wanted to save space and limited the number of decimal places to, say, 6 (OK,
[that was me](https://osgeo-org.atlassian.net/browse/GEOT-6650)). When the data gets converted back to OSGB
I'm looking at errors of centimetres (or worse), but given the vagaries of floating point representation I
may not be able to tell. The sketch below shows the round trip.
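
A minimal sketch of that lossy round trip, assuming `pyproj` is installed (the point is made up for illustration):

```python
# EPSG:27700 is OSGB36 / British National Grid; EPSG:4326 is WGS 84.
from pyproj import Transformer

to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
to_osgb = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

easting, northing = 651409.903, 313177.270   # a carefully surveyed location

# Write to GeoJSON: reproject, then truncate to 6 decimal places.
lon, lat = to_wgs84.transform(easting, northing)
lon, lat = round(lon, 6), round(lat, 6)

# Read it back and reproject to OSGB again.
e2, n2 = to_osgb.transform(lon, lat)
print(f"shifted by {abs(e2 - easting) * 1000:.1f} mm east, "
      f"{abs(n2 - northing) * 1000:.1f} mm north")
```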

Two comes from being a GeoServer developer: a largish chunk of the time taken to draw a web map (or stream
out a WFS file) is taken up by reading the data from the disk. Much of the rest of the time is spent
converting the data into a form that we can draw. Ideally, we only want to read in the features needed for
the map the user has requested (actually, ideally we want to **not** read most of the data at all by having
it already in the cache, but that is hard to do). So we like indexed datasets; both spatial indexes and
attribute indexes can substantially speed up map drawing. As the size of spatial datasets increases, the
time taken to fetch the next feature from the store becomes more and more important. An index allows the
program to skip to the correct place in the file, for either a specific feature, for features that are in a
specific place, or for features with a certain attribute matching the requested value. This is a great time
saver: imagine looking something up in a big book using the index, compared to paging through it reading
each page in turn.
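
A minimal sketch of the difference an index makes, assuming Shapely 2.x; the in-memory `STRtree` here stands in for the index a real format would keep on disk:

```python
import numpy as np
from shapely import STRtree, box, points

rng = np.random.default_rng(0)
geoms = points(rng.uniform(0, 1000, size=(100_000, 2)))  # 100k random points

tree = STRtree(geoms)           # build the index once, up front
window = box(10, 10, 20, 20)    # the extent the user actually asked for

hits = tree.query(window)       # indices of features whose boxes intersect it
print(f"touched {len(hits)} of {len(geoms)} features")
```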

After one or more indexes, the main thing I look for is a binary format that is easy to read (and write).
GeoJSON (and GML) are both problematic here as they are text formats (which is great in a transfer format),
so for every coordinate of every spatial object the computer has to read in a series of digits (and
punctuation) and convert them into an actual binary number that it can understand. This is a slow operation
(by computer speeds, anyway), and if I have a couple of million points in my coastline file then I don't
want to do 4 million slow conversions before I even think of drawing something.
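
A rough sketch of that cost, using only the standard library; the absolute numbers will vary by machine, but the ratio is the point:

```python
import array
import time

coords = [i * 1e-6 for i in range(2_000_000)]    # two million coordinates
as_text = [repr(c) for c in coords]              # how GeoJSON stores them
as_binary = array.array("d", coords).tobytes()   # how a binary format might

t0 = time.perf_counter()
parsed = [float(s) for s in as_text]             # parse digits one by one
t1 = time.perf_counter()
doubles = array.array("d")
doubles.frombytes(as_binary)                     # copy the doubles straight in
t2 = time.perf_counter()

print(f"text parse:  {t1 - t0:.3f} s")
print(f"binary read: {t2 - t1:.3f} s")
```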

Three, I have to interact with users on a fairly regular basis, and in a lot of cases they are not spatial
data experts. If a format comes with up to a dozen similarly named files (that are all important), yet a GIS
will refuse to process it unless you guess which one to open, then it is more of a pain than a help. And
yes, shapefile, I'm looking at you. If your process still makes use of shapefiles, please, please stop doing
that to your users (and the support team) and switch over to GeoPackages, which can store hundreds of data
sets inside a single file. All good GIS products can process them by now; they have been an OGC standard for
nearly 10 years. If you don't think that shapefiles are confusing, go and ask your support team how often
they have been sent just the `.shp` file (or 11 files but not the `.sbn`), or how often they have seen
people who have deleted all the non-`.shp` files to save disk space.

My other objection to GeoJSON is that I don't know what the structure (or schema) of the data set is until I
have read the entire file. The last record could add several bonus attributes; in fact any (or all) of the
records could do that, which from a parser's point of view is a nightmare. At least GML provides me with a
fixed schema and enforces it throughout the file.
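
A minimal sketch of the problem, using only the standard library; nothing stops any feature from carrying new properties, so the only way to know the columns is to scan every record:

```python
import json

doc = json.loads("""{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0.0, 51.5]},
     "properties": {"name": "A"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-2.0, 53.5]},
     "properties": {"name": "B", "height": 12.5, "surveyed": true}}
  ]
}""")

columns = set()
for feature in doc["features"]:          # every record can invent new columns
    columns.update(feature["properties"])
print(sorted(columns))                   # ['height', 'name', 'surveyed']
```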

When I'm storing data (as opposed to transferring it), I use PostGIS. It's fast and accurate, can store my
data in whatever projection I choose, and is capable of interfacing with any GIS program I am likely to use.
If I'm writing new code, then it provides good, well tested libraries in all the languages I care about, so
I don't have to get into the weeds of parsing binary formats. If I fetch a feature from PostGIS it will have
exactly the attributes I was expecting, no more, no less. It has good indexes and a nifty DSL (SQL) that I
can use to express my queries, which get dealt with by a cool query optimiser that knows way more than I do
about how to access data in the database.
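
A minimal sketch of that workflow, assuming `psycopg` (version 3) and a hypothetical `roads` table with a `geom` column in OSGB; the `&&` bounding-box operator is what lets the planner use the spatial index:

```python
import psycopg

with psycopg.connect("dbname=gis") as conn:      # connection string is made up
    rows = conn.execute(
        """
        SELECT name, ST_AsBinary(geom)
        FROM roads
        WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 27700)
        """,
        (400000, 300000, 410000, 310000),        # the map window, in metres
    ).fetchall()

print(f"{len(rows)} roads in the requested window")
```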

If for some reason I need to access my data while I'm travelling, or share it with a colleague, then I will
use a GeoPackage, which is a neat little database all packaged up in a single file. It's not as quick as
PostGIS, so I wouldn't use it for millions of records, but for most day-to-day GIS data sets it's great. You
can even store your QGIS styles and project in it, to make it a single file project transfer format.
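
A minimal sketch, assuming GeoPandas; `fieldwork.gpkg` and the layer names are made up. One file can hold as many layers as you like:

```python
import geopandas as gpd
from shapely.geometry import Point

sites = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[Point(-2.10, 53.50), Point(-2.20, 53.60)],
    crs="EPSG:4326",
)
sites.to_file("fieldwork.gpkg", layer="sites", driver="GPKG")   # one file...
sites.to_file("fieldwork.gpkg", layer="backup", driver="GPKG")  # ...many layers

print(gpd.read_file("fieldwork.gpkg", layer="sites"))
```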

One final point: I sometimes see people preaching that we should go cloud native (and often serverless) by
embracing "modern" standards like GeoJSON and COGs. GeoJSON should never be used as a cloud native storage
option (unless it's so small you can read it once and cache it in memory, in which case why are you using
the cloud?), as it is large (yes, I know it compresses well), slow to parse (and slower still if you
compressed it first), and can't be indexed. So that means you have to copy the whole file from a disk on the
far side of a slow internet connection. I don't care if you have fibre to the door, it is still slow
compared to the disk in your machine!
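
A rough sketch of the difference, with hypothetical URLs: a cloud-optimised format lets a client ask for just the byte ranges it needs, while GeoJSON has no internal index, so the only request that works is "send me everything".

```python
import requests

# COG-style access: fetch the header, then only the tiles in the map window.
header = requests.get(
    "https://example.com/elevation.tif",
    headers={"Range": "bytes=0-16383"},     # a few KB, not the whole file
)

# GeoJSON access: no index to point at byte ranges, so the whole file
# crosses the wire before a single feature can be drawn.
everything = requests.get("https://example.com/coastline.geojson")
```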

![The Jack Sparrow worst pirate meme but for GeoJSON](/images/geojson.jpg)