---
layout: post
title: Is GeoJSON a spatial data format?
date: 2023-11-11
categories: gis
---
# Is GeoJSON a good spatial data format?

A few days ago on Mastodon [Eli Pousson](https://fosstodon.org/@[email protected])
asked:

> Can anyone suggest examples of files that can contain location info but aren't often considered spatial data
> file formats?
>
He suggested EXIF, Iván Sánchez Ortega (@[email protected]) followed up with spreadsheets, and, being devilish, I said GeoJSON.

This led to more discussion, with people asking why I thought that, so instead of being flippant I thought
about it. This blog post is the result of those thoughts, which I assumed were fairly obvious but, from
things people have said since, maybe aren't.

I've been a developer for most of my career, so my main interests in a spatial data format are that:

1. it stores my spatial data as I want it to,
2. it's fast to read and, to a lesser extent, to write,
3. it's easy to manage.

One seems obvious: if I store a point and then ask for it back, I want to get that point back (to the limit
of the processor's floating point precision). If a format can't manage that, then please don't use it.
This is not common, but Excel comes to mind as a program that takes good data and trashes it. If it isn't
changing [gene names into
dates](https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates) then
it's [reordering the dbf file to destroy your
shapefile](https://gis.stackexchange.com/questions/132359/how-is-attribute-data-in-dbf-file-tied-to-shapefile-location-data-in-shp-file).
GeoJSON can also fail at this, as the standard says that I must store the data in WGS 84 (lon/lat). That is
fine if I already store my data that way. But suppose I have some high quality OSGB data, carefully surveyed
to fractions of a millimetre, and the underlying code converts it to WGS 84 in the background. Suppose
further that the developer wanted to save space and limited the number of decimal places to, say, 6 (OK,
[that was me](https://osgeo-org.atlassian.net/browse/GEOT-6650)). When the data gets converted back to OSGB
I'm looking at errors of centimetres (or worse), but given the vagaries of floating point representation I
may not be able to tell. The sketch below shows the round trip.
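
A minimal sketch of that lossy round trip, assuming `pyproj` is installed (the point is made up for illustration):

```python
# EPSG:27700 is OSGB36 / British National Grid; EPSG:4326 is WGS 84.
from pyproj import Transformer

to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
to_osgb = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)

easting, northing = 651409.903, 313177.270   # a carefully surveyed location

# Write to GeoJSON: reproject, then truncate to 6 decimal places.
lon, lat = to_wgs84.transform(easting, northing)
lon, lat = round(lon, 6), round(lat, 6)

# Read it back and reproject to OSGB again.
e2, n2 = to_osgb.transform(lon, lat)
print(f"shifted by {abs(e2 - easting) * 1000:.1f} mm east, "
      f"{abs(n2 - northing) * 1000:.1f} mm north")
```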

Two comes from being a GeoServer developer: a largish chunk of the time taken to draw a web map (or stream
out a WFS file) is taken up by reading the data from the disk. Much of the rest of the time is spent
converting the data into a form that we can draw. Ideally, we only want to read in the features needed for
the map the user has requested (actually, ideally we want to **not** read most of the data at all by having
it already in the cache, but that is hard to do). So we like indexed datasets; both spatial indexes and
attribute indexes can substantially speed up map drawing. As the size of spatial datasets increases, the
time taken to fetch the next feature from the store becomes more and more important. An index allows the
program to skip to the correct place in the file, for either a specific feature, for features that are in a
specific place, or for features with a certain attribute matching the requested value. This is a great time
saver: imagine looking something up in a big book using the index, compared to paging through it reading
each page in turn.
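
A minimal sketch of the difference an index makes, assuming Shapely 2.x; the in-memory `STRtree` here stands in for the index a real format would keep on disk:

```python
import numpy as np
from shapely import STRtree, box, points

rng = np.random.default_rng(0)
geoms = points(rng.uniform(0, 1000, size=(100_000, 2)))  # 100k random points

tree = STRtree(geoms)           # build the index once, up front
window = box(10, 10, 20, 20)    # the extent the user actually asked for

hits = tree.query(window)       # indices of features whose boxes intersect it
print(f"touched {len(hits)} of {len(geoms)} features")
```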

After one or more indexes, the main thing I look for is a binary format that is easy to read (and write).
GeoJSON (and GML) are both problematic here as they are text formats (which is great in a transfer format),
so for every coordinate of every spatial object the computer has to read in a series of digits (and
punctuation) and convert them into an actual binary number that it can understand. This is a slow operation
(by computer speeds, anyway), and if I have a couple of million points in my coastline file then I don't
want to do 4 million slow conversions before I even think of drawing something.
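
A rough sketch of that cost, using only the standard library; the absolute numbers will vary by machine, but the ratio is the point:

```python
import array
import time

coords = [i * 1e-6 for i in range(2_000_000)]    # two million coordinates
as_text = [repr(c) for c in coords]              # how GeoJSON stores them
as_binary = array.array("d", coords).tobytes()   # how a binary format might

t0 = time.perf_counter()
parsed = [float(s) for s in as_text]             # parse digits one by one
t1 = time.perf_counter()
doubles = array.array("d")
doubles.frombytes(as_binary)                     # copy the doubles straight in
t2 = time.perf_counter()

print(f"text parse:  {t1 - t0:.3f} s")
print(f"binary read: {t2 - t1:.3f} s")
```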

Three, I have to interact with users on a fairly regular basis, and in a lot of cases they are not spatial
data experts. If a format comes with up to a dozen similarly named files (that are all important), yet a GIS
will refuse to process it unless you guess which one to open, then it is more of a pain than a help. And
yes, shapefile, I'm looking at you. If your process still makes use of shapefiles, please, please stop doing
that to your users (and the support team) and switch over to GeoPackages, which can store hundreds of data
sets inside a single file. All good GIS products can process them by now; they have been an OGC standard for
nearly 10 years. If you don't think that shapefiles are confusing, go and ask your support team how often
they have been sent just the `.shp` file (or 11 files but not the `.sbn`), or how often they have seen
people who have deleted all the non-`.shp` files to save disk space.

My other objection to GeoJSON is that I don't know what the structure (or schema) of the data set is until I
have read the entire file. The last record could add several bonus attributes; in fact any (or all) of the
records could do that, which from a parser's point of view is a nightmare. At least GML provides me with a
fixed schema and enforces it throughout the file.
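
A minimal sketch of the problem, using only the standard library; nothing stops any feature from carrying new properties, so the only way to know the columns is to scan every record:

```python
import json

doc = json.loads("""{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [0.0, 51.5]},
     "properties": {"name": "A"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-2.0, 53.5]},
     "properties": {"name": "B", "height": 12.5, "surveyed": true}}
  ]
}""")

columns = set()
for feature in doc["features"]:          # every record can invent new columns
    columns.update(feature["properties"])
print(sorted(columns))                   # ['height', 'name', 'surveyed']
```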

When I'm storing data (as opposed to transferring it), I use PostGIS. It's fast and accurate, can store my
data in whatever projection I choose, and is capable of interfacing with any GIS program I am likely to use.
If I'm writing new code, then it provides good, well tested libraries in all the languages I care about, so
I don't have to get into the weeds of parsing binary formats. If I fetch a feature from PostGIS it will have
exactly the attributes I was expecting, no more, no less. It has good indexes and a nifty DSL (SQL) that I
can use to express my queries, which get dealt with by a cool query optimiser that knows way more than I do
about how to access data in the database.
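
A minimal sketch of that workflow, assuming `psycopg` (version 3) and a hypothetical `roads` table with a `geom` column in OSGB; the `&&` bounding-box operator is what lets the planner use the spatial index:

```python
import psycopg

with psycopg.connect("dbname=gis") as conn:      # connection string is made up
    rows = conn.execute(
        """
        SELECT name, ST_AsBinary(geom)
        FROM roads
        WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 27700)
        """,
        (400000, 300000, 410000, 310000),        # the map window, in metres
    ).fetchall()

print(f"{len(rows)} roads in the requested window")
```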

If for some reason I need to access my data while I'm travelling, or share it with a colleague, then I will
use a GeoPackage, which is a neat little database all packaged up in a single file. It's not as quick as
PostGIS, so I wouldn't use it for millions of records, but for most day-to-day GIS data sets it's great. You
can even store your QGIS styles and project in it, to make it a single file project transfer format.
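
A minimal sketch, assuming GeoPandas; `fieldwork.gpkg` and the layer names are made up. One file can hold as many layers as you like:

```python
import geopandas as gpd
from shapely.geometry import Point

sites = gpd.GeoDataFrame(
    {"name": ["A", "B"]},
    geometry=[Point(-2.10, 53.50), Point(-2.20, 53.60)],
    crs="EPSG:4326",
)
sites.to_file("fieldwork.gpkg", layer="sites", driver="GPKG")   # one file...
sites.to_file("fieldwork.gpkg", layer="backup", driver="GPKG")  # ...many layers

print(gpd.read_file("fieldwork.gpkg", layer="sites"))
```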

One final point: I sometimes see people preaching that we should go cloud native (and often serverless) by
embracing "modern" standards like GeoJSON and COGs. GeoJSON should never be used as a cloud native storage
option (unless it's so small you can read it once and cache it in memory, in which case why are you using
the cloud?), as it is large (yes, I know it compresses well), slow to parse (and slower still if you
compressed it first), and can't be indexed. So that means you have to copy the whole file from a disk on the
far side of a slow internet connection. I don't care if you have fibre to the door, it is still slow
compared to the disk in your machine!
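
A rough sketch of the difference, with hypothetical URLs: a cloud-optimised format lets a client ask for just the byte ranges it needs, while GeoJSON has no internal index, so the only request that works is "send me everything".

```python
import requests

# COG-style access: fetch the header, then only the tiles in the map window.
header = requests.get(
    "https://example.com/elevation.tif",
    headers={"Range": "bytes=0-16383"},     # a few KB, not the whole file
)

# GeoJSON access: no index to point at byte ranges, so the whole file
# crosses the wire before a single feature can be drawn.
everything = requests.get("https://example.com/coastline.geojson")
```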

![The Jack Sparrow worst pirate meme but for GeoJSON](/images/geojson.jpg)