This repository contains a Python suite for the modeling and disaggregation of spatial variables. The suite interacts with a PostgreSQL/PostGIS database in which it stores and modifies its data. The workflow is carried out by a series of scripts run from the command line.
The suite requires Python 3.6+ with the packages listed in `requirements.txt`; you can install them using

```
pip install -r requirements.txt
```
Apart from that, the suite requires a working connection to a PostgreSQL 10+ database with the PostGIS extension, with the connection details specified in a JSON file like this:

```
{
  "dbname": "rum",
  "host": "localhost",
  "port": 5432,
  "user": "username",
  "password": "password"
}
```

The scripts look for this file in `config/dbconn.json` by default, but you can specify any other location using the `-d` command line option.
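For instance, a script can be pointed to a connection file stored elsewhere. A minimal sketch; the `-d` option is documented above, but passing the schema name positionally (see below) is an assumption for illustration:

```
# Point a script to a non-default connection file with -d.
# "myarea" stands for the analysis schema name; whether it is passed
# positionally like this is an assumption.
python init.py myarea -d /path/to/dbconn.json
```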
This lists all the scripts in the approximate order they should be run during an analysis. For more details on their functions, refer to their docstrings in the `.py` files.
Each analysis area should be contained in its own schema within the database. The schema name must be passed to all subsequent tools.

- `init.py` - initializes an analysis schema with a given name, creating it if necessary
- `create_extent.py` - creates an `extent` table from a given database table by unioning all its geometries together (this only makes sense once at least one layer has been imported)
- `create_grid.py` - creates a grid layer with a given cell size, covering the area of the `extent` table, on which the spatial indicators are modeled
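For illustration, a setup sequence might look roughly like this (a sketch only: the schema name is the documented argument, while the table-name argument and grid-size option are assumptions; see the script docstrings for the real interfaces):

```
python init.py myarea                       # create the analysis schema
# ...import at least one layer first (see the import scripts below)...
python create_extent.py myarea land_cover   # union a table's geometries into "extent" (table argument assumed)
python create_grid.py myarea --size 100     # grid covering the extent (size option assumed)
```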
These scripts import data into the analysis schema, possibly performing a CRS transformation. Unless explicitly instructed otherwise, they transform the CRS to match the `extent` table.
- `import_layer.py` - imports a GDAL/OGR/Fiona-compatible vector geodata file into the analysis schema
- `import_multi_layer.py` - imports multiple geodata files into a single table
- `import_osm.py` - imports an OSM file (raw or BZ2-compressed) as a set of tables
- `import_raster.py` - imports a raster file as a polygon layer, with a polygon for each raster cell
- `import_table.py` - imports a nonspatial table
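A hypothetical import session (the file paths and target table arguments are illustrative assumptions):

```
python import_layer.py myarea urban_atlas.shp land_cover   # vector file -> table (arguments assumed)
python import_osm.py myarea extract.osm.bz2                # OSM extract -> set of tables
python import_raster.py myarea srtm.tif                    # raster -> one polygon per cell
```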
These utilities perform some common tasks on the data tables that would be cumbersome to do manually.
- `recategorize.py` - performs a JSON LUT-based recategorization of a given data column, creating a new column
- `dissolve.py` - dissolves a polygon table based on a given field
- `calculate_shape.py` - calculates shape measure columns, such as perimeter index, for a polygon data layer
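For example, a LUT-based recategorization might be invoked like this (a sketch; the column and file arguments are assumptions, and the LUT format should be checked in the script docstring):

```
# Translate detailed land cover codes into coarser classes defined in a JSON
# lookup table, writing them to a new column (argument names assumed).
python recategorize.py myarea land_cover code2012 class_simple lut.json
# Add shape measure columns (e.g. perimeter index) to a polygon layer.
python calculate_shape.py myarea osm_buildings
```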
These scripts calculate data-based features for the analysis grid, creating `feat_` tables in the analysis schema. For performance reasons, the features are kept in tables separate from the grid table, matched to it by the common `geohash` column; they must be consolidated into a single table before modeling.
- `calculate_features.py` - calculates features by overlaying a given data layer over the analysis grid and applying a given method
- `calculate_features_by_file.py` - calculates multiple sets of features based on a JSON configuration
- `calculate_target.py` - calculates a target variable for the grid squares (e.g. by aggregating points with values)
- `calculate_neighbourhood.py` - calculates neighbourhood features from already existing features
- `create_condition.py` - uses an SQL expression to create a condition determining which grid squares are passed on to modeling
- `list_features.py` - lists all features computed in the given schema
- `clear_features.py` - clears all feature tables from the given schema
- `consolidate_features.py` - consolidates all feature tables in the schema into a single feature table: `all_feats`
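A hypothetical feature-building sequence (the layer, method, and configuration-file arguments are illustrative assumptions):

```
# Overlay a layer on the grid with a given method, creating a feat_ table.
python calculate_features.py myarea land_cover coverage    # method name assumed
# Or drive many feature calculations from a JSON configuration file.
python calculate_features_by_file.py myarea features.json
# Merge all feat_ tables (and the modeling condition, if any) into all_feats.
python consolidate_features.py myarea
```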
- `train_model.py` - trains a machine learning model to estimate a given target field value from the corresponding consolidated feature values and saves it (gzip-pickled) to a model file
- `train_model_array.py` - trains a set of machine learning models using different algorithms and saves them to a folder
- `introspect_model.py` - shows the internals of a trained model (feature coefficients or importances, depending on the model type)
- `merge_training_schemas.py` - merges two schemas, including their grids, feature tables and any specified target tables, into one, enabling common training on multiple areas
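A sketch of a training run (the target field and model file arguments are assumptions):

```
# Train a model predicting the target column from all_feats and gzip-pickle it.
python train_model.py myarea population model.pickle.gz
# Show feature coefficients or importances of the saved model.
python introspect_model.py model.pickle.gz
```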
- `apply_model.py` - estimates target values for each grid square by applying a given pretrained machine learning model to the consolidated features, storing them in a new table
- `apply_model_array.py` - applies multiple models to the same features
- `calibrate.py` - calibrates outputs to match a given field (e.g. by multiplication so that the column sums are equal); can also be applied after disaggregation
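Applying a saved model could then look like this (the output table argument is an assumption):

```
# Estimate the target for every grid square from all_feats using the saved
# model, writing the estimates into a new table.
python apply_model.py myarea model.pickle.gz estimate
```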
- `disaggregate.py` - disaggregates the values of a given source layer to the analysis grid using a given estimated weighting field, creating a new table
- `disaggregate_batch.py` - uses multiple weighting fields at once to obtain multiple disaggregated values for each grid square in a single table
- `disaggregate_raw.py` - performs disaggregation using a weighting layer other than the grid
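A hypothetical disaggregation call (the source table, value field, and weight field arguments are assumptions):

```
# Distribute municipality population counts to grid squares proportionally
# to the model-estimated weights.
python disaggregate.py myarea municipalities population estimate
```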
- `validate.py` - reports the regression accuracy of an estimated/disaggregated value on the analysis grid compared to a given ground truth value
- `validate_array.py` - reports the accuracies of multiple estimated/disaggregated values for the analysis grid from a single table
- `validate_multiscale.py` - reports the accuracies for the analysis grid also at higher areal aggregation levels, constructing a multiscale accuracy profile
- `validate_raw.py` - reports the accuracies for an estimate/disaggregate expressed on a different spatial support than the analysis grid
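A validation call might look like this (the column arguments are assumptions):

```
# Compare a disaggregated column on the grid against a ground-truth column.
python validate.py myarea disag_population true_population
```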
A typical workflow for disaggregation would be along these lines (this is the workflow used in the article cited below; a command-line sketch follows the list):

- `init.py` to create and initialize the schema
- `import_layer.py` to import a land cover shapefile (Urban Atlas)
- `create_extent.py` to create an extent polygon based on the land cover area
- `create_grid.py` to create a grid to disaggregate to
- `recategorize.py` to aggregate the land cover classes into a less granular classification to ease the modeling
- `import_osm.py` to import OSM data for the area
- `calculate_shape.py` to calculate shape indices for the OSM building layer
- `import_raster.py` to import an SRTM raster
- `import_layer.py` to import an address point layer with population counts for a part of the study area
- `calculate_features_by_file.py` to calculate features from the imported data
- `calculate_target.py` to calculate the target variable (population count) for the grid squares, to avoid excessive error and reduce run time
- `create_condition.py` to restrict the modeling to grid squares with a nonzero urban (built-up) land cover fraction
- `consolidate_features.py` to combine all the computed features and the modeling condition
- `train_model.py` to train the machine learning model to estimate the population counts
- `introspect_model.py` to see the internals of the model
- `apply_model.py` to estimate the rough population counts for the whole study area
- `import_layer.py` to import a municipality polygon layer with population counts for the whole study area
- `disaggregate.py` to use the rough population count estimates as weights to disaggregate the municipality population counts to grid squares
- `validate.py` to compare the disaggregated population counts to ground truth from the address points
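Expressed as a shell session, the workflow above could look roughly like this (every file name, table name, column name, and option except `-d` is an assumption for illustration; the actual interfaces are documented in the script docstrings):

```
python init.py myarea
python import_layer.py myarea urban_atlas.shp land_cover
python create_extent.py myarea land_cover
python create_grid.py myarea --size 100
python recategorize.py myarea land_cover code2012 class_simple lut.json
python import_osm.py myarea extract.osm.bz2
python calculate_shape.py myarea osm_buildings
python import_raster.py myarea srtm.tif
python import_layer.py myarea addresses.shp addresses
python calculate_features_by_file.py myarea features.json
python calculate_target.py myarea addresses population
python create_condition.py myarea "urban_fraction > 0"
python consolidate_features.py myarea
python train_model.py myarea population model.pickle.gz
python introspect_model.py model.pickle.gz
python apply_model.py myarea model.pickle.gz estimate
python import_layer.py myarea municipalities.shp municipalities
python disaggregate.py myarea municipalities population estimate
python validate.py myarea disag_population addresses_population
```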
Please cite the following article:

Šimbera, Jan (2019): Neighborhood features in geospatial machine learning: the case of spatial disaggregation. Cartography and Geographic Information Science. https://doi.org/10.1080/15230406.2019.1618201