title | tags | authors | affiliations | date | bibliography | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Museo ToolBox: A Python library for remote sensing including a new way to handle rasters. |
|
|
|
13 December 2019 |
paper.bib |
Museo ToolBox
is a Python library dedicated to the processing of georeferenced arrays, also known as rasters or images in remote sensing.
In this domain, classifying land cover type is a common and sometimes complex task, regardless of your level of expertise. Recurring procedures such as extracting Regions Of Interest (ROIs, or raster values from a polygon), computing spectral indices or validating a model with a cross-validation can be difficult to implement.
Museo ToolBox
aims at simplifying the whole process by making the main treatments more accessible (extracting of ROIs, fitting a model with cross-validation, computing Normalized Difference Vegetation Index (NDVI) or various spectral indices, performing any kind of array function to the raster, etc).
The main objective of this library is to facilitate the transposition of array-like functions into an image and to promote good practices in machine learning.
To make Museo ToolBox
easier to get started with, a full documentation with lot of examples is available online on read the docs.
Museo ToolBox
is organized into several modules (Figure 1):
- processing: raster and vector processing.
- cross-validation: stratified cross-validation compatible with scikit-learn.
- ai: artificial intelligence module built upon scikit-learn [@scikitlearn_2011].
- charts: plot confusion matrix with F1 score or producer/user's accuracy.
- stats: compute statistics (such as Moran's Index [@moran_notes_1950], confusion matrix, commision/omission) or extracting truth and predicted label from a confusion matrix.
The main usages of Museo ToolBox
are:
- Reading and writing a raster block per block using your own function.
- Generating cross-validation, including spatial cross-validation.
- Fitting models with scikit-learn, extracting accuracy from each cross-validation fold, and predicting raster.
- Plotting confusion matrix and adding f1 score or producer/user accuracy.
- Getting the y_true and and y_predicted labels from a confusion matrix.
Available in museotoolbox.processing
, the RasterMath
class is the keystone of Museo ToolBox
.
The question I asked myself is: How can we make it as easy as possible to implement array-like functions on images? The idea behind RasterMath
is that if the function is intended to operate with an array, it should be easy to use it with your raster using as few lines as possible.
So, what does RasterMath
really do? The user only works with an array and confirms with a sample that the process is doing well, and lets RasterMath
generalize it to the whole image. The user doesn't need to manage the raster reading and writing process, the no-data management, the compression, the number of bands, or the projection. Figure 2 describes how RasterMath
reads a raster, performs the function, and writes it to a new raster.
The objective is to allow the user to focus solely on the array-compatible function while RasterMath
manages the raster part.
See RasterMath
documentation and examples.
The artificial intelligence (ai
) module is natively built to implement scikit-learn
algorithms and uses state of the art methods (such as standardizing the input data). SuperLearner
class optimizes the fit process using a grid search to fix the parameters of the classifier. There is also a Sequential Feature Selection protocol which supports a number of components (e.g. a single-date image is composed of four bands, i.e. four features, so a user may select four features at once).
See the SuperLearner
documentation and examples.
Museo ToolBox
implements stratified cross-validation, which means the separation between the training and the validation samples is made by respecting the size per class.
For example the Leave-One-Out method will keep one sample of validation per class. As stated by @olofsson_good_2014 "stratified random sampling is a practical design that satisfies the
basic accuracy assessment objectives and most of the desirable design
criteria". For spatial cross-validation, see @karasiak_2019 inspired by @roberts_2017.
Museo ToolBox
offers two different kinds of cross-validation:
- Leave-One-Out.
- Leave-One-SubGroup-Out.
- Leave-P-SubGroup-Out (Percentage of subgroup per class).
- Random Stratified K-Fold.
- Spatial Leave-One-Out [@karasiak_2019].
- Spatial Leave-Aside-Out.
- Spatial Leave-One-SubGroup-Out (using centroids to select one subgroup and remove other subgroups for the same class inside a specified distance buffer).
See the cross-validation documentation and examples.
I acknowledge contributions from Mathieu Fauvel, beta-testers (hey Yousra Hamrouni), and my thesis advisors: Jean-François Dejoux, Claude Monteil and David Sheeren. Many thanks to Marie for proofreading.
Many thanks to Sigma students: Hélène Ternisien de Boiville, Arthur Duflos, Sam Antonetti and Anne-Sophie Tronc for their involvement in RasterMath
improvements in early 2020.