Using XArray and dask in satpy

Martin Raspaud edited this page Mar 1, 2018 · 24 revisions

XArray

import numpy as np
import xarray as xr

XArray's DataArray is now the standard data structure for arrays in satpy. A DataArray lets the array define dimensions, coordinates, and attributes (which we use for the metadata).

To create such an array, you can for example do:

my_dataarray = xr.DataArray(my_data, dims=['y', 'x'],
                            coords={'x': np.arange(...)},
                            attrs={'sensor': 'olci'})

my_data can be a regular numpy array, a numpy memmap, or, if you want to keep things lazy, a dask array (more on dask later).
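As a minimal, runnable sketch of the call above: the array contents and the `x` coordinate values below are made up for illustration and do not come from a real OLCI file.

```python
import numpy as np
import xarray as xr

# Hypothetical 3-line by 4-pixel image; the values are placeholders.
my_data = np.arange(12, dtype=np.float32).reshape(3, 4)

my_dataarray = xr.DataArray(
    my_data,
    dims=['y', 'x'],
    coords={'x': np.arange(4) * 1000.0},  # assumed projection x coordinates
    attrs={'sensor': 'olci'},
)
```

Printing `my_dataarray` shows the dimensions, coordinates and attributes next to the data.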

Dimensions

In satpy, the dimensions of the arrays should include:

  • x for the x or pixel dimension
  • y for the y or line dimension
  • bands for composites
  • time can also be provided, but we have limited support for it at the moment. Use metadata for common cases (start_time, end_time)

Dimensions are accessible through my_dataarray.dims. To get the size of a given dimension, use sizes:

my_dataarray.sizes['x']
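For instance, a composite with a bands dimension could be inspected as sketched here. The shape (3 bands, 4 lines, 5 pixels) is an arbitrary assumption.

```python
import numpy as np
import xarray as xr

# Hypothetical composite: 3 bands, 4 lines (y), 5 pixels (x).
composite = xr.DataArray(np.zeros((3, 4, 5)), dims=['bands', 'y', 'x'])

dims = composite.dims            # tuple of dimension names, in order
n_bands = composite.sizes['bands']
n_lines = composite.sizes['y']
n_pixels = composite.sizes['x']
```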

Coordinates

Coordinates can be defined for those dimensions when it makes sense:

  • x and y: they are usually defined when the data's area is an AreaDefinition, and they contain the projection coordinates in x and y.
  • bands: they contain the letter of the color they represent, e.g. ['R', 'G', 'B'] for an RGB composite.

This makes it possible to select, for example, a single band like this:

red = my_composite.sel(bands='R')

or even multiple bands:

red_and_blue = my_composite.sel(bands=['R', 'B'])
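A self-contained sketch of both selections, assuming a tiny made-up 2x2 RGB composite:

```python
import numpy as np
import xarray as xr

# Hypothetical RGB composite: band 'R' holds 0..3, 'G' 4..7, 'B' 8..11.
my_composite = xr.DataArray(
    np.arange(12).reshape(3, 2, 2),
    dims=['bands', 'y', 'x'],
    coords={'bands': ['R', 'G', 'B']},
)

# Selecting one label drops the 'bands' dimension.
red = my_composite.sel(bands='R')

# Selecting a list of labels keeps the 'bands' dimension.
red_and_blue = my_composite.sel(bands=['R', 'B'])
```

Note the difference in the result: `red` is two-dimensional, while `red_and_blue` still has a bands dimension of size 2.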

To read or assign the coordinates of the data array, use the following syntax:

x_coords = my_dataarray['x']
my_dataarray['y'] = np.arange(...)
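A small sketch of reading and assigning coordinates. The coordinate values are invented for the example.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.zeros((2, 3)),
    dims=['y', 'x'],
    coords={'x': [10.0, 20.0, 30.0]},  # made-up x coordinates
)

# Reading a coordinate returns a DataArray holding the coordinate values.
x_coords = arr['x']

# Assigning to a dimension name sets the coordinate for that dimension;
# the length must match the size of the dimension (2 lines here).
arr['y'] = np.array([200.0, 100.0])
```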

Attributes

To save metadata, we use the .attrs dictionary.

my_dataarray.attrs['platform'] = 'Sentinel-3A'

Some metadata should always be present in our DataArrays:

  • area: the area of the dataset. This should be handled in the reader.
  • start_time, end_time
  • sensor
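As a sketch, the expected metadata could be filled in and checked like this. The times are made up, the `missing` check is only an illustration and not a satpy function, and `area` is deliberately left out because it would normally be set by the reader.

```python
from datetime import datetime

import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((2, 2)), dims=['y', 'x'])

arr.attrs['platform'] = 'Sentinel-3A'
arr.attrs['sensor'] = 'olci'
arr.attrs['start_time'] = datetime(2018, 3, 1, 12, 0)  # hypothetical times
arr.attrs['end_time'] = datetime(2018, 3, 1, 12, 3)

# Illustrative check for the keys listed above ('area' is not set here).
expected = ('area', 'start_time', 'end_time', 'sensor')
missing = [key for key in expected if key not in arr.attrs]
```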

Operations on DataArrays

DataArrays support regular arithmetic operations, as one would expect of e.g. numpy arrays, with the exception that using an operator on two DataArrays requires both arrays to share the same dimensions, and the same coordinates if those are defined.
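A minimal sketch of arithmetic between DataArrays, using made-up values. The last lines rely on xarray's default behaviour of aligning operands on their coordinate labels with an inner join, which is why matching coordinates matter: only the labels present in both arrays survive.

```python
import numpy as np
import xarray as xr

coords = {'y': [0, 1], 'x': [0, 1, 2]}
a = xr.DataArray(np.full((2, 3), 10.0), dims=['y', 'x'], coords=coords)
b = xr.DataArray(np.full((2, 3), 4.0), dims=['y', 'x'], coords=coords)

diff = a - b        # element-wise, like numpy
scaled = a * 0.5    # scalars work as usual

# Same shape, but the x coordinates are shifted by one label.
b_shifted = b.assign_coords(x=[1, 2, 3])
overlap = a + b_shifted  # only x labels 1 and 2 exist in both arrays
```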

For mathematical functions like cos or log, use the ufuncs module:

import xarray.ufuncs as xu
cos_zen = xu.cos(zen_xarray)

Note that the xu functions also work on numpy arrays.
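Depending on the installed xarray version, the xarray.ufuncs module may be deprecated or unavailable. As an alternative sketch (an assumption about newer xarray releases, not something this page states), NumPy's own ufuncs can be applied directly to a DataArray and return a DataArray. The zenith angles below are made up.

```python
import numpy as np
import xarray as xr

# Hypothetical solar zenith angles in degrees.
zen_xarray = xr.DataArray(np.array([[0.0, 60.0]]), dims=['y', 'x'])

# NumPy ufuncs dispatch to xarray, so dimensions are preserved.
cos_zen = np.cos(np.deg2rad(zen_xarray))
```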

Further reading

http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray

Dask

import dask.array as da

The data part of the DataArrays we use in satpy is usually a dask Array. This allows lazy and chunked operations for efficient processing.
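To illustrate the laziness, here is a small sketch with an arbitrary 4x6 array: operations on a dask-backed DataArray only build a task graph, and the values are produced when compute() is called.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical dask-backed data, split into 2x3 chunks.
lazy_data = da.zeros((4, 6), chunks=(2, 3))
arr = xr.DataArray(lazy_data, dims=['y', 'x'])

result = arr + 1             # still lazy: result.data is a dask array
computed = result.compute()  # triggers the computation, data is now numpy
```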

To create a dask array from a numpy array, one can call the from_array function:

darr = da.from_array(my_numpy_array, chunks=4096)

The chunks keyword tells dask the size of a chunk of data. A single integer applies to every dimension, so if the numpy array is 3-dimensional, the chunk size provided above means that one chunk will be up to 4096x4096x4096 elements. To prevent this, one can provide a tuple:

darr = da.from_array(my_numpy_array, chunks=(4096, 1024, 2))

meaning a chunk will be 4096x1024x2 elements in size.
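The effect of the two forms can be checked on a small array. The sizes below are scaled down from the ones above so the example stays cheap to run.

```python
import dask.array as da
import numpy as np

my_numpy_array = np.zeros((100, 50, 2))

# One integer: the same chunk size is used along every dimension
# (chunks never exceed the size of the dimension itself).
same_everywhere = da.from_array(my_numpy_array, chunks=40)

# A tuple: one chunk size per dimension.
per_dimension = da.from_array(my_numpy_array, chunks=(40, 25, 2))

print(same_everywhere.chunks)
print(per_dimension.chunks)
```

The `chunks` attribute lists the chunk sizes along each dimension, and `numblocks` gives the number of chunks per dimension.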

Even more detailed chunk sizes can be provided if needed; see the dask documentation.

Helpful functions:

  • map_blocks
  • map_overlap
  • atop (renamed blockwise in later dask releases)
  • store
  • tokenize
  • compute
  • delayed
  • rechunk
  • vindex
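A short sketch combining three of these functions (map_blocks, rechunk and compute) on a made-up 4x4 array. The helper `double_block` is a hypothetical name used only for this example.

```python
import dask.array as da
import numpy as np

darr = da.from_array(np.arange(16.0).reshape(4, 4), chunks=2)


def double_block(block):
    """Operate on one chunk at a time; each block is a numpy array."""
    return block * 2


doubled = darr.map_blocks(double_block)  # lazy, applied chunk by chunk
rechunked = doubled.rechunk((4, 2))      # change the chunk layout, still lazy
result = rechunked.compute()             # run the graph, returns a numpy array
```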