-
Notifications
You must be signed in to change notification settings - Fork 300
Using XArray and dask in satpy
import xarray as xr
XArray's DataArray is now the standard data structure for arrays in satpy. They allow the array to define dimensions, coordinates, and attributes (that we use for the metadata).
To create such an array, you can do for example
my_dataarray = xr.DataArray(my_data, dims=['y', 'x'],
coords={'x': np.arange(...)},
attrs={'sensor': 'olci'})
my_data
can be a regular numpy array, a numpy memmap, or, if you want to keep things lazy, a dask array (more on dask later).
In satpy, the dimensions of the arrays should include
-
x
for the x or column or pixel dimension -
y
for the y or row or line dimension -
bands
for composites -
time
can also be provided, but we have limited support for it at the moment. Use metadata for common cases (start_time
,end_time
)
Dimensions are accessible through my_dataarray.dims
. To get the size of a given dimension, use sizes
:
my_dataarray.sizes['x']
Coordinates can be defined for those dimensions when it makes sense:
-
x
andy
: they are usually defined when the data's area is an AreaDefinition, and the contain the projection coordinates in x and y. -
bands
: they contain the letter of the color they represent, eg['R', 'G', 'B']
for an RGB composite.
This allows then to select for example a single band like this:
red = my_composite.sel(bands='R')
or even multiple bands:
red_and_blue = my_composite.sel(bands=['R', 'B'])
To access the coordinates of the data array, use the following syntax:
x_coords = my_dataarray['x']
my_dataarray['y'] = np.arange(...)
Most of the time, satpy will fill the coordinates for you, so you just need to provide the dimension names.
To save metadata, we use the .attrs
dictionary.
my_dataarray.attrs['platform_name'] = 'Sentinel-3A'
Some metadata that should always be present in our dataarrays:
-
area
the area of the dataset. This should be handled in the reader. -
start_time
,end_time
sensor
DataArrays work with regular arithmetic operation as one would expect of eg numpy arrays, with the exception that using an operator on two DataArrays requires both arrays to share the same dimensions, and coordinates if those are defined.
For mathematical functions like cos or log, use the ufuncs
module:
import xarray.ufuncs as xu
cos_zen = xu.cos(zen_xarray)
Note that the xu.something
function also work on numpy arrays.
In DataArrays, masked data is represented with NaN values. Hence the default type is float64
, but float32
works also in this case. XArray can't handle masked data for integer data, but in satpy we try to use the special _FillValue
attribute (in .attrs
) to handle this case. If you come across a case where this isn't handled properly, contact us.
Masking data from a condition can be done with:
result = my_dataarray.where(my_dataarray > 5)
Result is then analogous to my_dataarray, with values lower or equal to 5 replaced by NaNs.
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
import dask.array as da
The data part of the DataArrays we use in satpy are mostly dask Arrays. That allows lazy and chunked operations for efficient processing.
To create a dask array from a numpy array, one can call the from_array
function:
darr = da.from_array(my_numpy_array, chunks=4096)
The chunks keyword tells dask the size of a chunk of data. If the numpy array is 3-dimensional, the chunk size provide above means that one chunk will be 4096x4096x4096 elements. To prevent this, one can provide a tuple:
darr = da.from_array(my_numpy_array, chunks=(4096, 1024, 2))
meaning a chunk will be 4096x1024x2 elements in size.
Even more detailed sizes for the chunks can be provided if needed, see the dask documentation.
To avoid loading the data into memory when creating a dask array, other kinds of arrays can be passed to from_array
. For example, a numpy memmap allows dask to know where the data is, and will only be loaded when the actual values need to be computed. Another example is a hdf5 variable read with h5py.
Some procedural generation function are available in dask, eg meshgrid
or random.random
.
Regular arithmetic operations are provided, and generate another dask array.
arr1 = da.random.uniform(0, 1000, size=(1000, 1000), chunks=100)
arr2 = da.random.uniform(0, 1000, size=(1000, 1000), chunks=100)
arr1 + arr2
# -> dask.array<add, shape=(1000, 1000), dtype=float64, chunksize=(100, 100)>
In order to compute the actual data, use the compute
method:
(arr1 + arr2).compute()
->
array([[ 898.08811639, 1236.96107629, 1154.40255292, ...,
1537.50752674, 1563.89278664, 433.92598566],
[ 1657.43843608, 1063.82390257, 1265.08687916, ...,
1103.90421234, 1721.73564104, 1276.5424228 ],
[ 1620.11393216, 212.45816261, 771.99348555, ...,
1675.6561068 , 585.89123159, 935.04366354],
...,
[ 1533.93265862, 1103.33725432, 191.30794159, ...,
520.00434673, 426.49238283, 1090.61323471],
[ 816.6108554 , 1526.36292498, 412.91953023, ...,
982.71285721, 699.087645 , 1511.67447362],
[ 1354.6127365 , 1671.24591983, 1144.64848757, ...,
1247.37586051, 1656.50487092, 978.28184726]])
Dask also provides cos
, log
and other mathematical function, that you can use with da.cos
, da.log
. However, since satpy uses xarrays as standard data structure, prefer the xarray functions when possible (they call in turn the dask counterparts when possible).
Helpful functions:
map_blocks
map_overlap
atop
store
tokenize
compute
delayed
rechunk
vindex