Convenience functionality for reading from and writing to AnnData files with ImgLib2. Since reading and writing are done via the N5 API, the AnnData file can be stored as HDF5 (`.h5ad`), Zarr (`.zarr`), or in an N5 directory structure (`.n5ad`). Note that, so far, only HDF5 and Zarr are supported from the Python side.
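Each backend is accessed through the corresponding N5 reader or writer. A minimal sketch (the file paths are placeholders):

```java
import org.janelia.saalfeldlab.n5.N5FSReader;
import org.janelia.saalfeldlab.n5.N5Reader;
import org.janelia.saalfeldlab.n5.hdf5.N5HDF5Reader;
import org.janelia.saalfeldlab.n5.zarr.N5ZarrReader;

// choose the N5 backend that matches the storage format
final N5Reader hdf5Reader = new N5HDF5Reader("data/example.h5ad");
final N5Reader zarrReader = new N5ZarrReader("data/example.zarr");
final N5Reader n5Reader = new N5FSReader("data/example.n5ad");
```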
The central functionality is provided by the `AnnDataUtils` class, which contains static methods for reading and writing AnnData files. The reading methods in this package generally require an `N5Reader` instance that points to an existing AnnData container and a path to the group within the container. The data is read into an `Img<T>` in the case of numerical data, or a `List<String>` in the case of string data.
```java
// open an existing AnnData container, e.g., in HDF5 format
final N5Reader reader = new N5HDF5Reader("path/to/file.h5ad");

// read the central array 'X'; the data type has to be known (here, it is assumed to be float32)
final Img<FloatType> data = AnnDataUtils.readNumericalArray(reader, "X");

// read a string array from a sub-group of 'uns'
final List<String> stringArray = AnnDataUtils.readStringArray(reader, "uns/mapping1/mapping2/some_string_array");

// read the index of a dataframe (e.g., 'obs')
final List<String> index = AnnDataUtils.readDataFrameIndex(reader, "obs");

// get the number of variables
final long nVar = AnnDataUtils.getNVar(reader);
```
In addition to an `N5Writer` instance and the path to the group within the container, the writing methods require a dataset (either a `RandomAccessibleInterval<T>` or a `List<String>`), some options for N5 (compression, block size, etc.), and sometimes an `AnnDataFieldType` that specifies how the data should be stored in the AnnData container.
```java
// create a new AnnData container in Zarr format
final N5Writer writer = new N5ZarrWriter("path/to/file.zarr");

// initialize N5 options: GZIP compression, block size 4096 or 512x512, no parallel writing
final N5Options options1D = new N5Options(new int[]{4096}, new GzipCompression());
final N5Options options2D = new N5Options(new int[]{512, 512}, new GzipCompression());

// initialize an AnnData container with the given obs_names and var_names
AnnDataUtils.initializeAnnData(obs_names, var_names, writer, options1D);

// write a numerical array (a RandomAccessibleInterval<T>) to a layer
AnnDataUtils.writeNumericalArray(numericalData, writer, "layers/log_scale", options2D, AnnDataFieldType.DENSE_ARRAY);

// create a dataframe in 'varm' with an index (List<String>) and write a categorical array to it
AnnDataUtils.createDataFrame(index, writer, "varm/metadata", options1D);
AnnDataUtils.writeStringArray(barcodes, writer, "varm/metadata/barcodes", options1D, AnnDataFieldType.CATEGORICAL_ARRAY);

// finalize the AnnData container; this is needed for the container to be readable from Python
AnnDataUtils.finalizeAnnData(writer);
```
There are some details to be aware of when using this package more extensively.
AnnData only permits a limited set of paths directly under the root group (`obs`, `var`, `obsm`, `varm`, `obsp`, `varp`, `uns`, `X`, `layers`). This package uses the class `AnnDataPath`, together with the enum `AnnDataField`, to define valid paths. Such a path consists of a field (e.g., `varm`) and a sub-path (e.g., `metadata/barcodes`). To create a valid path, use the constructor `new AnnDataPath(AnnDataField.VARM, "metadata/barcodes")`, or the static method `AnnDataPath.fromString("varm/metadata/barcodes")`, which throws an exception if the path is invalid.
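Both ways to construct a path, as a short sketch (the `AnnDataPath` and `AnnDataField` names are taken from the description above; the exact exception type is not specified here):

```java
// construct a path from a field and a sub-path
final AnnDataPath path = new AnnDataPath(AnnDataField.VARM, "metadata/barcodes");

// or parse it from a string
final AnnDataPath samePath = AnnDataPath.fromString("varm/metadata/barcodes");

// parsing an invalid path (unknown root field) throws an exception
try {
    AnnDataPath.fromString("not_a_field/some_array");
} catch (final Exception e) {
    // handle or report the invalid path
}
```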
This package only supports a subset of the data types written by AnnData. In particular, scalars, nullable integers and booleans, and awkward arrays are not supported. The following table shows the supported data types and how they are represented in Java. Note that the classes for sparse arrays derive from `Img<T>` and have a generic parameter `I` for the type of the index. Conversion for sparse matrices is not done automatically, but can be done using the static method `SparseArray.convertToSparse` (see the sketch after the table). For categorical arrays, conversion is done automatically.
| AnnData data type | Java data type | `AnnDataFieldType` |
|---|---|---|
| array | `Img<T>` | `DENSE_ARRAY` |
| csr_matrix | `CsrMatrix<T,I>` | `CSR_MATRIX` |
| csc_matrix | `CscMatrix<T,I>` | `CSC_MATRIX` |
| dataframe | columns treated separately | `DATA_FRAME` |
| dict | group in a container | `MAPPING` |
| categorical | `List<T>` | `CATEGORICAL_ARRAY` |
| string-array | `List<T>` | `STRING_ARRAY` |
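As noted above, dense data can be converted into a sparse representation explicitly. A minimal sketch, assuming `SparseArray.convertToSparse` takes the dense image as its argument (the exact signature may differ):

```java
import net.imglib2.img.Img;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.type.numeric.real.DoubleType;

// a dense 2D image that is mostly zeros, so a sparse representation pays off
final Img<DoubleType> dense = ArrayImgs.doubles(100, 50);

// assumption: the method name is taken from this documentation, but the parameters
// and return type are assumed; since the sparse classes derive from Img<T>,
// the result can be used like any other Img
final Img<DoubleType> sparse = SparseArray.convertToSparse(dense);
```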
AnnData imposes some constraints on the data that is stored in the container.
These can be based on the field type (e.g., only `dense`, `csr_matrix`, or `csc_matrix` can be stored in `X`) or on the dimensions (e.g., only arrays of size `n_obs x m` can be stored in `obsm`). Data frames are a special case: they don't permit sparse arrays and only allow arrays of size `n x 1`, where `n` must be uniform over the whole dataframe. The constraints are enforced while writing by a `Checker`, which can be set via `AnnDataUtils.setChecker`. Possible checkers are `Checker.STRICT` (default), `Checker.ONLY_TYPE`, `Checker.ONLY_DIMENSION`, and `Checker.NONE`. Please note that only data written using `Checker.STRICT` is guaranteed to be readable from Python.
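For example, to relax checking when writing data that deviates from the AnnData specification (a short sketch using the constants listed above):

```java
// only check field types, ignore dimension constraints
AnnDataUtils.setChecker(Checker.ONLY_TYPE);

// ... write non-conforming data here ...

// restore the default, spec-conforming behavior
AnnDataUtils.setChecker(Checker.STRICT);
```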
While the data is mostly compatible with the latest version of AnnData in Python, there are some caveats, mostly related to HDF5:

- JHDF5 (which is used by N5-HDF5 under the hood) creates a housekeeping group `/__DATA_TYPES__` in the root group of the HDF5 file. When this group is present, AnnData will throw an error when trying to read the file. To avoid this, the group is deleted when finalizing the AnnData container.
- It seems that `h5py` reads most strings (i.e., attribute values that are strings as well as string datasets) written by JHDF5 as `bytes` instead of `str`. This might require some manual conversion using `.decode("utf-8")` when reading the file from the Python side.