Releases: bioml-tools/bio-datasets
v0.1.2
v0.1.1
v0.1.0
First Bio Datasets release.
Bio Datasets is an extension to the HuggingFace Datasets library, providing support for feature types relevant to biological applications of ML. It aims to make storing, sharing and loading of large ML-ready biological datasets as efficient and easy as possible.
In the first release, support is focussed on biomolecular structures via the feature types AtomArrayFeature, StructureFeature, ProteinAtomArrayFeature and ProteinStructureFeature.
Each feature type is stored in an efficient, modality-dependent format and loaded into a biotite AtomArray. We also offer the option to load features into custom Bio Datasets Python objects representing biomolecules. These objects (Biomolecule, BiomoleculeChain, BiomoleculeComplex, ProteinChain, ProteinComplex, DNAChain, RNAChain and SmallMolecule) are each wrappers around a biotite AtomArray, providing modality-specific standardisation as well as convenience methods that together are intended to reduce the pain of preprocessing these feature types for ML applications.
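The wrapper idea described above can be illustrated with a minimal sketch in plain Python. It does not use biotite or the Bio Datasets API: `Atom`, `ProteinChainSketch`, `backbone` and `residue_names` are hypothetical stand-ins, chosen only to show how a wrapper might standardise atom order on construction and expose modality-specific helpers.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for one row of an atom array (not a biotite class).
@dataclass(frozen=True)
class Atom:
    chain_id: str
    res_id: int
    res_name: str
    atom_name: str

# Assumed canonical ordering of protein backbone atoms within a residue.
BACKBONE_ORDER = ("N", "CA", "C", "O")


def _atom_rank(atom: Atom) -> int:
    """Backbone atoms sort first, in canonical order; all others sort after."""
    if atom.atom_name in BACKBONE_ORDER:
        return BACKBONE_ORDER.index(atom.atom_name)
    return len(BACKBONE_ORDER)


class ProteinChainSketch:
    """Hypothetical wrapper: holds atoms and adds protein-specific helpers."""

    def __init__(self, atoms: List[Atom]):
        # Standardisation step: order by residue, then by canonical atom rank.
        self.atoms = sorted(atoms, key=lambda a: (a.res_id, _atom_rank(a)))

    def backbone(self) -> List[Atom]:
        """Convenience method: keep only backbone atoms."""
        return [a for a in self.atoms if a.atom_name in BACKBONE_ORDER]

    def residue_names(self) -> List[str]:
        """Convenience method: one residue name per residue, in chain order."""
        seen = {}
        for a in self.atoms:
            seen.setdefault(a.res_id, a.res_name)
        return list(seen.values())
```

Because the standardisation happens once in the constructor, downstream preprocessing code can rely on a fixed atom order regardless of how the source file listed the atoms.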
The feature types are fully compatible with the HuggingFace Datasets library, meaning Bio Datasets datasets benefit from all the advantages of HuggingFace Datasets, including:
- straightforward handling of larger-than-memory datasets via memory-mapping
- straightforward sharing of datasets, together with their Feature metadata, to the HuggingFace Hub, and downloading them from it
- support for streaming larger-than-disk datasets directly from the Hub
And many more!
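To make the memory-mapping point above concrete, here is a minimal sketch using only the Python standard library. HuggingFace Datasets memory-maps Apache Arrow files; this example instead uses a hypothetical fixed-width binary layout (three float32 coordinates per record) and the hypothetical helpers `write_records` and `read_record`, purely to show how a single record can be read from disk without loading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Assumed record layout: x, y, z as little-endian float32 (12 bytes per record).
RECORD = struct.Struct("<3f")


def write_records(path, coords):
    """Write each (x, y, z) tuple as one fixed-width binary record."""
    with open(path, "wb") as f:
        for xyz in coords:
            f.write(RECORD.pack(*xyz))


def read_record(path, index):
    """Read one record through a memory map; only the touched pages are loaded."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), "coords.bin")
write_records(path, [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0), (6.0, 7.0, 8.0)])
print(read_record(path, 1))
```

The operating system pages data in on demand, so the cost of a lookup depends on the record accessed and not on the total file size.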