Releases: bioml-tools/bio-datasets

v0.1.2

15 Nov 02:00
3bf1aeb

Fixes to dataset upload example scripts

v0.1.1

08 Nov 20:21
56ce7ac

v0.1.0 bug fixes:

  • feature encoding / decoding
  • example dataset uploading scripts

v0.1.0

08 Nov 17:08
9627d4f

First Bio Datasets release.

Bio Datasets is an extension to the HuggingFace Datasets library, providing support for feature types relevant to biological applications of ML. It aims to make storing, sharing and loading of large ML-ready biological datasets as efficient and easy as possible.

This first release focuses on biomolecular structures, supported through four feature types: AtomArrayFeature, StructureFeature, ProteinAtomArrayFeature and ProteinStructureFeature.

Each feature type is stored in an efficient, modality-dependent format and loaded into a biotite AtomArray. Features can optionally be loaded into custom Bio Datasets Python objects representing biomolecules: Biomolecule, BiomoleculeChain, BiomoleculeComplex, ProteinChain, ProteinComplex, DNAChain, RNAChain and SmallMolecule. Each of these objects wraps a biotite AtomArray and adds modality-specific standardisation and convenience methods, which together are intended to reduce the pain of preprocessing these feature types for ML applications.
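The wrapper idea described above can be illustrated with a small plain-Python sketch. Everything in it is hypothetical: `ToyAtomArray` is a stand-in for an atom array (it is not biotite's class), `ToyProteinChain` is not the Bio Datasets `ProteinChain`, and sorting atoms into a canonical backbone order is only an assumed example of what "standardisation" could mean.

```python
from dataclasses import dataclass
from typing import List

# Assumed canonical order for backbone atoms within a residue.
BACKBONE_ORDER = ("N", "CA", "C", "O")


@dataclass
class ToyAtomArray:
    """Hypothetical stand-in for an atom array: parallel per-atom annotations."""
    res_id: List[int]
    res_name: List[str]
    atom_name: List[str]


class ToyProteinChain:
    """Hypothetical wrapper that standardises atom order and adds helpers."""

    def __init__(self, atoms: ToyAtomArray):
        def sort_key(i):
            name = atoms.atom_name[i]
            # Backbone atoms come first in canonical order, then the rest by name.
            rank = (BACKBONE_ORDER.index(name)
                    if name in BACKBONE_ORDER else len(BACKBONE_ORDER))
            return (atoms.res_id[i], rank, name)

        order = sorted(range(len(atoms.atom_name)), key=sort_key)
        # Store a reordered copy so the wrapped data is always standardised.
        self.atoms = ToyAtomArray(
            res_id=[atoms.res_id[i] for i in order],
            res_name=[atoms.res_name[i] for i in order],
            atom_name=[atoms.atom_name[i] for i in order],
        )

    def backbone(self):
        """Return (res_id, atom_name) pairs for backbone atoms only."""
        a = self.atoms
        return [(r, n) for r, n in zip(a.res_id, a.atom_name)
                if n in BACKBONE_ORDER]

    def residue_names(self):
        """Return one residue name per residue, in residue order."""
        names, seen = [], set()
        for r, n in zip(self.atoms.res_id, self.atoms.res_name):
            if r not in seen:
                seen.add(r)
                names.append(n)
        return names
```

The point of the pattern is that downstream ML code can rely on a consistent atom layout and call helpers on the wrapper, rather than repeating the same preprocessing for every dataset.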

The feature types are fully compatible with the HuggingFace Datasets library, meaning Bio Datasets datasets benefit from all the advantages of HuggingFace Datasets, including:

  • straightforward handling of larger-than-memory datasets via memory-mapping
  • straightforward sharing of datasets, with their Feature metadata, to the HuggingFace Hub, and downloading them from it
  • support for streaming larger-than-disk datasets directly from the Hub

And many more!
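To give a feel for why memory-mapping helps with larger-than-memory data, here is a minimal sketch using only the Python standard library. It is not how HuggingFace Datasets or Bio Datasets implement storage (HuggingFace Datasets memory-maps Arrow files); the fixed-size record layout and the function names `write_coords`, `read_coord` and `demo` are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

# Assumed record layout: one atom position (x, y, z) as little-endian float32.
RECORD = struct.Struct("<3f")


def write_coords(path, coords):
    """Write (x, y, z) tuples to a flat binary file of fixed-size records."""
    with open(path, "wb") as f:
        for xyz in coords:
            f.write(RECORD.pack(*xyz))


def read_coord(path, index):
    """Read one record by index through a memory map.

    Only the pages that are touched get read from disk, so the whole
    file never has to fit in memory.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)


def demo():
    """Write 1000 records to a temporary file and read one back by index."""
    coords = [(float(i), float(i) + 0.5, float(i) * 2.0) for i in range(1000)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_coords(path, coords)
        return read_coord(path, 742)
    finally:
        os.remove(path)
```

Random access by index costs the same regardless of file size, which is the property that makes memory-mapped datasets practical when the data exceeds available RAM.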