Update Lance backend to use native PyTorch integration (#46)

* Update tests and datasets
* Bump changelog for release
* Change version
* Add verbosity
* Fewer num_workers
* Update AnalyteDataset
wfondrie · Apr 17, 2024 · 98035ec
1 parent e957f95
commit 98035ec

Showing 7 changed files with 257 additions and 217 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -28,7 +28,7 @@ jobs:
 
     - name: Run unit and system tests
       run: |
-        pytest --cov=depthcharge tests/
+        pytest --cov=depthcharge --verbose tests/
 
     - name: Upload coverage to codecov
       uses: codecov/codecov-action@v3

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,11 +1,13 @@
-# Changelog for depthcharge
+# Changelog for Depthcharge
 All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
 
+## [v0.4.0]
+
 We have completely reworked of the data module.
 Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are converted either Parquet or streamed with PyArrow, optionally into Lance datasets.
 
@@ -18,6 +20,8 @@ We now also have full support for small molecules, with the `MoleculeTokenizer`,
 - Parsers can now be told to read arbitrary fields from their respective file formats with the `custom_fields` parameter.
 - The parsing functionality of `SpctrumDataset` and its subclasses have been moved to the `spectra_to_*` functions in the data module.
 - `SpectrumDataset` and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data
+- `SpectrumDataset` and its subclasses are now `lance.torch.data.LanceDataset` subclasses, providing native PyTorch integration.
+- All dataset classes now do not have a `loader()` method.
 
 ### Added
 - Support for small molecules.

diff --git a/depthcharge/data/analyte_datasets.py b/depthcharge/data/analyte_datasets.py
@@ -3,7 +3,7 @@
 from collections.abc import Iterable
 
 import torch
-from torch.utils.data import DataLoader, TensorDataset
+from torch.utils.data import TensorDataset
 
 from ..tokenizers import Tokenizer
 
@@ -38,23 +38,3 @@ def __init__(
     def tokens(self) -> torch.Tensor:
         """The peptide sequence tokens."""
         return self.tensors[0]
-
-    def loader(self, *args: tuple, **kwargs: dict) -> DataLoader:
-        """A PyTorch DataLoader for peptides.
-
-        Parameters
-        ----------
-        *args : tuple
-            Arguments passed initialize a torch.utils.data.DataLoader,
-            excluding ``dataset``.
-        **kwargs : dict
-            Keyword arguments passed initialize a torch.utils.data.DataLoader,
-            excluding ``dataset``.
-
-        Returns
-        -------
-        torch.utils.data.DataLoader
-            A DataLoader for the peptide.
-
-        """
-        return DataLoader(self, *args, **kwargs)
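
Migration note: the removed `loader()` method was a thin wrapper that returned `DataLoader(self, *args, **kwargs)`, so callers can likely get the same behavior by constructing the `DataLoader` themselves. The following is a minimal sketch of that pattern, not code from this commit. It assumes a plain `TensorDataset` as a stand-in for an analyte dataset (the diff suggests the class is built on `TensorDataset`), and the token values and `batch_size` are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for a tokenized analyte dataset:
# 6 sequences of 4 token ids each (values are illustrative only).
tokens = torch.arange(24).reshape(6, 4)
dataset = TensorDataset(tokens)

# Before this commit (no longer available):
#     loader = dataset.loader(batch_size=2)
# After this commit, build the DataLoader directly:
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# Each batch is a list with one tensor per tensor stored in the dataset,
# so the token batch is the first element.
batches = [batch[0] for batch in loader]
print(len(batches), tuple(batches[0].shape))
```

This sketch covers only the `TensorDataset`-based case shown in the hunk above. The Lance-backed spectrum datasets changed in the same commit may need different `DataLoader` arguments, which this excerpt does not show.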