Update Lance backend to use native PyTorch integration (#46)
* Update tests and datasets

* Bump changelog for release

* Change version

* Add verbosity

* Fewer num_workers

* Update AnalyteDataset
wfondrie authored Apr 17, 2024
1 parent e957f95 commit 98035ec
Showing 7 changed files with 257 additions and 217 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -28,7 +28,7 @@ jobs:
       - name: Run unit and system tests
         run: |
-          pytest --cov=depthcharge tests/
+          pytest --cov=depthcharge --verbose tests/
       - name: Upload coverage to codecov
         uses: codecov/codecov-action@v3
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -1,11 +1,13 @@
-# Changelog for depthcharge
+# Changelog for Depthcharge
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

+## [v0.4.0]

We have completely reworked of the data module.
Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are converted either Parquet or streamed with PyArrow, optionally into Lance datasets.

@@ -18,6 +20,8 @@ We now also have full support for small molecules, with the `MoleculeTokenizer`,
- Parsers can now be told to read arbitrary fields from their respective file formats with the `custom_fields` parameter.
- The parsing functionality of `SpctrumDataset` and its subclasses have been moved to the `spectra_to_*` functions in the data module.
- `SpectrumDataset` and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data
+- `SpectrumDataset` and its subclasses are now `lance.torch.data.LanceDataset` subclasses, providing native PyTorch integration.
+- All dataset classes now do not have a `loader()` method.

### Added
- Support for small molecules.
22 changes: 1 addition & 21 deletions depthcharge/data/analyte_datasets.py
@@ -3,7 +3,7 @@
from collections.abc import Iterable

import torch
-from torch.utils.data import DataLoader, TensorDataset
+from torch.utils.data import TensorDataset

from ..tokenizers import Tokenizer

@@ -38,23 +38,3 @@ def __init__(
     def tokens(self) -> torch.Tensor:
         """The peptide sequence tokens."""
         return self.tensors[0]
-
-    def loader(self, *args: tuple, **kwargs: dict) -> DataLoader:
-        """A PyTorch DataLoader for peptides.
-
-        Parameters
-        ----------
-        *args : tuple
-            Arguments passed initialize a torch.utils.data.DataLoader,
-            excluding ``dataset``.
-        **kwargs : dict
-            Keyword arguments passed initialize a torch.utils.data.DataLoader,
-            excluding ``dataset``.
-
-        Returns
-        -------
-        torch.utils.data.DataLoader
-            A DataLoader for the peptide.
-        """
-        return DataLoader(self, *args, **kwargs)
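
Migration note: the removed `loader()` method only forwarded its arguments to `DataLoader(self, *args, **kwargs)`, so existing callers can most likely build the loader themselves. The following is a minimal sketch, not code from this commit. It assumes a plain `TensorDataset` as a stand-in for `AnalyteDataset` (which subclasses it), and the token values and batch size are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for an AnalyteDataset: a TensorDataset holding padded token
# indices. The values below are hypothetical, not real tokenizer output.
tokens = torch.tensor(
    [
        [1, 2, 3, 0],
        [4, 5, 0, 0],
        [6, 7, 8, 9],
    ]
)
dataset = TensorDataset(tokens)

# Before this commit (method now removed):
#     loader = dataset.loader(batch_size=2)
# After: pass the dataset to a DataLoader directly.
loader = DataLoader(dataset, batch_size=2, shuffle=False)

# Each batch is a list with one tensor, mirroring TensorDataset's tuples.
batch_shapes = [tuple(batch[0].shape) for batch in loader]
print(batch_shapes)
```

Any positional or keyword arguments previously given to `loader()` can be passed to `DataLoader` unchanged, since the old method forwarded them as-is.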