All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `Tokenizer.detokenize()` now truncates the output to the first stop token it finds, if `trim_stop_token=True`.
- Add stop and start tokens for `AnnotatedSpectrumDataset`, when available.
- When `reverse` is used for the `PeptideTokenizer`, automatically reverse the decoded peptide.
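The two tokenizer behaviors above can be pictured with a plain-Python sketch. This is purely illustrative and not depthcharge's actual implementation; the function signature, the `"$"` stop token, and the interplay of trimming and reversal are assumptions for the example:

```python
def detokenize(tokens, stop_token="$", trim_stop_token=True, reverse=False):
    """Illustrative sketch: truncate at the first stop token, then
    optionally reverse the decoded sequence (mimicking a tokenizer
    that was created with reverse=True)."""
    if trim_stop_token and stop_token in tokens:
        tokens = tokens[: tokens.index(stop_token)]
    if reverse:
        tokens = tokens[::-1]
    return tokens

print(detokenize(["K", "E", "P", "$", "X"]))  # ['K', 'E', 'P']
```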
- Added support for unsigned modification masses that don't quite conform to the ProForma standard.
- The `scan_id` column for parsed spectra is now a string instead of an integer. This is less space efficient, but we ran into issues with Sciex indexing when trying to use only an integer.
- Partially revert length changes to `SpectrumDataset` and `AnnotatedSpectrumDataset`. We removed `__len__` from both due to problems with PyTorch Lightning compatibility.
- Simplify dataset code by removing redundancy with `lance.pytorch.LanceDataset`.
- Improved warning message for skipped spectra.
- The lengths of `SpectrumDataset` and `AnnotatedSpectrumDataset` now reflect the `samples` parameter of the `lance.pytorch.LanceDataset` parent class.
- The length of `SpectrumDataset` and `AnnotatedSpectrumDataset` is now the number of batches, not the number of spectra. This lets tools like PyTorch Lightning create their progress bars properly.
- Parsing a dataset no longer requires reading essentially the whole first file. Now the schema is inferred from the first 128 spectra.
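The new length semantics amount to reporting batches rather than spectra. A minimal sketch (hypothetical helper; the real classes derive this from the underlying Lance dataset):

```python
import math

def dataset_len(n_spectra: int, batch_size: int) -> int:
    """Number of batches, not spectra: the value __len__ now reports,
    so progress bars in PyTorch Lightning are sized correctly."""
    return math.ceil(n_spectra / batch_size)

print(dataset_len(1000, 128))  # 8
```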
- Significant updates to documentation, including how to model mass spectra.
- Reading and writing from cloud storage for everything!
- Migrated to Mike for mkdocs to manage multiple versions.
- Moved test GitHub Action from pip to uv.
We have completely reworked the data module. Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.
We now also have full support for small molecules, with the `MoleculeTokenizer`, `AnalyteTransformerEncoder`, and `AnalyteTransformerDecoder` classes.
- `PeptideTransformer*` models are now `AnalyteTransformer*`, providing full support for small molecule analytes. Additionally, the interface has been completely reworked.
- Mass spectrometry data parsers now function as iterators, yielding batches of spectra as `pyarrow.RecordBatch` objects.
- Parsers can now be told to read arbitrary fields from their respective file formats with the `custom_fields` parameter.
- The parsing functionality of `SpectrumDataset` and its subclasses has been moved to the `spectra_to_*` functions in the data module.
- `SpectrumDataset` and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data.
- `SpectrumDataset` and its subclasses are now `lance.torch.data.LanceDataset` subclasses, providing native PyTorch integration.
- All dataset classes no longer have a `loader()` method.
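The iterator-style parsing described above can be sketched with a plain-Python generator. This is a schematic stand-in only: the real parsers yield `pyarrow.RecordBatch` objects rather than dicts, and the field names here are illustrative assumptions:

```python
def parse_spectra(spectra, batch_size=2):
    """Yield spectra in fixed-size, column-oriented batches,
    loosely mimicking a pyarrow.RecordBatch (illustration only)."""
    for start in range(0, len(spectra), batch_size):
        chunk = spectra[start : start + batch_size]
        yield {
            "scan_id": [s["scan_id"] for s in chunk],  # strings, not ints
            "mz": [s["mz"] for s in chunk],
        }

spectra = [{"scan_id": str(i), "mz": [100.0 + i]} for i in range(5)]
batches = list(parse_spectra(spectra))
print(len(batches))  # 3
```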
- Support for small molecules.
- Added the `StreamingSpectrumDataset` for fast inference.
- Added `spectra_to_df`, `spectra_to_parquet`, and `spectra_to_stream` to the `depthcharge.data` module.
- Determining the mass spectrometry data file format is now less fragile. It now looks for known line contents, rather than relying on the extension.
- Support for fine-tuning the wavelengths used for encoding floating-point numbers like m/z and intensity in the `FloatEncoder` and `PeakEncoder`.
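The tunable wavelengths can be pictured with a pure-Python sinusoidal encoder. This is a sketch of the general technique under assumed parameter names (`min_wavelength`, `max_wavelength`); the real `FloatEncoder` is a PyTorch module whose parameterization may differ:

```python
import math

def sinusoidal_encode(value, dim=8, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode a float (e.g. an m/z value) as interleaved sines and
    cosines whose wavelengths span [min_wavelength, max_wavelength]."""
    half = dim // 2
    encoding = []
    for i in range(half):
        # Geometrically spaced wavelengths between the two bounds.
        exponent = i / max(half - 1, 1)
        wavelength = min_wavelength * (max_wavelength / min_wavelength) ** exponent
        angle = 2 * math.pi * value / wavelength
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding

vec = sinusoidal_encode(500.0)
print(len(vec))  # 8
```

Narrowing the wavelength range concentrates the encoder's resolution on the scale of the values it sees, which is the point of making these bounds tunable.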
- The `tgt_mask` in the `PeptideTransformerDecoder` was the incorrect type. Now it is `bool`, as it should be. Thanks @justin-a-sanders!
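The fix amounts to using a boolean causal mask. A minimal sketch of such a mask, independent of depthcharge's actual code (in PyTorch convention, `True` marks positions the decoder may not attend to):

```python
def causal_mask(size: int) -> list:
    """Boolean upper-triangular mask: entry [i][j] is True when
    position j lies in the future of position i (and must be masked)."""
    return [[j > i for j in range(size)] for i in range(size)]

mask = causal_mask(3)
print(mask[0])  # [False, True, True]
```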
- Providing a proper tokenization class (also resolves #24 and #18)
- First-class support for ProForma peptide annotations, thanks to `spectrum_utils` and `pyteomics`.
- Adding primitive dataclasses for peptides, peptide ions, mass spectra ... and even small molecules 🚀
- Adding type hints to everything and stricter linting with Ruff.
- Adding a ton of tests.
- Tight integration with `spectrum_utils` 💪
- Moving preprocessing onto parsing instead of data loading (similar to @bittremieux's proposal in #31)
- Combining the SpectrumIndex and SpectrumDataset classes into one.
- Changing peak encodings. Instead of encoding the intensity using a linear projection and summing with the sinusoidal m/z encodings, now the intensity is also sinusoidally encoded and is combined with the sinusoidal m/z encodings using a linear layer.
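The before/after of the peak-encoding change can be sketched schematically. All names here are hypothetical, the "linear" steps are stand-ins with fixed weights rather than learned layers, and the tiny sinusoidal helper is only for illustration:

```python
import math

def sin_encode(x, dim=4):
    """Tiny sinusoidal encoding used for illustration."""
    return [math.sin(x / (10 ** i)) for i in range(dim)]

def encode_peak_old(mz, intensity, dim=4):
    # Old scheme: linearly project the raw intensity and *sum* it
    # with the sinusoidal m/z encoding.
    weight = 0.5  # stand-in for a learned linear projection
    return [m + weight * intensity for m in sin_encode(mz, dim)]

def encode_peak_new(mz, intensity, dim=4):
    # New scheme: sinusoidally encode *both* values, then combine
    # them with a linear layer (here: an elementwise average).
    mz_enc = sin_encode(mz, dim)
    int_enc = sin_encode(intensity, dim)
    return [(m + i) / 2 for m, i in zip(mz_enc, int_enc)]

print(len(encode_peak_new(500.0, 0.8)))  # 4
```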
- Applied hotfix from v0.3.1
- Fixed retrieving version information.
- Change target mask from float to boolean.
- Log the number of spectra that are skipped due to an invalid precursor charge.
- Dropped pytorch-lightning as a dependency.
- Removed SpectrumDataModule
- Removed full-blown models (depthcharge.models)
- Fixed sinusoidal encoders (Issue #27)
- `MassEncoder` is now `FloatEncoder`, because it is generally useful for encoding floating-point numbers.
- pre-commit hooks and linting with Ruff.
- Tensorboard is now an optional dependency.
- The example de novo peptide sequencing model.
- The `detokenize()` method now returns a list instead of a string.
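Callers that relied on the old string behavior can join the returned list themselves (the tokens below are illustrative, not depthcharge output):

```python
tokens = ["P", "E", "P", "T", "I", "D", "E"]  # list, as detokenize() now returns
sequence = "".join(tokens)                    # recover the old string form
print(sequence)  # PEPTIDE
```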
- This is the first release! All changes from this point forward will be recorded in this changelog.