Merge branch 'main' into peak-performance-paper
michaelosthege authored Oct 12, 2024
2 parents e773a3e + ebcf131 commit 5a206ef
Showing 22 changed files with 2,454 additions and 29 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cla.yml
@@ -18,7 +18,7 @@ jobs:
steps:
- name: "CLA Assistant"
if: (github.event.comment.body == 'recheck' || contains(github.event.comment.body, 'I have read the CLA Document and I hereby sign the CLA')) || github.event_name == 'pull_request_target'
uses: contributor-assistant/github-action@v2.4.0
uses: contributor-assistant/github-action@v2.6.1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# The below token has repo scope for the project configured further below to store signatures.
35 changes: 8 additions & 27 deletions README.md
@@ -4,33 +4,14 @@
[![documentation](https://readthedocs.org/projects/peak-performance/badge/?version=latest)](https://peak-performance.readthedocs.io/en/latest)
[![DOI](https://zenodo.org/badge/713469041.svg)](https://zenodo.org/doi/10.5281/zenodo.10255543)

# How to use PeakPerformance
For installation instructions, see `Installation.md`.
For instructions regarding the use of PeakPerformance, check out the example notebook(s) under `notebooks`, the complementary example data under `example`, and the following introductory explanations.

## Preparing raw data
This step is crucial when using PeakPerformance. Raw data has to be supplied as time series: for each signal you want to analyze, save a NumPy array with time in the first dimension and intensity in the second (compare the example data). Both time and intensity should themselves be NumPy arrays. If you have, for example, the time and intensity of a signal as lists, you can use the following code to convert, format, and save them in the correct manner:

```python
import numpy as np
from pathlib import Path

# `time` and `intensity` are the lists (or other array-likes) of equal length
# mentioned above. Stacking yields a (2, n) array: time first, intensity second.
time_series = np.array([np.array(time), np.array(intensity)])
# Save to the raw data directory; adapt the path to your project.
np.save(Path(r"example_path/time_series.npy"), time_series)
```

The naming convention of raw data files is `<acquisition name>_<precursor ion m/z or experiment number>_<product ion m/z start>_<product ion m/z end>.npy`. There should be no underscores within the named sections such as `acquisition name`. Essentially, the raw data names include the acquisition and mass trace, thus yielding a recognizable and unique name for each isotopomer/fragment/metabolite/sample.
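
For illustration, a compliant file name could be assembled as in the following sketch; all values are hypothetical and only demonstrate the pattern:

```python
# Hypothetical metadata -- replace with your own values (no underscores inside a section).
acquisition = "A2400"  # acquisition name
precursor = "188"      # precursor ion m/z or experiment number
product_start = "110"  # product ion m/z start
product_end = "115"    # product ion m/z end

filename = f"{acquisition}_{precursor}_{product_start}_{product_end}.npy"
# -> "A2400_188_110_115.npy"
```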

## Model selection
When it comes to selecting models, PeakPerformance has a function that performs an automated selection process by analyzing one acquisition per mass trace with all implemented models. Subsequently, all models are ranked based on an information criterion (either Pareto-smoothed importance sampling leave-one-out cross-validation or the widely applicable information criterion). For this process to work as intended, you need to specify acquisitions with representative peaks for each mass trace (see example notebook 1). If, for example, most peaks of an analyte show a skewed shape, select an acquisition where this is the case. For double peaks, select an acquisition where the peaks are as distinct and as comparable in height as possible. A minimal sketch of the underlying comparison is shown below.
Since model selection is a computationally demanding and time-consuming process, it is suggested that the user state the model type directly (see example notebook 1) whenever possible.
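
Conceptually, this ranking corresponds to a model comparison with ArviZ. The following is a minimal sketch (not PeakPerformance's actual selection function), assuming `idata_normal` and `idata_skew` are inference data objects obtained by fitting two candidate models to the same acquisition:

```python
import arviz as az

# Assumed: both InferenceData objects contain log-likelihood data.
ranking = az.compare(
    {"normal": idata_normal, "skew_normal": idata_skew},
    ic="loo",  # PSIS-LOO-CV; use ic="waic" for the widely applicable information criterion
)
print(ranking)  # models ranked by expected log pointwise predictive density
```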

## Troubleshooting
### A batch run broke and I want to restart it.
If an error occurred in the middle of a batch run, you can use the `pipeline_restart` function from the `pipeline` module to create a new batch which will analyze only those samples that have not been analyzed previously.

### The model parameters don't converge and/or the fit does not describe the raw data well.
Check the separate file `How to adapt PeakPerformance to your data`.
# About PeakPerformance
PeakPerformance employs Bayesian modeling for chromatographic peak data fitting.
This has the innate advantage of providing uncertainty quantification while jointly estimating all peak parameters within a single peak model.
Since Markov chain Monte Carlo (MCMC) methods are utilized to infer the posterior probability distribution, convergence checks and the aforementioned uncertainty quantification are applied as novel quality metrics for robust peak recognition.
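
Such convergence checks can be run with standard ArviZ tooling. A generic sketch, assuming `idata` is the inference data object returned by a fit:

```python
import arviz as az

summary = az.summary(idata)
# r_hat values close to 1.0 and large effective sample sizes
# (ess_bulk, ess_tail) indicate converged MCMC chains.
print(summary[["r_hat", "ess_bulk", "ess_tail"]])
```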

# First steps
Be sure to check out our thorough [documentation](https://peak-performance.readthedocs.io/en/latest). It contains not only information on how to install PeakPerformance and prepare raw data for its application but also detailed treatises about the implemented model structures, validation with both synthetic and experimental data against a commercially available vendor software, exemplary usage of diagnostic plots and investigation of various effects.
Furthermore, you will find example notebooks and data sets showcasing different aspects of PeakPerformance.

# How to contribute
If you encounter bugs while using PeakPerformance, please bring them to our attention by opening an issue. When doing so, describe the problem in detail and add screenshots/code snippets and whatever other helpful material you can provide.
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -3,4 +3,5 @@ myst-nb
numpydoc
nbsphinx
sphinx-book-theme
sphinxcontrib.bibtex
sphinxcontrib.mermaid
13 changes: 13 additions & 0 deletions docs/source/conf.py
@@ -37,9 +37,22 @@
"numpydoc",
"myst_nb",
"sphinx_book_theme",
"sphinxcontrib.bibtex",
"sphinxcontrib.mermaid",
]
myst_enable_extensions = [
"amsmath", # needed for LaTeX math environments
"colon_fence",
"dollarmath", # needed for $ and $$ math
"html_image",
"replacements",
"strikethrough",
"tasklist",
]
nb_execution_mode = "off"
bibtex_bibfiles = ["literature.bib"]
bibtex_default_style = "plain"
bibtex_reference_style = "label"

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -35,6 +35,7 @@ The documentation features various notebooks that demonstrate the usage and inve
:maxdepth: 1

markdown/Installation
markdown/Preparing_raw_data
markdown/Peak_model_composition
markdown/PeakPerformance_validation
markdown/PeakPerformance_workflow
214 changes: 214 additions & 0 deletions docs/source/literature.bib
@@ -0,0 +1,214 @@
@misc{nutpie,
author = {Seyboldt, Adrian and {PyMC Developers}},
keywords = {Software},
license = {MIT},
title = {{nutpie}},
url = {https://github.com/pymc-devs/nutpie}
}

@article{scipy,
author = {Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E. and
Haberland, Matt and Reddy, Tyler and Cournapeau, David and
Burovski, Evgeni and Peterson, Pearu and Weckesser, Warren and
Bright, Jonathan and {van der Walt}, St{\'e}fan J. and
Brett, Matthew and Wilson, Joshua and Millman, K. Jarrod and
Mayorov, Nikolay and Nelson, Andrew R. J. and Jones, Eric and
Kern, Robert and Larson, Eric and Carey, C J and
Polat, {\.I}lhan and Feng, Yu and Moore, Eric W. and
{VanderPlas}, Jake and Laxalde, Denis and Perktold, Josef and
Cimrman, Robert and Henriksen, Ian and Quintero, E. A. and
Harris, Charles R. and Archibald, Anne M. and
Ribeiro, Ant{\^o}nio H. and Pedregosa, Fabian and
{van Mulbregt}, Paul and {SciPy 1.0 Contributors}},
title = {{{SciPy} 1.0: {F}undamental Algorithms for Scientific Computing in {P}ython}},
journal = {Nature Methods},
year = {2020},
volume = {17},
pages = {261--272},
adsurl = {https://rdcu.be/b08Wh},
doi = {10.1038/s41592-019-0686-2}
}

@article{matplotlib,
author = {Hunter, J. D.},
title = {Matplotlib: A 2D graphics environment},
journal = {Computing in Science \& Engineering},
volume = {9},
number = {3},
pages = {90--95},
abstract = {Matplotlib is a 2D graphics package used for Python for
application development, interactive scripting, and publication-quality
image generation across user interfaces and operating systems.},
publisher = {IEEE COMPUTER SOC},
doi = {10.1109/MCSE.2007.55},
year = 2007
}

@misc{matplotlibzenodo,
author = {{The Matplotlib Development Team}},
title = {Matplotlib: Visualization with Python},
keywords = {software},
month = may,
year = 2024,
publisher = {Zenodo},
version = {v3.9.0},
doi = {10.5281/zenodo.11201097},
url = {https://doi.org/10.5281/zenodo.11201097}
}

@article{RN173,
author = {Hoffman, Matthew D. and Gelman, Andrew},
title = {The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo},
journal = {Journal of Machine Learning Research},
volume = {15},
year = {2014},
type = {Journal Article}
}

@article{RN150,
author = {Abril-Pla, O. and Andreani, V. and Carroll, C. and Dong, L. and Fonnesbeck, C. J. and Kochurov, M. and Kumar, R. and Lao, J. and Luhmann, C. C. and Martin, O. A. and Osthege, M. and Vieira, R. and Wiecki, T. and Zinkov, R.},
title = {{PyMC}: a modern, and comprehensive probabilistic programming framework in Python},
journal = {PeerJ Computer Science},
volume = {9},
pages = {e1516},
issn = {2376-5992 (Electronic), 2376-5992 (Linking)},
doi = {10.7717/peerj-cs.1516},
url = {https://www.ncbi.nlm.nih.gov/pubmed/37705656},
year = {2023},
type = {Journal Article}
}

@book{RN162,
author = {Kruschke, John K.},
title = {Doing Bayesian Data Analysis},
edition = {1st Edition},
publisher = {Academic Press},
isbn = {9780123814852},
year = {2010},
type = {Book}
}

@article{RN144,
author = {Azzalini, A.},
title = {A class of distributions which includes the normal ones},
journal = {Scandinavian Journal of Statistics},
volume = {12},
pages = {171--178},
year = {1985},
type = {Journal Article}
}


@article{RN152,
author = {Gelman, Andrew and Rubin, Donald B.},
title = {Inference from Iterative Simulation Using Multiple Sequences},
journal = {Statistical Science},
volume = {7},
number = {4},
year = {1992},
type = {Journal Article}
}

@article{RN153,
author = {Grushka, E.},
title = {Characterization of exponentially modified Gaussian peaks in chromatography},
journal = {Analytical Chemistry},
volume = {44},
number = {11},
pages = {1733--1738},
issn = {0003-2700 (Print), 0003-2700 (Linking)},
doi = {10.1021/ac60319a011},
url = {https://www.ncbi.nlm.nih.gov/pubmed/22324584},
year = {1972},
type = {Journal Article}
}

@article{RN149,
author = {Hemmerich, J. and Noack, S. and Wiechert, W. and Oldiges, M.},
title = {Microbioreactor Systems for Accelerated Bioprocess Development},
journal = {Biotechnology Journal},
volume = {13},
number = {4},
pages = {e1700141},
issn = {1860-7314 (Electronic), 1860-6768 (Linking)},
doi = {10.1002/biot.201700141},
url = {https://www.ncbi.nlm.nih.gov/pubmed/29283217},
year = {2018},
type = {Journal Article}
}

@article{RN148,
author = {Kostov, Y. and Harms, P. and Randers-Eichhorn, L. and Rao, G.},
title = {Low-cost microbioreactor for high-throughput bioprocessing},
journal = {Biotechnology and Bioengineering},
volume = {72},
number = {3},
pages = {346--352},
issn = {0006-3592 (Print), 0006-3592 (Linking)},
doi = {10.1002/1097-0290(20010205)72:3<346::aid-bit12>3.0.co;2-x},
url = {https://www.ncbi.nlm.nih.gov/pubmed/11135205},
year = {2001},
type = {Journal Article}
}

@article{RN145,
author = {Vehtari, Aki and Gelman, Andrew and Gabry, Jonah},
title = {Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC},
journal = {Statistics and Computing},
volume = {27},
number = {5},
pages = {1413--1432},
issn = {0960-3174, 1573-1375},
doi = {10.1007/s11222-016-9696-4},
year = {2016},
type = {Journal Article}
}

@article{RN146,
author = {Watanabe, Sumio},
title = {Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory},
journal = {Journal of Machine Learning Research},
volume = {11},
pages = {3571--3594},
year = {2010},
type = {Journal Article}
}

@article{RN147,
author = {Kumar, Ravin and Carroll, Colin and Hartikainen, Ari and Martin, Osvaldo},
title = {ArviZ a unified library for exploratory analysis of Bayesian models in Python},
journal = {Journal of Open Source Software},
volume = {4},
number = {33},
issn = {2475-9066},
doi = {10.21105/joss.01143},
year = {2019},
type = {Journal Article}
}

@article{harris2020array,
title = {Array programming with {NumPy}},
author = {Harris, C. R. and Millman, K. J. and
{van der Walt}, S. J. and Gommers, R. and Virtanen, P. and
Cournapeau, D. and Wieser, E. and Taylor, J. and
Berg, S. and Smith, N. J. and Kern, R. and Picus, M.
and Hoyer, S. and {van Kerkwijk}, M. H. and
Brett, M. and Haldane, M. and del R{\'{i}}o, J. F. and Wiebe, M. and Peterson, P. and
G{\'{e}}rard-Marchant, P. and Sheppard, K. and Reddy, T. and
Weckesser, W. and Abbasi, H. and Gohlke, C. and
Oliphant, T. E.},
year = {2020},
month = sep,
journal = {Nature},
volume = {585},
number = {7825},
pages = {357--362},
doi = {10.1038/s41586-020-2649-2},
publisher = {Springer Science and Business Media {LLC}},
url = {https://doi.org/10.1038/s41586-020-2649-2}
}
14 changes: 14 additions & 0 deletions docs/source/markdown/Diagnostic_plots.md
@@ -0,0 +1,14 @@
# Diagnostic plots

An important feature of `PeakPerformance` is its easy access to diagnostic metrics for extensive quality control.
Using the data stored in the inference data object of a fit, the user can employ the ArviZ package to generate various diagnostic plots.
A particularly useful one is the cumulative posterior predictive plot portrayed in Figure 1.
This plot enables users to judge the quality of a fit and to identify instances of lack-of-fit.
As can be seen in the left plot, some predicted intensity values in the lowest quantile of the single peak example show a minimal lack-of-fit.
Importantly, such a deviation can be observed, judged, and quantified, which in itself represents a large improvement over the status quo.

```{figure-md} fig_d1
![](./Fig5_ppc.png)
__Figure 1:__ Cumulative posterior predictive plots created with the ArviZ package and pertaining to the example data of the single His peak (left) and the double Leu and Ile peak (right). The empirical cumulative density function (black) is in good agreement with the median posterior predictive (orange) and lies within the predicted variance (blue band), visualizing that the model provides an adequate prediction irrespective of the intensity value.
```
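
A plot of this kind can be generated with ArviZ's posterior predictive plotting. A minimal sketch, assuming `idata` is the inference data object of a fit that contains posterior predictive samples:

```python
import arviz as az
import matplotlib.pyplot as plt

# Assumed: `idata` contains posterior predictive samples from a PeakPerformance fit.
az.plot_ppc(idata, kind="cumulative")
plt.show()
```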
Binary file added docs/source/markdown/Fig1_model_single_peak.png
Binary file added docs/source/markdown/Fig2_model_double_peak.png
Binary file added docs/source/markdown/Fig3_PP-standalone.png
Binary file added docs/source/markdown/Fig4_peak_results.png
Binary file added docs/source/markdown/Fig5_ppc.png
Binary file added docs/source/markdown/Fig6_PP-validation.png
10 changes: 9 additions & 1 deletion Installation.md → docs/source/markdown/Installation.md
@@ -2,7 +2,15 @@
It is highly recommended to follow these steps:
1. Install the package manager [Mamba](https://github.com/conda-forge/miniforge/releases).
Choose the latest installer at the top of the page, click on "show all assets", and download the installer named "Mambaforge-<version number>-<name of your OS>.exe", e.g. "Mambaforge-23.3.1-1-Windows-x86_64.exe" for a 64-bit Windows operating system. Then execute the installer to install Mamba and activate the option "Add Mambaforge to my PATH environment variable".
(⚠ __WARNING__ ⚠: If you have already installed Miniconda, you can install Mamba on top of it but there are compatibility issues with Anaconda. The newest conda version should also work, just replace `mamba` with `conda` in step 2.)

```{caution}
If you have already installed Miniconda, you can install Mamba on top of it but there are compatibility issues with Anaconda.
```

```{note}
The newest conda version should also work; just replace `mamba` with `conda` in step 2.
```

2. Create a new Python environment (replace "name_of_environment" with your desired name) in the command line via
```
mamba create -c conda-forge -n name_of_environment pymc nutpie arviz jupyter matplotlib openpyxl "python=3.10"
```