Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature update: calculateFragments2 to include modifications #16

Merged
merged 25 commits into from
Jan 20, 2025

Conversation

guideflandre
Copy link
Contributor

The current function calculateFragments uses modifications as fixed modifications, thus not giving all possibilities or combinations of modifications. Instead, I propose calculateFragments2 that replaces the orginal modification parameter with these three parameters:

  • fixed_modifications applies fixed modifications regardless of max_mods. Much like how the current calculateFragments applies modifications
  • variable_modifications applies modifications based on all combinations possible and limited to the number of max_mods. If peptide ARGHKA has variable_modifications = c(A = 10, G = 20), max_mods = 2, then there will be 7 possibilities: "[A]R[G]HKA" "[A]RGHK[A]" "[A]RGHKA" "AR[G]HK[A]" "AR[G]HKA" "ARGHK[A]" "ARGHKA"
  • max_mods sets a limit for the maximus amount of variable modifications at once on the peptide. By default, it's set at +Inf (allowing all combinations possible - no restraint on the maximum amount of mods at once).

The returning value is a dataframe with the same columns and output as the original function, simply adding lines for new possibilities of modifications combinations. There is an additional column peptide that gives the modified sequence that the fragment belongs to (amino acids within brackets are modified, only variable modifications are shown).

I have added the necessary documentation on the function.

@lgatto @sgibb

@guideflandre
Copy link
Contributor Author

The function also resolves 2 of the current open issues in PSMatch

Copy link
Member

@lgatto lgatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I haven't looked at the code yet. This will follow.
  • I am wondering whether it wouldn't be best for immediately merge both functions. Or may be as a second PR, and only momentarily keep the two variants. Either there's a single function/code, ou there could be two, that are called based on the presence of variable modifications - something along the lines of
if (variable_modifications) {
      calculateFragments2(...)
} else {
   calculateFragments(...)
}

To be discussed as part of the review.

@lgatto
Copy link
Member

lgatto commented Dec 24, 2024

@guideflandre - please also amend the latest entry in the NEWS.md files.

@lgatto
Copy link
Member

lgatto commented Dec 24, 2024

Following up on the merging of calculateFragments() and calculateFragments2(), given that the former is a method and the latter a function, I suggest we keep the two separate first, merge the documentation, make sure that the unit tests confirm that their behaviour are identical when calculateFragments2() has no variable modifications, and consider merging later.

We will need to consider how to handle the modifiations from calculateFragments() and fixed_modifications from calculateFragments2(). I would suggest to keep both for backwards compatibility, and made the latter a synonym of the former, and if fixed_modifications is set, modifiations is ignored.

@sgibb
Copy link
Member

sgibb commented Dec 25, 2024

Dear @guideflandre, thanks for this contribution. Before looking at the code I am wondering whether we are really need/want this "brute-force" algorithm. I am afraid the number of combinations would explode easily.

If we want to solve #14 and #15 that both ask for a modification on a specific position we may should use a different solution like (partly) supporting ProForma (instead of creating tens, hundreds or thousands of combinations to get just one specific fragment with the modification the user is interested in).

Please see my PR #17 for an example implementation.

@lgatto
Copy link
Member

lgatto commented Dec 26, 2024

Hi @sgibb - I think this and #17 address partly overlapping, but different use cases. The brute force approach variable_modifications = c(S = 79) addresses the ProForma <S79>ARGSHKSATC syntax, but there would still be a need to compute all possible peptidoform sequences and their fragments if that's what we want. The use case we have here is really to compute all possible fragments with variable modifications.

@sgibb
Copy link
Member

sgibb commented Dec 26, 2024

@lgatto thanks for the clarification. In that case I agree that we should merge calculateFragments and calculateFragments2 to avoid code duplication and keep it simple for the user (or find a better name for calculateFragments2, e.g. calculateFragmentsVariableMods ...).

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't not completely review everything (e.g. I didn't look at the calculateFragment2 integrate of the calculated combinations yet).

@guideflandre
Copy link
Contributor Author

I benchmarked the current version of calculateFragments and calculateFragments2 and here are the results for a thousand runs without modifications:

> result <- microbenchmark(calculateFragments("PQRAGRTKIPK", verbose = FALSE), 
                           calculateFragments2("PQRAGRTKIPK", verbose = FALSE), 
                           times = 1000L)
> result
Unit: milliseconds
             expr         min        lq      mean     median     uq      max    neval
 calculateFragments      1.852632 1.917022 2.016044 1.949842 2.004322 14.56149  1000
 calculateFragments2     1.551952 1.620542 1.751102 1.647552 1.696492 16.93496  1000

calculateFragments2 seems to perform ever-so-slightly better. The same is true when benchmarking modifications against fixed_modifications. In case of variable_modificaitons, calculateFragments2 is naturally slower.

@guideflandre
Copy link
Contributor Author

guideflandre commented Jan 9, 2025

In order to merge calculateFragments and calculateFragments2, two "conflicts" arise:

  1. calculateFragments uses the paramater modifications whereas calculateFragments2 uses fixed_modifications. Their functionality is identical, simply their names differ. To avoid making users' code throw errors when the parameter modifications is called but not recognised by calculateFragments2, I added a warning at the start of the function saying modifications is deprecated, while replacing fixed_modifications with modifications:
calculateFragments2 <- function(sequence, 
                                fixed_modifications = c(C = 57.02146),
                                modifications = NULL, ...) {
    
    if (!is.null(modifications)) {
        warning("'modifications' is deprecated, please use 'fixed_modifications' instead.")
        fixed_modifications <- modifications
    }

This way, all cases where modifications == NULL | modifications != NULL and/or fixed_modifications == NULL | fixed_modifications != NULL result in the code working. This is but a temporary approach, ideally removing the modifications parameter in the future.
I can generate unit tests for these, do tell me if you think another approach should be used.

  1. The functions dependent on calculateFragments are: addFragments and plotSpectra from the Spectra package, the latter being dependent on the former. In case when variable_modifications is called, addFragments fails due to duplicated mz values. A remedy for this could be this edited version of addFragments where a list of labels is generated in case variable_modifications is called:
addFragments2 <- function (x, tolerance = 0, ppm = 20, ...) {
    stopifnot(requireNamespace("Spectra"))
    stopifnot(inherits(x, "Spectra"), length(x) == 1)
    stopifnot("sequence" %in% Spectra::spectraVariables(x))
    y <- Spectra::spectraData(x)[["sequence"]]
    x_data <- Spectra::peaksData(x)[[1L]]
    y_data <- calculateFragments2(y, verbose = FALSE, ...)
    
    if (length(unique(y_data$peptide))>1) { ## recognises that variable modifications have been used
        y_data <- split(y_data, y_data$peptide)
        labels <- list()
        for (i in 1:NROW(y_data)) {
            y_data[[i]] <- y_data[[i]][order(y_data[[i]]$mz), ]
            idx <- which(MsCoreUtils::common(x_data[, "mz"],
                                             y_data[[i]][,"mz"],
                                             tolerance = tolerance,
                                             ppm = ppm))
            idy <- which(MsCoreUtils::common(y_data[[i]][, "mz"],
                                             x_data[,"mz"], 
                                             tolerance = tolerance, 
                                             ppm = ppm))
            
            labels[[i]] <- rep(NA_character_, nrow(x_data))
            labels[[i]][idx] <- y_data[[i]][idy, "ion"]
        } 
        labels
    } else { ## current addFragment function
        y_data <- y_data[order(y_data$mz), ]
        idx <- which(MsCoreUtils::common(x_data[, "mz"],
                                         y_data[,"mz"],
                                         tolerance = tolerance,
                                         ppm = ppm))
        idy <- which(MsCoreUtils::common(y_data[, "mz"],
                                         x_data[,"mz"], 
                                         tolerance = tolerance, 
                                         ppm = ppm))
        
        labels <- rep(NA_character_, nrow(x_data))
        labels[idx] <- y_data[idy, "ion"]
        labels
    }
}

This way, the existing code will work with modifications and the plots are not disturbed as variable_modifications is not called. However, when variable_modifications is called, a list of labels is generated and the plotting throws an error.

I have been working on a new plotSpectra function that would allow variable_modifications as well as improve the visualisation (including the annotated fragments (color-coded) and the sequence in the plot):
image
I will create a PR for this new plotting function soon, but as the whole ecosystem is dependent on calculateFragments I thought it might be good to mention it here. In any case, we can start by accepting addFragments' annotated peaks from the sequence without variable modifications at first, and build on this when the new plotSpectra function is optimised to accept these modifications.

@lgatto @sgibb

Change modified peptide layout into AGC[57.02]AK instead of AG[C]AK to specify the modified mass
@lgatto
Copy link
Member

lgatto commented Jan 17, 2025

@sgibb - are you ok with this? If so, could you approve.

@lgatto
Copy link
Member

lgatto commented Jan 17, 2025

@guideflandre - once this PR is merged, I suggest you send a quick PR to replace calculateFragments() with the new version.

Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a minor thing to discuss. Otherwise I am fine with that PR.

@sgibb
Copy link
Member

sgibb commented Jan 20, 2025

I benchmarked the current version of calculateFragments and calculateFragments2 and here are the results for a thousand runs without modifications:

> result <- microbenchmark(calculateFragments("PQRAGRTKIPK", verbose = FALSE), 
                           calculateFragments2("PQRAGRTKIPK", verbose = FALSE), 
                           times = 1000L)
> result
Unit: milliseconds
             expr         min        lq      mean     median     uq      max    neval
 calculateFragments      1.852632 1.917022 2.016044 1.949842 2.004322 14.56149  1000
 calculateFragments2     1.551952 1.620542 1.751102 1.647552 1.696492 16.93496  1000

calculateFragments2 seems to perform ever-so-slightly better. The same is true when benchmarking modifications against fixed_modifications. In case of variable_modificaitons, calculateFragments2 is naturally slower.

Without doing this benchmark on my own. How could it be that the new version that does more is faster than the old one?

@sgibb
Copy link
Member

sgibb commented Jan 20, 2025

@guideflandre I really like your plotSpectra method! Could you integrate the delta plot as well (#13 ; and create a PR for this)?

@guideflandre
Copy link
Contributor Author

Without doing this benchmark on my own. How could it be that the new version that does more is faster than the old one?

In PSMatch, calculateFragments is called through setMethod. The version I built is not. I believe this might be the source for that ? I intended to replace calculateFragments with calculateFragments2 in a new PR, which is why I didn't build a new method.

@guideflandre
Copy link
Contributor Author

@guideflandre I really like your plotSpectra method! Could you integrate the delta plot as well (#13 ; and create a PR for this)?

I have encountered some hurdles with the integration of my version of plotSpectra, especially for the other functions plotSpectraMirror and and plotSpectraOverlay. I will create an appropriate issue to discuss those with you, Laurent and Johannes. The delta plot is indeed and interesting touch I will look into

@sgibb
Copy link
Member

sgibb commented Jan 20, 2025

Without doing this benchmark on my own. How could it be that the new version that does more is faster than the old one?

In PSMatch, calculateFragments is called through setMethod. The version I built is not. I believe this might be the source for that ? I intended to replace calculateFragments with calculateFragments2 in a new PR, which is why I didn't build a new method.

Ok, indeed dispatching takes much time.

@sgibb sgibb self-requested a review January 20, 2025 17:14
Copy link
Member

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix the documentation:

❯ checking for code/documentation mismatches ... WARNING
  Codoc mismatches from Rd file 'calculateFragments2.Rd':
  calculateFragments2
    Code: function(sequence, type = c("b", "y"), z = 1,
                   fixed_modifications = c(C = 57.02146),
                   variable_modifications = numeric(), max_mods = Inf,
                   neutralLoss = defaultNeutralLoss(), verbose = TRUE,
                   modifications = NULL)
    Docs: function(sequence, type = c("b", "y"), z = 1,
                   fixed_modifications = c(C = 57.02146),
                   variable_modifications = NULL, max_mods = Inf,
                   neutralLoss = defaultNeutralLoss(), verbose = TRUE)
    Argument names in code not in docs:
      modifications
    Mismatches in argument default values:
      Name: 'variable_modifications' Code: numeric() Docs: NULL

https://github.com/rformassspectrometry/PSMatch/actions/runs/12868423898/job/35885823856?pr=16

@guideflandre
Copy link
Contributor Author

Could you please fix the documentation:

Fixed ! Sorry for the delay

@guideflandre
Copy link
Contributor Author

Once again, fixed the documentation, sorry for this:

❯ checking Rd \usage sections ... WARNING
  Undocumented arguments in Rd file 'calculateFragments2.Rd'
    ‘modifications’
  
  Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

❯ checking R code for possible problems ... NOTE
  .modificationPositions : <anonymous>: no visible global function
    definition for ‘combn’
  Undefined global functions or variables:
    combn
  Consider adding
    importFrom("utils", "combn")
  to your NAMESPACE file.
```

@sgibb sgibb self-requested a review January 20, 2025 21:28
@sgibb sgibb merged commit a046b32 into rformassspectrometry:main Jan 20, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants