Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing "report_psms" to a super high value (eg >> 500) seems necessary to find peptides of very high rank (eg < 5) #143

Open
gsaxena888 opened this issue Jul 22, 2024 · 6 comments

Comments

@gsaxena888
Copy link

Sage 0.14.7 has been working great for me so far. However, I found a few circumstances when I couldn't explain why it didn't at least list a possible peptide within the top 500 or so. For example, in the attached tar file, I have provided a small mgf (of one entry) along with the config.json file, fasta file, and resulst.sage.tsv file. When I ran the sage0.14.7 search, I could not find the following entry of

C[+57.0216]EAC[+57.0216]PPGYSGPTHQGVGLAFAK

(which should have matched at least 5 MS2 m/z fragments), even though the search output listed peptides with fewer than 5 fragments matching it. However, when I change the report_psms parameter in the config.json file to 5000, as in:

 "report_psms": 5000,

sage0.14.7 finds (correctly) this peptide as ranks it as number 3 with 7 fragments matched. (I've noticed this for many other scenarios as well, which is why I've arbitrarily set the "report_psms" to a high number like 500, even though I only really care about the first 10 to 50 possible matches). Thoughts?

(Also, this may or may not be related, but I've noticed that (a) changing the "report_psms" parameter impacts the spectrum_q score, but I can't figure out why that should be the case since I would have thought that "report_psms" is simply a parameter to determine what gets reported out but it should not impact internal calculations; also, (b) I've noticed that setting "precursor_tol" to much much higher values than I technically need (e.g.,:

        "da": [
            -97.6,
           97.6
        ]
    },

(instead of +- 1.5Da), seems to substantially impact the spectrum_q scores. (In particular, when I experimented, it seemed like a value of +- 96Da would not only decrease the spectrum_q values substantially, but seemed to rank the possible peptides a bit more accurately. Is that the 'expected' behaviour and if so, could you help me/us understand why this is? (My guess is that it has to do with having sufficient false matches to model the spectrum_q score accurately). (And super minor note: but when I run sage -V, I think it prints out the wrong version of 0.14.6 instead of 0.14.7)

Thanks in advance!

proofOfMissingPeptide.tar.gz

@lazear
Copy link
Owner

lazear commented Jul 22, 2024

  • Sage uses linear discriminant analysis to rank all hits (e.g. all PSMs found in a search, across all spectra) together to try and maximize separation across targets and decoys. Modulating report_psms therefore changes the inputs to LDA, which will determine the q-value assigned to each PSM.
  • LDA is not going to work for scoring 5000 PSMs from the same spectrum. Your q-values will be inherently invalid/muted, since we expect there to be only a handful of "true" hits for any given spectrum, followed by an equal mixture of targets and decoys (as ranked by hyperscore/matched peaks/Xcorr/etc)
  • Internally, Sage ranks candidate hits by # of matched peaks. Some number of them (the greater of 50 or report_psms * 2) - ranked by # of matched peaks - are selected for further rescoring (hyperscore, ion ladder lengths, etc). If you have, say 100 candidate peptides with 7+ matched peaks, it is therefore possible to "miss" peptides with 7 matched peaks (is it really a missing peptide if it is not in the top 100 candidates ranked by # of matched peaks?) even with report_psms = 50

So with the above in mind, I'm not concerned with interactions between report_psms and spectrum_q - they are behaving as they should. However, I did verify the missing the peptide you mentioned, and that is strange behavior. I'll check it out and get back to you.

@gsaxena888
Copy link
Author

ok, cool -- I'm looking forward to hearing your thoughts regarding the alleged missing peptide etc. (One "ugly" workaround I'm thinking about, if a fix is time consuming etc, is to output an arbitrarily super high number of matches per spectra, eg, 5000, and then later filter out anything beyond say the top 50, since that "seems" to find the missing peptide, but a) I have no idea whether even 5000 is sufficiently large and b) I'm unsure if there are other hidden issues that this 'missing' peptide may be shedding light on, even if those issues are ultimately minor etc....)

@lazear
Copy link
Owner

lazear commented Jul 25, 2024

Ok, so this "missing" peptide is an edge case due to one of the default heuristics used for calculating the initial list of candidates.

By default, the parameter database.min_ion_index = 2. This removes b1/b2/y1/y2 ions from the fragment ion index, since they are less discriminating (many peptides will share b1/y1, etc).

As mentioned above, Sage calculates a preliminary score for all candidate peptides within the precursor tolerance of a spectrum. This score is the number of matched fragment ions from the index - so by default b1/b2/y1/y2 ions will not contribute to this score.

The candidate peptides are then ranked, and the top N are selected for full scoring - hyperscore, accurate # of matched peaks (including any ions removed from the fragment ion index, and without potentially counting the same fragment twice).

For the parameters you attached, the initial preliminary hit distribution will look like this:
image

It seems a safe assumption that the best hit will be in the top-N preliminary candidates. Normally, we select at least 50 candidates. In the example above, that ensures that the candidate with 10 peaks, the one with 6 peaks and some of the candidates with 5 peaks will be selected.

The peptide in your example is what I would call a pathological case - it has 7 matched peaks after full scoring, but only 3 preliminary matched peaks.

The goal of Sage is to be fast - and we sacrifice comprehensively scoring every possible candidate to achieve this, since it works very well in practice. You can set database.min_ion_index = 0 to include more ions in the initial fragment index, but otherwise this behavior is expected.

@gsaxena888
Copy link
Author

ok, silly follow-up question: how do we set database.min_ion_index to 0? I couldn't find that specific setting in the documented config.json description (or in the automatically outputted configuration json file etc). (Also, I'm assuming that if I set it to 0, I could revert the report_psms parameter back to what I really care about, e.g., somewhere between 10 to 50, and these 'edge cases' peptides should now appear even with the more realistic report_psms values?) Thanks in advance, Michael.

@lazear
Copy link
Owner

lazear commented Jul 25, 2024

See below for an example.
Also, for the peptide of interest here, the 3 fragments that are not matched in the preliminary pass are doubly-charged y17, y18, y19 ions - the final mass values here end up being > fragment_max_mz = 1500 (which is really fragment uncharged monoisotopic mass). I would recommend changing this parameter to a larger value in the meantime, and I will look into changing this default to account for multiply charged ions with high m/z values (which end up having uncharged masses over the default threshold).

With database.fragment_max_mz = 3500 and database.min_ion_index = 0, I can identify the peptide of interest using report_psms = 10

{
  "version": "0.14.7",
  "database": {
    "bucket_size": 8192,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": null,
      "max_len": null,
      "cleave_at": "KR",
      "restrict": null,
      "c_terminal": null,
      "semi_enzymatic": null
    },
    "fragment_min_mz": 150.0,
    "fragment_max_mz": 1500.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 0,
  },
}

@lazear
Copy link
Owner

lazear commented Jul 25, 2024

d1280a7 removes the fragment_min_mz and fragment_max_mz filters entirely. I expect this should decrease the number of edge cases like above - we'll see how it looks in some test datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants