A question from a Twitter DM, with minor edits:

I specified reporting 12 matches per scan and the PIN file has 12 lines for each scan (Comet default is 5 matches). [In olden times, the first match did not have a deltaCN value. You had to go to the second match to get the deltaCN. I compute an alternative deltaCN from the top hit to the average of xcorrs for hits 4 to 12 for my pipeline discriminant function. Jimmy has since changed deltaCN to be the difference between dissimilar sequences, so I can probably ditch my alternative deltaCN.]

Does mokapot ignore everything except for the “top” hit? If so, how do you define the top hit? There can be matches that are different peptide sequences but have the same xcorr values (top hit ties to the precision Jimmy outputs [4 decimal places]).

How many output lines would you usually set in the params file? Just one?

Do you know what units the dM column values are in? The experimental masses and calculated masses seem to be MH+ values in daltons. The differences in those values do not equal the numbers in the dM column. The numbers in the dM column are much smaller. The sign seems to agree (the delta is exp-calc).

Does mokapot use delta masses in daltons or in PPM (or does it not matter)?

The charge states look like a boolean grid with columns from 1 to 6. Are all 6 needed? I usually just accept 2+, 3+, and 4+ peptides on Orbi with Comet. Would 3 columns work or are all 6 needed?

One last question. When doing a PPM delta mass, what mass is usually used in the denominator (the experimental mass, the theoretical mass, or an average of the two values)?

Apologies for so many questions. Thanks.
Replies: 1 comment
Mokapot uses all of the hits during the semi-supervised model training, but only the top-hit is retained after target-decoy competition (TDC). This is necessary to maintain the theoretical guarantees that it can provide. Notably, the alternative "mix-max" procedure can allow for multiple PSMs per spectrum, but it is not implemented in mokapot; however, it is available with Percolator.
The top-hit for each spectrum is the one that has the highest score from the model that mokapot has learned. Ties, although extremely rare in this case, are broken randomly.
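To make the top-hit rule concrete, here is a minimal sketch of picking one PSM per spectrum by learned score, with ties broken at random. This is not mokapot's internal code. The function name `top_hits`, the `(spectrum_id, peptide, score)` tuple layout, and the `seed` argument are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def top_hits(psms, seed=0):
    """Keep one PSM per spectrum: the highest score, ties broken randomly.

    psms: iterable of (spectrum_id, peptide, score) tuples (hypothetical layout).
    Returns a dict mapping spectrum_id -> (peptide, score).
    """
    rng = random.Random(seed)  # seeded so the tie-breaking is reproducible

    # Group candidate PSMs by the spectrum they came from.
    by_spectrum = defaultdict(list)
    for spectrum_id, peptide, score in psms:
        by_spectrum[spectrum_id].append((peptide, score))

    best = {}
    for spectrum_id, hits in by_spectrum.items():
        top_score = max(score for _, score in hits)
        # All PSMs sharing the best score are equally eligible.
        tied = [hit for hit in hits if hit[1] == top_score]
        best[spectrum_id] = rng.choice(tied)
    return best
```

With real-valued model scores, the `tied` list will almost always hold a single PSM, which is why ties are much rarer here than with xcorr values rounded to 4 decimal places.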
Because mokapot does use all of the PSMs in training, it can sometimes be useful to provide more than one hit per spectrum. I typically use the top 5.
I'm not entirely sure for Comet, since they are calculated internally. If mokapot is used to read the PepXML from Comet or another search engine, a
For mokapot it doesn't matter. Each is considered a feature that mokapot can learn from, so in some cases, including both may be helpful.
I would filter out PSMs that do not match the charge states you would accept, then drop their corresponding columns prior to analyzing them with mokapot. If we filter PSMs after computing FDR, it invalidates the FDR estimate.
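A rough sketch of that pre-filtering step is below. It assumes the PIN rows have already been parsed into dictionaries and that the one-hot charge columns are named `Charge1` through `Charge6`. Check the header of your own PIN file, since the exact column names are an assumption here. The function `filter_charges` is hypothetical and not part of mokapot.

```python
def filter_charges(rows, keep=(2, 3, 4), all_charges=range(1, 7), prefix="Charge"):
    """Keep PSMs whose charge is in `keep` and drop the unused charge columns.

    rows: list of dicts, one per PSM, with one-hot charge columns such as
          "Charge1" ... "Charge6" (assumed names; verify against your file).
    """
    keep_cols = {f"{prefix}{z}" for z in keep}
    drop_cols = {f"{prefix}{z}" for z in all_charges} - keep_cols

    filtered = []
    for row in rows:
        # A PSM is retained only if one of the accepted charge flags is set.
        if any(int(row[col]) == 1 for col in keep_cols):
            # Remove the charge columns that can no longer vary.
            filtered.append({k: v for k, v in row.items() if k not in drop_cols})
    return filtered
```

The filtered rows would then be written back out as a PIN file before running mokapot, so the FDR estimate is computed only on the PSMs you would actually accept.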
Mokapot doesn't calculate PPM delta mass (it is provided by search engines). That being said, I normally calculate it using the theoretical mass in the denominator. My reason is that we're essentially saying, "if we assume spectrum X was generated by peptide Y, then the mass error would be..." I imagine folks calculate it in many different ways.
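For reference, the calculation with the theoretical mass in the denominator can be sketched as follows. The function name `ppm_error` is hypothetical, and both masses are assumed to be in the same units (for example, MH+ in daltons).

```python
def ppm_error(exp_mass, calc_mass):
    """Mass error in parts per million, relative to the theoretical mass.

    The sign follows the exp - calc convention.
    """
    return (exp_mass - calc_mass) / calc_mass * 1e6
```

For example, `ppm_error(1000.001, 1000.0)` is roughly 1 ppm. Using the experimental mass or the average in the denominator changes the result only slightly when the errors are a few ppm.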