Can MokaPot Run PSM files? #24

daishavdw · 2021-03-16T17:46:26Z

Hello,
I am trying to format my PSM files so that I can run them through MokaPot, but I am running into a couple of issues. Is it possible to run MokaPot directly with a PSM file as input, or do I need to reformat it as a PIN file?
Thank you!

wfondrie · 2021-03-16T20:13:30Z

Maintainer

Hi @daishavdw,

Peptide-spectrum matches (PSMs) have historically been reported in a wide variety of formats. There are two formats that can be used by mokapot directly:

PepXML - This is an XML format for reporting PSMs that is supported by a decent number of tools. We've tested mokapot with PSMs reported by Comet and MSFragger specifically, but any tool that reports its database search scores in the PepXML file should work.
PIN - This is the tab-delimited format that was developed for Percolator. Each column either represents metadata about a PSM or features that can be used to help score the PSMs.
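As a rough sketch of how the two directly supported formats map onto mokapot's readers: `mokapot.read_pin()` and `mokapot.read_pepxml()` are the library's parsers, while the helper names `pick_reader` and `load_psms` and the file-extension conventions below are assumptions made up for this illustration.

```python
from pathlib import Path


def pick_reader(path):
    """Return the name of the mokapot reader for a file, judged by its extension.

    The extension conventions here are an assumption for illustration;
    mokapot itself does not require them.
    """
    name = Path(path).name.lower()
    if name.endswith((".pep.xml", ".pepxml")):
        return "read_pepxml"
    if name.endswith(".pin"):
        return "read_pin"
    raise ValueError(f"{path} is neither PIN nor PepXML; build a LinearPsmDataset instead")


def load_psms(path):
    """Parse a PIN or PepXML file into a mokapot dataset."""
    import mokapot  # imported lazily so pick_reader() works without mokapot installed

    reader = getattr(mokapot, pick_reader(path))
    return reader(str(path))


print(pick_reader("run1.pin"))        # read_pin
print(pick_reader("sample.pep.xml"))  # read_pepxml
```

Any other tabular format, such as a plain TSV export, falls through to the `ValueError` and would need the DataFrame route described next.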

Additionally, mokapot can use any pandas DataFrame, so long as the required columns are present. To do this, see the LinearPsmDataset documentation.

An important aspect of any input format is the set of features that are defined for mokapot. These features are scores from the database search engine or properties of the PSM that together help distinguish good PSMs from poor PSMs. Features are explicitly provided by tools when they output a PIN file and are automatically extracted from PepXML files by mokapot. When crafting your own features, one thing to keep in mind is that the features need to be relevant to the quality of a PSM, not merely good at distinguishing between target and decoy PSMs.
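To illustrate that last point, here is a crude sanity check, not something mokapot does itself. Incorrect target PSMs should look like decoys, so a quality-related feature is expected to overlap between the two groups, whereas a feature that splits them perfectly is probably encoding the label. The function name and all values are made up for this sketch.

```python
def perfectly_separates(values, is_target):
    """True if every target value lies strictly on one side of every decoy value."""
    targets = [v for v, t in zip(values, is_target) if t]
    decoys = [v for v, t in zip(values, is_target) if not t]
    if not targets or not decoys:
        return False
    return min(targets) > max(decoys) or max(targets) < min(decoys)


# Made-up values for illustration only.
is_target = [True, True, True, False, False, False]
score = [95.0, 12.0, 60.0, 15.0, 40.0, 8.0]  # overlaps: some targets are poor matches
decoy_prefix = [0, 0, 0, 1, 1, 1]            # 1 if the protein name has a decoy prefix

print(perfectly_separates(score, is_target))         # False
print(perfectly_separates(decoy_prefix, is_target))  # True
```

A column like `decoy_prefix` separates targets from decoys but says nothing about match quality, so it should not be used as a feature.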

daishavdw Mar 25, 2021
Author

Thank you so much. Using the LinearPsmDataset class allowed us to load our data. From what we understand, the feature columns need to have numerical values that relate to the quality of the PSM, and other columns need to be dropped.
We are loading data from MsgfPlus into MokaPot. We have kept and dropped the following columns, and it is working. Is this the optimal way to be using MokaPot?

We have kept: scan, Precursor, IsotopeError, PrecursorError(ppm), Charge,
DeNovoScore, MSGFScore, SpecEValue, EValue, QValue,
probability, peptide, target_column
We dropped: Protein, SpecID, FragMethod,
#SpecFile, Peptide

wfondrie · 2021-03-25T17:42:27Z

Maintainer

Awesome - I'm glad it's working for you. I just have a few pieces of advice:

We have kept: scan, Precursor, IsotopeError, PrecursorError(ppm), Charge,
DeNovoScore, MSGFScore, SpecEValue, EValue, QValue,
probability, peptide, target_column
We dropped: Protein, SpecID, FragMethod,
#SpecFile, Peptide

It is difficult to comment without knowing what each column contains. My guess is that they are the following:

scan - The integer scan number.
Precursor - A string containing the peptide sequence, modifications, and charge state.
IsotopeError - Integer indicating the isotope that was detected.
PrecursorError(ppm) - The error between the observed and theoretical precursor m/z in ppm.
Charge - The charge state as an integer.
DeNovoScore, MSGFScore, SpecEValue, EValue - Scores from MSGF+.
QValue, probability - Confidence estimates from MSGF+. I would be hesitant to use these as features as they may have used the target and decoy labels for their calculation.
peptide - A string indicating the peptide sequence with modifications.
target_column - A boolean indicating if the peptide is a target or decoy.
Protein - The protein(s) that the peptide may have been generated from (string).
SpecID - A string that uniquely identifies a mass spectrum.
FragMethod - A string indicating the fragmentation method.
#SpecFile - An int or string that specifies what file the spectrum is from.
Peptide - I would think that this is the same as "peptide" above.
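One way to check guesses like these against the actual table is to confirm that every column intended as a feature is present and numeric. The sketch below uses pandas only; the function name and the tiny example table are made up for illustration and do not come from a real MSGF+ export.

```python
import pandas as pd


def check_feature_columns(df, features):
    """Map each problematic feature column to 'missing' or 'not numeric'."""
    problems = {}
    for col in features:
        if col not in df.columns:
            problems[col] = "missing"
        elif not pd.api.types.is_numeric_dtype(df[col]):
            problems[col] = "not numeric"
    return problems


# Tiny made-up table standing in for an MSGF+ export.
example = pd.DataFrame({
    "IsotopeError": [0, 1],
    "PrecursorError(ppm)": [1.2, -3.4],
    "Charge": [2, 3],
    "MSGFScore": [150, 42],
    "SpecEValue": ["1e-20", "3e-8"],  # stored as text, so not usable as-is
})

candidates = ["IsotopeError", "PrecursorError(ppm)", "Charge",
              "DeNovoScore", "MSGFScore", "SpecEValue"]
print(check_feature_columns(example, candidates))
```

In this made-up table, `DeNovoScore` is reported as missing and `SpecEValue` as not numeric; a text column like that could be converted with `pd.to_numeric` before use.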

If this is the case, I would create a LinearPsmDataset using something like:
(Note that I haven't actually run this code, so there might be a typo here or there.)

import mokapot
import pandas as pd

# Load the data from "psms.tsv" into a DataFrame:
df = pd.read_table("psms.tsv")

# Drop the columns I suspect are problematic:
df = df.drop(columns=["QValue", "probability"])

# Charge is often better as a one-hot encoded feature.
# Note that the column is named "Charge" (capitalized) in this table:
charge_feat = pd.get_dummies(df["Charge"], prefix="charge", dtype=int)
df = pd.concat([df, charge_feat], axis=1)

# Create the dataset. Column names must match the table exactly:
psms = mokapot.LinearPsmDataset(
    psms=df,  # The dataframe
    target_column="target_column",
    spectrum_columns=("scan", "SpecID", "#SpecFile"),
    peptide_column="peptide",
    protein_column="Protein",
    feature_columns=(list(charge_feat.columns) + [
        "IsotopeError",
        "PrecursorError(ppm)",
        "DeNovoScore",
        "MSGFScore",
        "SpecEValue",
    ]),
)

However, what might be easier is to use the MSGF+ converter (https://github.com/percolator/percolator/wiki/Interface#converters), msgf2pin, that is provided by Percolator (http://percolator.ms/). I have plans to add native support for MSGF+ to mokapot down the road, but it will be a while.
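To make the one-hot encoding step in the code above concrete, here is a small self-contained sketch of what `pd.get_dummies` produces for a charge column. The charge values are made up for illustration.

```python
import pandas as pd

# Made-up charge states for four PSMs.
df = pd.DataFrame({"Charge": [2, 3, 2, 4]})

# One indicator column per observed charge state, named "charge_<value>".
charge_feat = pd.get_dummies(df["Charge"], prefix="charge", dtype=int)

print(list(charge_feat.columns))     # ['charge_2', 'charge_3', 'charge_4']
print(charge_feat.iloc[0].tolist())  # [1, 0, 0]
```

Each PSM gets a 1 in the column for its own charge state and 0 elsewhere, so the model can treat each charge state separately instead of assuming the score changes steadily with charge.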

daishavdw · 2021-03-25T20:29:00Z

Author

I will try that. Thank you so much!
