-
Hi @daishavdw,
Peptide-spectrum matches (PSMs) have historically been reported in a wide variety of formats. Two of them can be read by mokapot directly: the Percolator tab-delimited format (PIN files) and PepXML.
Additionally, mokapot can use any pandas DataFrame, so long as the required columns are present. To do this, see the LinearPsmDataset documentation.
An important aspect of any input format is the set of features that are defined for mokapot. These features are scores from the database search engine or properties of the PSM that together help distinguish good PSMs from poor PSMs. Features are explicitly provided by tools when they output a PIN file and are automatically extracted from PepXML files by mokapot.
When crafting your own features, one thing to keep in mind is that the features need to be relevant to the quality of a PSM, not merely good at distinguishing between target and decoy PSMs.
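For the two natively supported formats, loading could be sketched as below. This is an illustrative addition rather than part of the reply above: the helper names `reader_name` and `load_psms` and the file-extension conventions are assumptions, while `mokapot.read_pin` and `mokapot.read_pepxml` are the mokapot readers for those formats. mokapot is imported lazily so that the extension check can be tried without it installed.

```python
from pathlib import Path


def reader_name(path):
    """Return the name of the mokapot reader matching a file name.

    The extension conventions used here are assumptions for illustration.
    """
    name = Path(path).name.lower()
    if name.endswith((".pepxml", ".pep.xml")):
        return "read_pepxml"
    if name.endswith(".pin"):
        return "read_pin"
    raise ValueError(
        f"{path!r} is neither PIN nor PepXML; build a pandas DataFrame "
        "and wrap it in a LinearPsmDataset instead."
    )


def load_psms(path):
    """Load PSMs from a PIN or PepXML file with the matching mokapot reader."""
    import mokapot  # imported lazily; only needed when a file is actually read

    reader = getattr(mokapot, reader_name(path))
    return reader(str(path))
```

Under these assumptions, `load_psms("run1.pin")` would return a collection of PSMs that can then be passed to `mokapot.brew`. Any other tabular file would raise the `ValueError` and need the DataFrame route instead.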
-
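Awesome - I'm glad it's working for you. I just have a few pieces of advice:
We have kept: scan, Precursor, IsotopeError, PrecursorError(ppm), Charge, DeNovoScore, MSGFScore, SpecEValue, EValue, QValue, probability, peptide, target_column
We dropped: Protein, SpecID, FragMethod, #SpecFile, Peptide
It is difficult to comment without knowing what each column contains. My guess is that they are the following:
scan - The integer scan number.
Precursor - A string containing the peptide sequence, modifications, and charge state.
IsotopeError - Integer indicating the isotope that was detected.
PrecursorError(ppm) - The error between the observed and theoretical precursor m/z in ppm.
Charge - The charge state as an integer.
DeNovoScore, MSGFScore, SpecEValue, EValue - Scores from MSGF+.
QValue, probability - Confidence estimates from MSGF+. I would be hesitant to use these as features as they may have used the target and decoy labels for their calculation.
peptide - A string indicating the peptide sequence with modifications.
target_column - A boolean indicating if the peptide is a target or decoy.
Protein - The protein(s) that the peptide may have been generated from (string).
SpecID - A string that uniquely identifies a mass spectrum.
FragMethod - A string indicating the fragmentation method.
#SpecFile - An int or string that specifies what file the spectrum is from.
Peptide - I would think that this is the same as "peptide" above.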
If this is the case, I would create a LinearPsmDataset using something like:
(Note that I haven't actually run this code, so there might be a typo here or there.)
import mokapot
import pandas as pd
# Load the data from "psms.tsv" into a DataFrame:
df = pd.read_table("psms.tsv")
# Drop the columns I suspect are problematic:
df = df.drop(columns=["QValue", "probability"])
# Charge is often better as a one-hot encoded feature
# (the column is named "Charge" in the list above):
charge_feat = pd.get_dummies(df["Charge"], prefix="charge")
df = pd.concat([df, charge_feat], axis=1)
# Create the dataset:
psms = mokapot.LinearPsmDataset(
    psms=df,  # The dataframe
    target_column="target_column",
    spectrum_columns=("scan", "SpecID", "#SpecFile"),
    peptide_column="peptide",
    protein_column="Protein",
    feature_columns=(list(charge_feat.columns) + [
        "IsotopeError",
        "PrecursorError(ppm)",
        "DeNovoScore",
        "MSGFScore",
        "SpecEValue",
    ]),
)
However, what might be easier is to use the MSGF+ converter, msgf2pin, that is provided by Percolator. I have plans to add native support for MSGF+ to mokapot down the road, but it will be a while.
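Since the column names above are guesses, a quick check before building the dataset can catch mismatches such as `charge` versus `Charge`. The sketch below is an illustrative addition: `missing_columns` is a hypothetical helper, not part of mokapot or pandas, and it only compares names.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    present = set(available)
    return [name for name in required if name not in present]


# Column names as listed in this thread (after dropping QValue and probability).
available = [
    "scan", "Precursor", "IsotopeError", "PrecursorError(ppm)", "Charge",
    "DeNovoScore", "MSGFScore", "SpecEValue", "EValue", "peptide",
    "target_column", "Protein", "SpecID", "FragMethod", "#SpecFile", "Peptide",
]
# Names a dataset definition might ask for; "charge" is deliberately misspelled.
required = ["scan", "SpecID", "#SpecFile", "peptide", "Protein", "charge"]

print(missing_columns(available, required))  # ['charge']
```

With a real table, `missing_columns(df.columns, required)` works the same way, because iterating over a pandas column index yields the column names.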
-
I will try that. Thank you so much!
On Mar 25, 2021, at 11:42 AM, Will Fondrie wrote:
Awesome - I'm glad it's working for you. I just have a few pieces of advice:
We have kept: scan, Precursor, IsotopeError, PrecursorError(ppm), Charge, DeNovoScore, MSGFScore, SpecEValue, EValue, QValue, probability, peptide, target_column
We dropped: Protein, SpecID, FragMethod, #SpecFile, Peptide
It is difficult to comment without knowing what each column contains. My guess is that they are the following:
scan - The integer scan number.
Precursor - A string containing the peptide sequence, modifications, and charge state.
IsotopeError - Integer indicating the isotope that was detected.
PrecursorError(ppm) - The error between the observed and theoretical precursor m/z in ppm.
Charge - The charge state as an integer.
DeNovoScore, MSGFScore, SpecEValue, EValue - Scores from MSGF+.
QValue, probability - Confidence estimates from MSGF+. I would be hesitant to use these as features as they may have used the target and decoy labels for their calculation.
peptide - A string indicating the peptide sequence with modifications.
target_column - A boolean indicating if the peptide is a target or decoy.
Protein - The protein(s) that the peptide may have been generated from (string).
SpecID - A string that uniquely identifies a mass spectrum.
FragMethod - A string indicating the fragmentation method.
#SpecFile - An int or string that specifies what file the spectrum is from.
Peptide - I would think that this is the same as "peptide" above.
If this is the case, I would create a LinearPsmDataset using something like:
(Note that I haven't actually run this code, so there might be a typo here or there.)
import mokapot
import pandas as pd
# Load the data from "psms.tsv" into a DataFrame:
df = pd.read_table("psms.tsv")
# Drop the columns I suspect are problematic:
df = df.drop(columns=["QValue", "probability"])
# Charge is often better as a one-hot encoded feature
# (the column is named "Charge" in the list above):
charge_feat = pd.get_dummies(df["Charge"], prefix="charge")
df = pd.concat([df, charge_feat], axis=1)
# Create the dataset:
psms = mokapot.LinearPsmDataset(
    psms=df,  # The dataframe
    target_column="target_column",
    spectrum_columns=("scan", "SpecID", "#SpecFile"),
    peptide_column="peptide",
    protein_column="Protein",
    feature_columns=(list(charge_feat.columns) + [
        "IsotopeError",
        "PrecursorError(ppm)",
        "DeNovoScore",
        "MSGFScore",
        "SpecEValue",
    ]),
)
However, what might be easier is to use the MSGF+ converter <https://github.com/percolator/percolator/wiki/Interface#converters>, msgf2pin, that is provided by Percolator <http://percolator.ms/>. I have plans to add native support for MSGF+ to mokapot down the road, but it will be a while.
-
Hello,
I am trying to format my PSM files so that I can run them through mokapot, but I am running into a couple of issues. Is it possible to run mokapot directly with a PSM file as input, or do I need to reformat it as a PIN file?
Thank you!