-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gene level parsimony option in protein association dialog #2764
Conversation
chambm
commented
Oct 31, 2023
- added gene level parsimony option in protein association dialog
- bumped schema to 23.11 for gene level parsimony
- updated audit logs
- fixed Skyline build system to copy vendor DLLs to both Debug and Release directories so vendor reader tests don't fail in Debug config
- removed MinPeptidesPerProtein test from TestHugeAssociateProteins because it's now hidden in the Refine -> Associate Proteins dialog
@@ -1276,6 +1276,8 @@ public bool ImportingSearch | |||
|
|||
public static readonly Argument ARG_AP_GROUP_PROTEINS = new Argument(@"associate-proteins-group-proteins", | |||
(c, p) => c.AssociateProteinsGroupProteins = p.IsNameOnly || bool.Parse(p.Value)); | |||
public static readonly Argument ARG_AP_GENE_LEVEL = new Argument(@"associate-proteins-gene-level-parsimony", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a name-only argument, do you think we should have an argument called "associate-proteins-parsimony-level" where the valid values are "protein" (default) or "gene"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that in concert with using a combobox in the Associate Proteins dialog rather than 2 checkboxes. It was easier to implement as just an additional checkbox, and I didn't see a significant reason to prefer a combobox, but if you see one then I'm happy to consider it.
|
||
return new FastaRecord(longestProtein.RecordIndex, 0, | ||
new FastaSequence(longestProtein.Sequence.Name, longestProtein.Sequence.Description, null, | ||
string.Join("", proteinToPeptides.Keys.Select(o => o.Sequence.Sequence))), longestProtein.Metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brendanx67 This is what I came up with to pick/make a protein sequence that contains all the member peptides. It will use the longest sequence of the group if it contains all the peptides, and if it doesn't then it'll concatenate all the protein sequences together. Some alternatives might be: concatenate all the peptides instead of the proteins, or only concatenate enough proteins to contain all the peptides (i.e. a parsimonious approach to concatenating).
Or throw this hacky workaround away and go back to the drawing board, making gene group a new FastaSequenceGroup that allows its peptides to map to only some of the parent's FastaSequenceList. I honestly don't know how much work that would be though.
* bumped schema to 23.12 for gene level parsimony * fixed Skyline build system to copy vendor DLLs to both Debug and Release directories so vendor reader tests don't fail in Debug config * removed MinPeptidesPerProtein test from TestHugeAssociateProteins because it's now hidden in the Refine -> Associate Proteins dialog
712319c
to
e725032
Compare
@brendanx67 Good to go for this PR? |
…ptions * greatly improved speed by using Nick's ReferenceValue for PeptideDocNodes * fixed issues with cancellation in some parsimony calculations
…ocNode" because reference equality was not actually being used. Check for cancellation in loop which digests every protein sequence.
…k/20231025_gene_level_parsimony
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very nice.
Here are some comments.
The only comments that I actually care about are the two where "Distinct" is being called on a list of PeptideDocNode objects.
Feel free to ignore all my other comments.
@@ -561,11 +610,65 @@ public void ApplyParsimonyOptions(bool groupProteins, bool findMinimalProteinLis | |||
_finalResults.FinalSharedPeptideCount = sharedPeptidesRemaining.Values.Sum(); | |||
} | |||
|
|||
private Dictionary<IProteinRecord, PeptideAssociationGroup> CalculateProteinGroups(MappingResultsInternal results, ILongWaitBroker broker) | |||
public class GeneLevelEqualityComparer : EqualityComparer<IProteinRecord> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of writing an EqualityComparer, I probably would have written a new class called "GeneOrProteinSequence" which I used as the Key in the dictionary.
The reason is that it's easier to understand what is going on if the dictionary has a completely different key type, compared to if it has a special EqualityComparer.
But, this way is fine, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After I wrote this, I realized I could have just used the gene name (or protein name if gene isn't set) as the key. Does that seem sensible to you as well?
if (proteinToPeptides.Count == 1) | ||
return proteinToPeptides.Keys.First(); | ||
|
||
var longestProtein = proteinToPeptides.OrderByDescending(kvp2 => kvp2.Key.Sequence.Sequence.Length).First().Key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is checking whether the longest protein sequence contains all of the peptide sequences.
It's true that the longest protein sequence is the one that is most likely to contain all of the peptide sequences, but it is theoretically possible that a shorter protein sequence in the group might happen to contain all the peptide sequences.
(This probably doesn't matter).