Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gene level parsimony option in protein association dialog #2764

Merged
merged 7 commits into from
Jan 18, 2024

Conversation

chambm
Copy link
Member

@chambm chambm commented Oct 31, 2023

  • added gene level parsimony option in protein association dialog
  • bumped schema to 23.11 for gene level parsimony
  • updated audit logs
  • fixed Skyline build system to copy vendor DLLs to both Debug and Release directories so vendor reader tests don't fail in Debug config
  • removed MinPeptidesPerProtein test from TestHugeAssociateProteins because it's now hidden in the Refine -> Associate Proteins dialog

@@ -1276,6 +1276,8 @@ public bool ImportingSearch

public static readonly Argument ARG_AP_GROUP_PROTEINS = new Argument(@"associate-proteins-group-proteins",
(c, p) => c.AssociateProteinsGroupProteins = p.IsNameOnly || bool.Parse(p.Value));
public static readonly Argument ARG_AP_GENE_LEVEL = new Argument(@"associate-proteins-gene-level-parsimony",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a name-only argument, do you think we should have an argument called "associate-proteins-parsimony-level" where the valid values are "protein" (default) or "gene"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that in concert with using a combobox in the Associate Proteins dialog rather than 2 checkboxes. It was easier to implement as just an additional checkbox, and I didn't see a significant reason to prefer a combobox, but if you see one then I'm happy to consider it.

pwiz_tools/Skyline/CommandArgs.cs Show resolved Hide resolved

return new FastaRecord(longestProtein.RecordIndex, 0,
new FastaSequence(longestProtein.Sequence.Name, longestProtein.Sequence.Description, null,
string.Join("", proteinToPeptides.Keys.Select(o => o.Sequence.Sequence))), longestProtein.Metadata);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brendanx67 This is what I came up with to pick/make a protein sequence that contains all the member peptides. It will use the longest sequence of the group if it contains all the peptides, and if it doesn't then it'll concatenate all the protein sequences together. Some alternatives might be: concatenate all the peptides instead of the proteins, or only concatenate enough proteins to contain all the peptides (i.e. a parsimonious approach to concatenating).

Or throw this hacky workaround away and go back to the drawing board, making gene group a new FastaSequenceGroup that allows its peptides to map to only some of the parent's FastaSequenceList. I honestly don't know how much work that would be though.

@chambm chambm linked an issue Nov 1, 2023 that may be closed by this pull request
chambm added 2 commits January 3, 2024 12:57
* bumped schema to 23.12 for gene level parsimony
* fixed Skyline build system to copy vendor DLLs to both Debug and Release directories so vendor reader tests don't fail in Debug config
* removed MinPeptidesPerProtein test from TestHugeAssociateProteins because it's now hidden in the Refine -> Associate Proteins dialog
@chambm chambm force-pushed the Skyline/work/20231025_gene_level_parsimony branch from 712319c to e725032 Compare January 3, 2024 18:02
@chambm
Copy link
Member Author

chambm commented Jan 3, 2024

@brendanx67 Good to go for this PR?

chambm and others added 3 commits January 9, 2024 16:55
…ptions

* greatly improved speed by using Nick's ReferenceValue for PeptideDocNodes
* fixed issues with cancellation in some parsimony calculations
…ocNode" because reference equality was not actually being used.

Check for cancellation in loop which digests every protein sequence.
Copy link
Contributor

@nickshulman nickshulman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice.

Here are some comments.
The only comments that I actually care about are the two where "Distinct" is being called on a list of PeptideDocNode objects.
Feel free to ignore all my other comments.

pwiz_tools/Skyline/EditUI/AssociateProteinsDlg.cs Outdated Show resolved Hide resolved
pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs Outdated Show resolved Hide resolved
pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs Outdated Show resolved Hide resolved
@@ -561,11 +610,65 @@ public void ApplyParsimonyOptions(bool groupProteins, bool findMinimalProteinLis
_finalResults.FinalSharedPeptideCount = sharedPeptidesRemaining.Values.Sum();
}

private Dictionary<IProteinRecord, PeptideAssociationGroup> CalculateProteinGroups(MappingResultsInternal results, ILongWaitBroker broker)
public class GeneLevelEqualityComparer : EqualityComparer<IProteinRecord>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of writing an EqualityComparer, I probably would have written a new class called "GeneOrProteinSequence" which I used as the Key in the dictionary.
The reason is that it's easier to understand what is going on if the dictionary has a completely different key type, compared to if it has a special EqualityComparer.

But, this way is fine, too.

Copy link
Member Author

@chambm chambm Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After I wrote this, I realized I could have just used the gene name (or protein name if gene isn't set) as the key. Does that seem sensible to you as well?

pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs Outdated Show resolved Hide resolved
pwiz_tools/Skyline/Model/Proteome/ProteinAssociation.cs Outdated Show resolved Hide resolved
if (proteinToPeptides.Count == 1)
return proteinToPeptides.Keys.First();

var longestProtein = proteinToPeptides.OrderByDescending(kvp2 => kvp2.Key.Sequence.Sequence.Length).First().Key;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is checking whether the longest protein sequence contains all of the peptide sequences.
It's true that the longest protein sequence is the one that is most likely to contain all of the peptide sequences, but it is theoretically possible that a shorter protein sequence in the group might happen to contain all the peptide sequences.
(This probably doesn't matter).

@chambm chambm merged commit 072abd4 into master Jan 18, 2024
11 checks passed
@chambm chambm deleted the Skyline/work/20231025_gene_level_parsimony branch January 18, 2024 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skyline: .wiff tests failing - is this expected?
2 participants