Protein comparative tool #9
Proteins downloaded from the ncbi-asn1/protein_fasta repo have organism names in [square brackets] at the end of the header. These appear to be unique for phage names, e.g. [Mycobacterium phage Aminay], [Serratia phage Muldoon]. That makes the bracketed name a good candidate to parse out as an organism ID. FASTA headers from the protein_fasta repo have this form:
Also, the ability to read BLAST XML output doesn't work. I think Anthony wrote it to recognize only XML2, which only our own Galaxy was set up to produce. If the tool could parse data out of XML output, that might make it more portable. Here is 1 hit; the [organism] would show up in the <Hit_def> field. The tool needs to count each protein hit only once (i.e., if your phage protein query hit a subject protein in 3 HSPs, that would count as only 1 hit).
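A minimal sketch of the "count each hit once" requirement for the legacy BLAST XML format, not the tool's actual code. It assumes the standard `Iteration`, `Iteration_query-def`, `Hit`, `Hit_def`, and `Hit_accession` elements, and that the organism is the last [bracketed] field of `Hit_def`. The function name `count_hits_per_organism` is hypothetical. Because it walks `Hit` elements and never looks at `Hsp`, a subject protein with 3 HSPs is still counted once per query.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the organism is the last [bracketed] field of the hit definition.
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def count_hits_per_organism(xml_text):
    """Return {organism: number of unique (query, subject) pairs}."""
    root = ET.fromstring(xml_text)
    pairs = defaultdict(set)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        # One <Hit> per subject protein; its HSPs are nested below it,
        # so iterating hits (not HSPs) counts each subject once.
        for hit in iteration.iter("Hit"):
            match = ORGANISM_RE.search(hit.findtext("Hit_def", default=""))
            if match is None:
                continue  # no bracketed organism in this header
            accession = hit.findtext("Hit_accession", default="")
            pairs[match.group(1)].add((query, accession))
    return {organism: len(seen) for organism, seen in pairs.items()}
```

If a `Hit_def` carries several merged titles, this sketch only picks up the last bracketed name, which may or may not be the behavior wanted here.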
Example blastp output, which would be the input for the tool: Galaxy6-[blastp_Peptide_sequences_from_Apollo_vs___protein_BLAST_database_from_data_3__].txt
@jasonjgill I have made a new tool with the following output from your above dataset:
I reduced the complexity of the existing tool to more or less the "group the hits by name" requirement. When you confirm this looks "correct", I will move on and finish the wrapping. Q: How/What is it doing:
Is the input for this just the protein FASTA file? I can gin up a test organism with a known output to validate it.
No, it is the file above (Galaxy6). Yeah, let me know.
Addresses the need to group by the organism found in the brackets. It does the following:
* Reads a tab-delimited BLAST output file.
* Extracts organism names from the subject titles (text in square brackets).
* Counts unique query proteins that matched each organism and unique hit proteins from each organism.
* Sorts and displays results based on either unique queries or unique hits.
* The output is a formatted table showing the top N organisms with the most matches.
[#9]
Makes rank "1-based" instead of "0-based" [#9]
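The steps in the two commit messages above can be sketched roughly as follows. This is an illustration under assumptions, not the wrapped tool's code: the column positions (`query_col`, `subject_col`, `title_col`) are hypothetical defaults and depend on which tabular columns the BLAST job was asked to emit, and the names `extract_organism` and `rank_organisms` are invented for this example.

```python
import re
from collections import defaultdict

# Assumption: the organism is the last [bracketed] field of the subject title.
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def extract_organism(title):
    """Return the trailing bracketed organism name, or None if absent."""
    match = ORGANISM_RE.search(title)
    return match.group(1).strip() if match else None


def rank_organisms(lines, query_col=0, subject_col=1, title_col=2,
                   top_n=5, sort_by="queries"):
    """Group tabular BLAST rows by organism and return 1-based ranked rows.

    Each returned row is (rank, organism, unique_queries, unique_hits).
    """
    queries = defaultdict(set)
    hits = defaultdict(set)
    needed = max(query_col, subject_col, title_col)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= needed:
            continue  # skip blank or truncated rows
        organism = extract_organism(fields[title_col])
        if organism is None:
            continue
        # Sets make repeated rows (e.g. several HSPs) count only once.
        queries[organism].add(fields[query_col])
        hits[organism].add(fields[subject_col])
    rows = [(org, len(queries[org]), len(hits[org])) for org in queries]
    key_index = 1 if sort_by == "queries" else 2
    rows.sort(key=lambda row: (-row[key_index], row[0]))
    return [(rank, org, n_queries, n_hits)
            for rank, (org, n_queries, n_hits)
            in enumerate(rows[:top_n], start=1)]
```

Using `enumerate(..., start=1)` is what makes the rank 1-based; ties are broken alphabetically by organism name, which is a choice made for this sketch only.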
Hey Curtis, that output looks correct based on the inputs I used. Can you wrap it? Also, was working with XML (XML1) an option? If not, that option should be removed from the wrapper.
It is wrapped 👍 (part of this work), and I will clean it up.
From Jason:
Is the tool very specific for only the NCBI DB format?
Problem: is organism reliably identified in the headers of proteins?
Retool for Uniprot protein headers? These seem to always contain organism info in header
Headers from the CPT Galaxy databases (retrieved from the NCBI ftp repo??) work with the tool. They contain explicit organism names as the last field of the header, in [square brackets].
It is possible to download only phage protein datasets from NCBI:
Will try this out on usegalaxy.eu
This is a good tool for phages with little to no DNA identity, still useful
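On the UniProt question in the notes above: UniProt FASTA headers carry the organism in an `OS=` field, followed by fields such as `OX=`, `GN=`, `PE=`, and `SV=`. A rough sketch of pulling that name out is below, assuming this header layout; the function name `uniprot_organism` and the example header values are hypothetical, and nothing here reflects what the tool currently does.

```python
import re

# Capture everything after "OS=" up to the next known field tag or end of line.
UNIPROT_OS_RE = re.compile(r"\bOS=(.+?)(?=\s+(?:OX|GN|PE|SV)=|\s*$)")


def uniprot_organism(header):
    """Return the OS= organism name from a UniProt-style header, or None."""
    match = UNIPROT_OS_RE.search(header)
    return match.group(1) if match else None


# Hypothetical header, for illustration only.
example = ">tr|A0A000|A0A000_9CAUD Terminase OS=Serratia phage Muldoon OX=0 PE=4 SV=1"
print(uniprot_organism(example))
```

A tool supporting both sources could try the bracketed NCBI form first and fall back to `OS=` when no brackets are present.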