Protein comparative tool #9

curtisim0 · 2024-06-07T21:25:51Z

From Jason:

Is the tool very specific for only the NCBI DB format?

Problem: is the organism reliably identified in the protein headers?

Retool for UniProt protein headers? These seem to always contain organism info in the header.

“OS=” is organism scientific name, may not be unique if it is only a species name
“OX=” is NCBI taxid, may not be unique if it only points to species
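
A minimal sketch of pulling the `OS=` and `OX=` fields out of a UniProt-style header, assuming the usual layout in which `OX=` directly follows the organism name. The function name and the example header (accession, protein name, taxid) are hypothetical and only for illustration.

```python
import re

# Assumed layout: "... OS=<organism name> OX=<taxid> ... PE=<n> SV=<n>"
_OS_RE = re.compile(r"OS=(.+?)\s+OX=")
_OX_RE = re.compile(r"OX=(\d+)")


def parse_uniprot_organism(header):
    """Return (organism_name, taxid) from a UniProt-style header, or None for a missing field."""
    os_match = _OS_RE.search(header)
    ox_match = _OX_RE.search(header)
    name = os_match.group(1) if os_match else None
    taxid = int(ox_match.group(1)) if ox_match else None
    return name, taxid


# Hypothetical header, made up for this example.
example = ">tr|A0A000|A0A000_BPXXX Capsid protein OS=Escherichia phage Example OX=12345 PE=4 SV=1"
print(parse_uniprot_organism(example))
```

As noted above, neither field is guaranteed to be unique per phage if it only resolves to species level, so the pair would still need checking before being used as a grouping key.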

Headers from the CPT Galaxy databases (retrieved from the NCBI ftp repo??) work with the tool. They contain explicit organism names as the last field in the header, in [square brackets].

It is possible to download only phage protein datasets from NCBI:

https://ftp.ncbi.nlm.nih.gov/ncbi-asn1/protein_fasta/gbphg*.fsa_aa.gz

Will try this out on usegalaxy.eu

This is a good tool for comparing phages with little to no DNA identity, so it is still useful.

jasonjgill · 2024-06-07T22:28:49Z

Proteins downloaded from the ncbi-asn1/protein_fasta repo have organism names in [square brackets] at the end of the header. These appear to be unique for phage names e.g., [Mycobacterium phage Aminay], [Serratia phage Muldoon]. This could be a good bet for parsing out as an organism ID. FASTA headers from the protein_fasta repo have this form:

>protein_accession protein name [organism name]
MXXXXXXXXX

>AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]
MAEQNWERRILQSFGANFRLDVSNPQKTVGGEDVYNFYSVTDEEKVCLMGQQQDGLWRLYNDDKVEIVGG
AKVVEDGVCVTIVGKNGDVVINADNNGRVRIRGQNINLQADEDVNITAGRNVNIKSGSGRTLLAGNTLEK
DALKGNLLDPEKQWAWRVFEGTGLPAGMFPQLMSPFSGITDLAGSIVGGVGFGDAISGAVSSAVSGAVSG

>WPJ71242.1 RNA polymerase sigma factor for late transcription [Escherichia phage vB-Eco-KMB39]
MSETKPKYNYVNNKELLQAIIDWKTELANNKDPNKVVRQNDTIGLAIMLIAEGLSKRFNFSGYTQSWKQE
MIADGIEASIKGLHNFDETKYKNPHAYITQACFNAFVQRIKKERKEVAKKYSYFVHNVYDSRDDDMVALV
DETFIQDIYDKMTHYEESTYRTPGAEKKSVVDDSPSLDFLYEAND
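
A minimal sketch of parsing the organism out of the trailing [square brackets], assuming the organism is always the last bracketed field in the header. The function name is hypothetical; this is not necessarily how the tool does it.

```python
import re

# Match only a bracketed field that ends the header, so brackets that
# appear earlier in the protein name are not mistaken for the organism.
_TRAILING_BRACKET_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def organism_from_header(header):
    """Return the text inside the final [brackets] of a header, or None if there is none."""
    match = _TRAILING_BRACKET_RE.search(header)
    return match.group(1).strip() if match else None


print(organism_from_header(
    "AIX32998.1 hypothetical protein Syn7803US50_14 [Synechococcus phage ACG-2014f]"
))
print(organism_from_header(
    "WPJ71242.1 RNA polymerase sigma factor for late transcription [Escherichia phage vB-Eco-KMB39]"
))
```

Headers without a trailing bracketed field return `None`, which lets the caller decide whether to skip those records or report them.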

jasonjgill · 2024-06-07T22:40:49Z

Also, the ability to read BLAST XML output doesn't work. I think Anthony wrote it to only recognize XML2, which only our own Galaxy was set to make. If the tool could parse data out of XML output, that might make it more portable. Here is 1 hit; the [organism] would show up in the <Hit_def> field. The tool needs to count each protein hit only 1 time (i.e., if your phage protein query hit a subject protein in 3 HSPs, that would only count as 1 hit).


<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>
<Iteration_query-def>99974bdd-2c83-421a-b7a3-dab7d44153ab (mRNA) 270 residues [Milagro:3193-4021 + strand] [peptide] name=Milagro.orf00003-00001-00001</Iteration_query-def>
<Iteration_query-len>270</Iteration_query-len>
<Iteration_hits>

<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|1190084</Hit_id>
<Hit_def>UNY41722.1 capsid scaffolding protein [Burkholderia phage Milagro]</Hit_def>
<Hit_accession>1190084</Hit_accession>
<Hit_len>270</Hit_len>
<Hit_hsps>

<Hsp_num>1</Hsp_num>
<Hsp_bit-score>551.206</Hsp_bit-score>
<Hsp_score>1419</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>270</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>270</Hsp_hit-to>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>0</Hsp_hit-frame>
<Hsp_identity>270</Hsp_identity>
<Hsp_positive>270</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>270</Hsp_align-len>
<Hsp_qseq>MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC</Hsp_qseq>
<Hsp_hseq>MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC</Hsp_hseq>
<Hsp_midline>MATNKTKFFRVAVEGATVDGREIKREWLTQMAKNYNRELYGARLNIEHLKGWAPLSATNPFGAYGDVIALKASEIEDGPLKGKMGLYAQLDPTDELVALSKKRQKVFTSIEVNPDFADIGEAYLVGLAATDDPASLGTEALQFAARRSNNLFSAACETSIEFEGEPESTSLLSIVKGMFARNRSTDDQRDADVRHAVEEIAGFASQQGRDVAALRVDLTAAQQDAAAAKKRADEAVAAVEALTAKLSATDNGAPRRQPSTGSTGELVTDC</Hsp_midline>

</Hit_hsps>
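
A minimal sketch of the "count each hit once" requirement for this XML layout, using only the Python standard library. It assumes the standard nesting (`Iteration` > `Iteration_hits` > `Hit` > `Hit_hsps` > `Hsp`) and that the first token of `<Hit_def>` is the protein accession. The embedded sample is a trimmed, made-up document with three HSPs for one hit; the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed illustration only: one query, one hit, three HSPs.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>Milagro.orf00003</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>UNY41722.1 capsid scaffolding protein [Burkholderia phage Milagro]</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_num>1</Hsp_num></Hsp>
            <Hsp><Hsp_num>2</Hsp_num></Hsp>
            <Hsp><Hsp_num>3</Hsp_num></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

_ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def count_hits_by_organism(xml_text):
    """Count distinct (query, subject) pairs per organism, ignoring how many HSPs each has."""
    root = ET.fromstring(xml_text)
    pairs = defaultdict(set)  # organism -> {(query, subject accession)}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="")
        for hit in iteration.iter("Hit"):
            hit_def = hit.findtext("Hit_def", default="")
            match = _ORGANISM_RE.search(hit_def)
            if not match:
                continue  # no trailing [organism]; skipped in this sketch
            accession = hit_def.split()[0]  # assumed: first token is the accession
            pairs[match.group(1)].add((query, accession))
    return {organism: len(seen) for organism, seen in pairs.items()}


print(count_hits_by_organism(SAMPLE_XML))
```

Because the loop iterates over `Hit` elements and never descends into `Hsp`, a subject matched through several HSPs contributes a single (query, subject) pair.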

jasonjgill · 2024-06-28T19:33:43Z

Galaxy17-[Top_BlastP_Hits].txt

jasonjgill · 2024-06-28T19:34:11Z

Galaxy16-[Galaxy2-[BLASTp_all_phages_comparison].tabular].txt

jasonjgill · 2024-06-28T20:42:26Z

example blastp output which would be input for the tool:

Galaxy6-[blastp_Peptide_sequences_from_Apollo_vs___protein_BLAST_database_from_data_3__].txt

curtisim0 · 2024-07-14T01:26:24Z

@jasonjgill I have made a new tool with the following output from your above dataset:

❯ python protein_blast_grouping.py test-data/blast-input.txt --hits 20
# Top 20 Hits
# Name                                             Unique Query Matches      Unique Subject Hits      
Burkholderia phage Milagro                         47                        48                       
Burkholderia phage Momento                         41                        45                       
Burkholderia phage Musica                          39                        42                       
Burkholderia phage Menos                           39                        40                       
Burkholderia phage KL3                             38                        39                       
Burkholderia phage PhiBP82.2                       35                        35                       
Burkholderia phage PhiBP82.3                       34                        34                       
Burkholderia phage phiE202                         34                        34                       
Burkholderia phage phiE094                         34                        34                       
Burkholderia phage phiX216                         33                        33                       
Burkholderia phage phiE52237                       33                        33                       
Burkholderia phage AP3                             33                        33                       
Burkholderia phage Carl1                           33                        34                       
Burkholderia phage Mana                            33                        34                       
Burkholderia phage vB_HM387                        32                        32                       
Burkholderia phage BEK                             31                        31                       
Burkholderia phage KS5                             31                        32                       
Ralstonia phage RsoM1USA                           28                        28                       
Ralstonia phage RSA1                               28                        29                       
Burkholderia phage PK23                            26                        27

I reduced the complexity from the existing tool to more-or-less the "group the hits by name" requirement.

When you confirm this looks "correct", I will move on and finish the wrapping.

Q: How/What is it doing:
"Unique Query Matches" tells you how many of your query proteins had at least one match in each organism.
"Unique Subject Hits" tells you how many unique proteins from each organism were matched by any of your queries.

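A minimal sketch of how these two counts can be derived, assuming each BLAST row has already been reduced to a (query ID, subject ID, organism) tuple. Which tabular columns hold those values depends on how the BLAST output was generated, so that mapping is left out here. The function name and sample rows are hypothetical.

```python
from collections import defaultdict


def group_hits(rows):
    """Return {organism: (unique query matches, unique subject hits)}."""
    queries = defaultdict(set)   # organism -> query proteins with at least one match
    subjects = defaultdict(set)  # organism -> subject proteins matched by any query
    for query_id, subject_id, organism in rows:
        queries[organism].add(query_id)
        subjects[organism].add(subject_id)
    return {org: (len(queries[org]), len(subjects[org])) for org in queries}


# Made-up rows for illustration.
rows = [
    ("q1", "s1", "Phage A"),
    ("q1", "s1", "Phage A"),  # repeated row (e.g. a second HSP) is not double counted
    ("q1", "s2", "Phage A"),  # same query, second subject protein
    ("q2", "s3", "Phage A"),
    ("q2", "s9", "Phage B"),
]
print(group_hits(rows))
```

Using sets means repeated rows for the same pair are absorbed automatically, and a query that matches two proteins in one organism raises the subject count without raising the query count, which is how the second column can exceed the first.
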
jasonjgill · 2024-07-18T00:59:14Z

Is the input for this just the protein FASTA file? I can gin up a test organism with a known output to validate it

curtisim0 · 2024-07-19T03:55:00Z

No, it is the file above (Galaxy6)

Yeah, let me know

Addresses the need to group by the organism found in the brackets. It does the following:

* Reads a tab-delimited BLAST output file.
* Extracts organism names from the subject titles (text in square brackets).
* Counts unique query proteins that matched each organism and unique hit proteins from each organism.
* Sorts and displays results based on either unique queries or unique hits.
* The output is a formatted table showing the top N organisms with the most matches.

[#9]

[#9]

Makes rank "1-based" instead of "0-based" [#9]
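
A minimal sketch of the sort, top-N cut, and 1-based rank described in these commit messages, assuming the per-organism counts are held as `{organism: (unique_queries, unique_hits)}`. The function name, the `sort_by` parameter, and the sample counts are hypothetical; only the idea of a top-N limit comes from the `--hits` option shown earlier.

```python
def top_organisms(counts, hits=20, sort_by="queries"):
    """Return [(rank, organism, unique_queries, unique_hits)] for the top `hits` organisms."""
    index = 0 if sort_by == "queries" else 1
    ranked = sorted(counts.items(), key=lambda item: item[1][index], reverse=True)
    # enumerate(..., start=1) gives a 1-based rank instead of a 0-based one.
    return [
        (rank, organism, queries, subject_hits)
        for rank, (organism, (queries, subject_hits)) in enumerate(ranked[:hits], start=1)
    ]


# Made-up counts for illustration.
counts = {"Phage A": (2, 3), "Phage B": (5, 1), "Phage C": (4, 4)}
for row in top_organisms(counts, hits=2):
    print("\t".join(str(field) for field in row))  # tab-separated output
```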

jasonjgill · 2024-08-16T22:39:14Z

Hey Curtis, that output looks correct based on the inputs I used. Can you wrap it? Also, is working with XML (XML1) an option? If not, that option should be removed from the wrapper.

curtisim0 · 2024-08-19T22:30:20Z

It is wrapped 👍 (part of this work)

And I will clean it up

curtisim0 mentioned this issue Jul 24, 2024

i9 // Parsing Improvements for BLASTp #24

Merged

curtisim0 added a commit that referenced this issue Aug 8, 2024

Writes to file instead of stdout

6dde4ec

[#9]

curtisim0 added a commit that referenced this issue Aug 8, 2024

Adds missing python call

e3582c9

[#9]

curtisim0 added a commit that referenced this issue Aug 8, 2024

Adds missing headers for file output

120ca2f

[#9]

curtisim0 added a commit that referenced this issue Aug 8, 2024

Uses tabs

1098c37

[#9]

curtisim0 added a commit that referenced this issue Aug 8, 2024

Removes header

12a5896

Makes rank "1-based" instead of "0-based" [#9]
