Change file input and output from sanitized organism name to (sanitized) organism ID #48

JamesRH · 2013-05-22T17:05:24Z

Besides the changes we discussed in replaceOrgWithAbbrev.py, other files use organism names in their output or input.

In src/makeCoreClusterAnalysisTree.py, the input and output use sanitized organism names:
"The input MUST be a Newick file with organism IDs REPLACED with their names"
"WARNING: Organism name %s in the database was not found in the provided tree. It will be deleted!!\n" %(collist[ii]))

The description and the header comment in this file conflict about the function of the script:
src/db_getBlastResultsBetweenSpecificGenes.py
description = "Given list of genes to match, returns a list of BLAST results between genes in the list only"

whereas the header comment says:
Provide a list of organisms to match [can match any portion of the organism so if you give it just "mazei" it will return to you a list of Methanosarcina mazei]
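
For reference, the "match any portion of the organism" behavior in that header comment amounts to a substring test. A minimal sketch, assuming a plain in-memory list of names rather than whatever database query the script really runs; `match_organisms` and the sample names are hypothetical and only illustrate the idea:

```python
def match_organisms(fragments, organism_names):
    """Return the names that contain at least one fragment (case-sensitive)."""
    return [name for name in organism_names
            if any(frag in name for frag in fragments)]


# Hypothetical organism names, for illustration only
names = [
    "Methanosarcina mazei Go1",
    "Methanosarcina acetivorans C2A",
    "Escherichia coli K-12",
]
matches = match_organisms(["mazei"], names)
print(matches)
```

A gene-based script would presumably match on gene IDs instead, which is why the two descriptions cannot both be right.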

I think this comes from duplication between these scripts:
src/db_getBlastResultsBetweenSpecificGenes.py src/db_getBlastResultsBetweenSpecificOrganisms.py

Other scripts to check for use of the organism name or ID:
db_findClustersByOrganismList.py
db_getOrganismsInClusterRun.py
db_getOrganismsInCluster.py
db_addOrganismNameToTable.py
db_bidirectionalBestHits.py
db_TBlastN_wrapper.py

We discussed keeping the library functions, but another way to find the dependencies is to see what calls these library functions:
lib/TreeFuncs.py: '''Parse a node name into an organism ID.
lib/ClusterFuncs.py: Given an organism name, return the ID for that organism name.
lib/CoreGeneFunctions.py: The return object is a list of (runid, clusterid, organism) tuples sorted by run ID then by cluster ID.'''
lib/CoreGeneFunctions.py:def findGenesByOrganismList(orglist
lib/CoreGeneFunctions.py: The organisms in "orglist" are considered the "ingroup"

JamesRH · 2013-05-22T19:10:34Z

Currently db_getOrganismsInClusterRun.py and db_getOrganismsInCluster.py
return unsanitized organism names.

mattb112885 · 2013-05-29T19:52:39Z

This is a mess but I'll take this suggestion based on our discussion. Then we will just have one script that converts to user-readable IDs at the end, correct? (Also, when we're building figures we should have the option to do that automatically since they're made to be looked at and not computed on)
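
Roughly what that single conversion step could look like, as a sketch only: `ids_to_names`, the ID format, and the sample mapping are all hypothetical, and the real mapping would come from the database.

```python
import re


def ids_to_names(text, id_to_name):
    """Replace every organism ID found in text with its display name."""
    if not id_to_name:
        return text
    # Try longer IDs first so an ID that is a prefix of another cannot win
    ordered = sorted(id_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(org_id) for org_id in ordered))
    return pattern.sub(lambda match: id_to_name[match.group(0)], text)


# Hypothetical IDs and names, for illustration only
mapping = {"org1": "Methanosarcina mazei", "org10": "Methanosarcina acetivorans"}
table = "org1\tcluster5\norg10\tcluster7"
converted = ids_to_names(table, mapping)
print(converted)
```

Everything upstream would keep working on IDs, and only the final output (or a figure) would pass through this step.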

JamesRH · 2013-05-29T21:18:11Z

That sounds good to me. I'll commit my multi-format parser function in /lib/
if it is not already in the pull request.

James H

