Add genome symlink to allow different genome names without mapping #5

brainstorm · 2012-03-04T14:38:41Z

This is meant to address the following situation:

b97pla@f5f59a5

That commit re-downloads the same data into different directories and re-runs all the alignments.

As @b97pla discussed:

(...) what is presented to the user does not need to match what goes into the pipeline (i.e. the user could be presented with e.g. "Arabidopsis thaliana (tair9)" but your application could still write "araTha_tair9" to the csv, which would guarantee no disruption to the pipeline), or am I missing the point?

The problem is that the code pointed to by Valentine is not the only place where this is used, e.g. bcbio/pipeline/alignment.py also uses this value but without having any alias hash. There are of course many places where we can handle this: in the samplesheet generator, in the csv2yaml conversion, with aliases when fetching the reference file or with multiple entries in the reference mapping .loc-file. Implementing an alias hash is probably the most flexible and future-proof solution. I'll log this as an issue.

I've been thinking that defining the following structure in biodata.yml would solve the issue:

genomes:
  - dbkey: araTha_tair9
    name: Arabidopsis thaliana (TAIR9)
  - dbkey: tair9
    name: Arabidopsis thaliana (TAIR9)
    type: symlink_to(araTha_tair9)

Then the alias entry (tair9) would be symlinked to the existing genome directory (araTha_tair9) on the filesystem, instead of re-downloading the same genome.
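To make the idea concrete, here is a rough sketch of how the symlinking step could work. It is only an illustration, not existing code: the `symlink_to(...)` syntax is the proposal above, the function names `alias_targets` and `link_aliases` are made up, and it assumes a flat layout where each genome lives in `<genome_root>/<dbkey>`, which may not match the real directory structure. The genome entries are passed in as already-parsed dictionaries, so no particular YAML loader is assumed.

```python
import os
import re

# Hypothetical pattern for the proposed "type: symlink_to(<dbkey>)" field.
SYMLINK_RE = re.compile(r"^symlink_to\((?P<target>[^)]+)\)$")


def alias_targets(genomes):
    """Map each alias dbkey to the dbkey it should point at."""
    aliases = {}
    for genome in genomes:
        match = SYMLINK_RE.match(genome.get("type", ""))
        if match:
            aliases[genome["dbkey"]] = match.group("target")
    return aliases


def link_aliases(genomes, genome_root):
    """Create <genome_root>/<alias> -> <target> symlinks.

    Assumes a flat <genome_root>/<dbkey> layout (an assumption of this
    sketch). Returns the list of aliases that were newly linked.
    """
    created = []
    for alias, target in sorted(alias_targets(genomes).items()):
        target_dir = os.path.join(genome_root, target)
        alias_path = os.path.join(genome_root, alias)
        if not os.path.isdir(target_dir):
            raise ValueError("target genome %r is not installed" % target)
        if os.path.lexists(alias_path):
            continue  # never overwrite an existing directory or link
        # Relative link, so the genome tree can be moved as a whole.
        os.symlink(target, alias_path)
        created.append(alias)
    return created
```

With the two entries above, `link_aliases(genomes, genome_root)` would leave `tair9` as a link to `araTha_tair9`, so anything that looks up either name finds the same files and indexes. Existing paths are skipped rather than replaced, so a second run is a no-op.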
