The original XMFA output from the MAUVE multiple genome alignment software will not work as-is in ClonalFrame by Xavier Didelot. This page describes how to get an XMFA file that does.
MAUVE is maintained by Aaron Darling ("koadman"). It's great software for aligning MANY bacterial (or smaller) genomes, and it can do large genomes as well. The MAUVE output includes a non-standard XMFA file that is difficult to use for downstream genomic recombination analyses. ClonalFrame and ClonalOrigin, maintained by Xavier Didelot, are a powerful suite of tools for identifying microevolution.
First: Keep all your files in one directory. It's just easier that way.
Second: Get ClonalOrigin and all its associated programs as described HERE.
START:
- Align genomes with Progressive Mauve.
- Using the Progressive Mauve outputs, run stripSubsetLCBs as described here, using the MAUVE output .xmfa and .bbcols files. It should look something like:
** "stripSubsetLCBs full_alignment.xmfa full_alignment.xmfa.bbcols core_alignment.xmfa 500" **
- The new XMFA file written by stripSubsetLCBs (not the original from MAUVE) now has only the CORE alignment region, where all the lines and line lengths are correct.
However, the header lines often have additional information such as genome position numbers that must be removed, leaving only the organism name (e.g. >S. aureus) for downstream analyses.
The following applies if, and ONLY if, your headers look like mine, for example:
>2:3289121-3291310 + e.anophelisNUHP1.fas
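Because the clean-up further down depends on the headers having this exact shape, it may be worth checking a file before editing it. The following is a minimal sketch, not part of the original workflow: the function name `count_unexpected_headers` is hypothetical, and the pattern it tests is an assumption drawn from the single example header above. A name containing spaces is flagged because a greedy "strip everything up to the last space" substitution would truncate it.

```shell
#!/bin/sh
# Count header lines that do NOT match ">N:start-end +/- name".
# Hypothetical helper; the pattern is an assumption based on the one
# example header shown above. "|| true" keeps the exit status at 0
# when the count is zero (grep exits 1 when it selects no lines).
count_unexpected_headers() {
    grep '^>' "$1" | grep -Evc '^> ?[0-9]+:[0-9]+-[0-9]+ [+-] [^ ]+$' || true
}

# Demo on a throwaway file; point the function at your own
# stripSubsetLCBs output instead.
demo=$(mktemp)
printf '%s\n' \
    '>2:3289121-3291310 + e.anophelisNUHP1.fas' \
    'ACGT' \
    '>1:100-2289 - my genome with spaces.fas' \
    'ACGA' \
    '=' > "$demo"
count_unexpected_headers "$demo"   # prints 1: the second name contains spaces
rm -f "$demo"
```

A result of 0 suggests every header follows the expected layout. Anything else is a hint to look at the headers by hand (for example with `grep '^>' core_alignment.xmfa | sort -u`) before going further.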
As shown below, sed can be used in a short script to remove everything from the genome position numbers up to the organism name in the stripSubsetLCBs output XMFA file:
sed -r 's/^>.* />/' your_xmfa_file.xmfa
The resulting header from the example above would now be:
>e.anophelisNUHP1.fas
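To see the substitution in action without touching a real alignment, here is a small sketch that wraps it in a function and feeds it the example header. The function name `strip_header_positions` is hypothetical. The `-r` flag is left out on the assumption that it is not needed here, since the pattern uses no extended-regex features and plain `sed` behaves the same way.

```shell
#!/bin/sh
# Same substitution as the one-liner above, reading stdin and writing
# stdout. Only lines starting with ">" are changed: everything from ">"
# up to the last space on the line is replaced by a bare ">".
strip_header_positions() {
    sed 's/^>.* />/'
}

printf '%s\n' \
    '>2:3289121-3291310 + e.anophelisNUHP1.fas' \
    'ACGTACGT' \
    '=' | strip_header_positions
# prints:
# >e.anophelisNUHP1.fas
# ACGTACGT
# =
```

Note that sed used this way prints to the terminal and leaves the input file unchanged. To keep the result, redirect it to a new file, for example `strip_header_positions < core_alignment.xmfa > core_alignment.clean.xmfa` (the output name here is just an illustration).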
I'm not a great coder so please feel free to improve on this one-liner (thanks Dana). It seems to work for me.
Xavier Didelot suggested the following Perl script based on my initial scripting efforts:
perl -i.bk -wpe's/^>..* -*/>/' your_xmfa_file.xmfa
It is very likely that
perl -i.bk -wpe's/^>.* />/' your_xmfa_file.xmfa
will work just as well.
The next step is to infer the clonal genealogy, which will take several days of compute time and leave you with a consensus tree.
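Since that run is long, it may be worth confirming first that the header clean-up really took effect. The sketch below is one way to do that, under the assumption that a cleaned header contains no whitespace at all. The function name `count_dirty_headers` is hypothetical.

```shell
#!/bin/sh
# Count header lines that still contain whitespace after the clean-up.
# Hypothetical helper; assumes a clean header is ">" followed by a name
# with no spaces. "|| true" keeps the exit status at 0 when the count
# is zero (grep exits 1 when nothing matches).
count_dirty_headers() {
    grep -c '^>.*[[:space:]]' "$1" || true
}

# Demo on a throwaway file standing in for the edited XMFA.
demo=$(mktemp)
printf '%s\n' \
    '>e.anophelisNUHP1.fas' \
    'ACGT' \
    '>3:1-2189 + other.fas' \
    'ACGA' \
    '=' > "$demo"
count_dirty_headers "$demo"   # prints 1: the second header was left untouched
rm -f "$demo"
```

With the Perl versions above, `-i.bk` edits the file in place and keeps the original as `your_xmfa_file.xmfa.bk`, so comparing `grep -c '^>'` on both files is a quick way to confirm that no header lines were lost.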
If you find some of your warg jobs are running slowly, see my additional thoughts on Thoughts_about_slow_warg_jobs in ClonalFrame analyses.