Jose Manuel Martí edited this page Sep 26, 2017 · 30 revisions

Recentrifuge

Robust comparative analysis and contamination removal for metagenomic data

BioRxiv (2017) doi: https://doi.org/10.1101/190934


Welcome to the Recentrifuge wiki!

Installation

Overview

  • Requirements: Python 3.6 or later. Recentrifuge currently uses no modules beyond those in the Python Standard Library.
  • Getting the code: just clone the repo.
  • Getting the databases: in the directory where you cloned the repo, run retaxdump.py. It downloads the required databases from the NCBI servers and unzips them into the taxdump subdirectory.
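
Before running an analysis, it can help to confirm that the taxonomy files are in place. The sketch below is not part of Recentrifuge: missing_taxdump_files is a hypothetical helper, and it only assumes the two files (nodes.dmp and names.dmp) and the default ./taxdump location mentioned in the help text of the -n option.

```python
from pathlib import Path
from typing import List

# Files named in the recentrifuge.py help text for the -n/--nodespath option
REQUIRED_FILES = ('nodes.dmp', 'names.dmp')


def missing_taxdump_files(path: str = './taxdump') -> List[str]:
    """Return the required taxonomy files that are absent from path.

    Hypothetical helper for illustration; not part of Recentrifuge.
    """
    base = Path(path)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == '__main__':
    missing = missing_taxdump_files()
    if missing:
        print(f'Missing in ./taxdump: {", ".join(missing)}')
    else:
        print('Taxonomy files found.')
```

If the list is not empty, running retaxdump.py again should restore the missing files.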

Details

Python versions below 3.6 are not supported, because Recentrifuge uses syntax introduced in Python 3.6, such as variable annotations (PEP 526) and formatted string literals (PEP 498). Type hints were introduced in Python 3.5 (PEP 484), but only with Python 3.6 did they mature to cover variable annotations. Powerful tools for static type analysis in Python have evolved along with these standards, and the development of Recentrifuge includes checks with pylint and mypy. Code that aims to perform robust comparative metagenomic analysis is a very good candidate for robust coding.
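
As a minimal illustration of the two Python 3.6 features named above (this is not code taken from Recentrifuge, and the sample names and counts are made-up values), a script like the following fails with a SyntaxError on Python 3.5 or earlier:

```python
import sys
from typing import Dict

# On interpreters older than 3.6 this file fails to parse before reaching
# this line, because of the annotated assignments and the f-string below.
assert sys.version_info >= (3, 6)

# PEP 526: variable annotations, which tools such as mypy can check
counts: Dict[str, int] = {'S1': 120, 'S2': 87}  # illustrative values only
total: int = sum(counts.values())

# PEP 498: formatted string literals
summary: str = f'{len(counts)} samples, {total} reads'
print(summary)  # 2 samples, 207 reads
```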

Running

Overview

Let's suppose you have cloned the repo in ~/recentrifuge and you would like to analyse and compare the Centrifuge output from samples S1, S2 and S3, for instance. The command would be:

~/recentrifuge/recentrifuge.py -f S1.out -f S2.out -f S3.out

Details

The layout of the Recentrifuge command is:

usage: recentrifuge.py [-h] [-V] (-f FILE | -r FILE) [-v] [-n PATH] [-m INT]
                       [-k] [-o FILE] [-i TAXID] [-x TAXID] [-s SCORING]
                       [--sequential] [-a] [-c]

with the following optional arguments:

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -f FILE, --file FILE  Centrifuge output files (multiple -f is available to
                        include several samples in plot) (default: None)
  -r FILE, --report FILE
                        Centrifuge/Kraken report files (multiple -r is
                        available to include several samples in plot)
                        (default: None)
  -v, --verbose         increase output verbosity (default: 0)
  -n PATH, --nodespath PATH
                        path for the nodes information files (nodes.dmp and
                        names.dmp from NCBI) (default: ./taxdump)
  -m INT, --mintaxa INT
                        minimum taxa to avoid collapsing one level to the
                        parent one (default: 10)
  -k, --nokollapse      show the "cellular organisms" taxon (default: False)
  -o FILE, --outhtml FILE
                        HTML output file (if not given the filename will be
                        inferred from input files) (default: None)
  -i TAXID, --include TAXID
                        NCBI taxid code to include a taxon and all underneath
                        (multiple -i is available to include several taxid).
                        By default all the taxa is considered for inclusion.
                        (default: [])
  -x TAXID, --exclude TAXID
                        NCBI taxid code to exclude a taxon and all underneath
                        (multiple -x is available to exclude several taxid)
                        (default: [])
  -s SCORING, --scoring SCORING
                        type of scoring to be applied, and can be one of
                        ['SHEL', 'LENGTH', 'LOGLENGTH', 'NORMA'] (default:
                        SHEL)
  --sequential          deactivate parallel processing (default: False)
  -a, --avoidcross      avoid cross analysis (default: False)
  -c, --control         take the first sample as negative control (default:
                        False)
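
The -i and -x options act on a taxon "and all underneath", that is, on a whole subtree of the taxonomy. The toy sketch below illustrates that idea on a small hand-made child-to-parent map. It is not Recentrifuge's implementation: the function names, the abbreviated tree (intermediate ranks are omitted) and the order in which inclusion and exclusion are applied are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set

# Toy child -> parent map using NCBI taxids (intermediate ranks omitted)
PARENT: Dict[int, int] = {
    131567: 1,     # cellular organisms -> root
    2: 131567,     # Bacteria
    2759: 131567,  # Eukaryota
    562: 2,        # Escherichia coli
    7711: 2759,    # Chordata
    9606: 7711,    # Homo sapiens
}


def in_subtree(taxid: int, root: int, parent: Dict[int, int]) -> bool:
    """True if taxid is root itself or lies underneath it."""
    while True:
        if taxid == root:
            return True
        if taxid not in parent:
            return False  # reached the top of the tree without meeting root
        taxid = parent[taxid]


def select(taxids: Iterable[int], include: Iterable[int] = (),
           exclude: Iterable[int] = ()) -> Set[int]:
    """Keep taxa under any included taxid (all taxa if none is given),
    then drop taxa under any excluded taxid (assumed precedence)."""
    include, exclude = list(include), list(exclude)
    kept: Set[int] = set()
    for tid in taxids:
        if include and not any(in_subtree(tid, i, PARENT) for i in include):
            continue
        if any(in_subtree(tid, x, PARENT) for x in exclude):
            continue
        kept.add(tid)
    return kept


# Excluding Chordata (7711) also drops Homo sapiens (9606)
print(sorted(select([562, 9606, 7711], exclude=[7711])))  # [562]
```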

For example, to analyse the Centrifuge output:

  • from samples X1 (file X1.nt_mhl30_k1_cf.out), X2 (file X2.nt_mhl30_k1_cf.out) and X3 (file X3.nt_mhl30_k1_cf.out),
  • with negative control CTRL (file CTRL.nt_mhl30_k1_cf.out),
  • but excluding taxa assigned to chordata (taxid 7711), unclassified sequences (taxid 12908) and other sequences (taxid 28384),
  • with the taxonomy files downloaded to /my/tax/dir,
  • and saving the output to Xsamples.rcf.html file,

the command would be:

~/recentrifuge/recentrifuge.py -n /my/tax/dir -c -f CTRL.nt_mhl30_k1_cf.out -f X1.nt_mhl30_k1_cf.out -f X2.nt_mhl30_k1_cf.out -f X3.nt_mhl30_k1_cf.out -x 7711 -x 12908 -x 28384 -o Xsamples.rcf.html
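
When many samples are involved, it may be convenient to assemble such a command from a script. The following sketch rebuilds the example above using only options listed in the help text. build_command is a hypothetical helper, not part of Recentrifuge, and the default script path is an assumption, so pass the real location of recentrifuge.py in your setup.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(files: Sequence[str],
                  script: str = 'recentrifuge.py',  # assumed to be on PATH
                  nodespath: Optional[str] = None,
                  exclude: Sequence[int] = (),
                  outhtml: Optional[str] = None,
                  control: bool = False) -> List[str]:
    """Assemble a recentrifuge.py argument list from documented options."""
    cmd: List[str] = [script]
    if nodespath:
        cmd += ['-n', nodespath]
    if control:
        cmd.append('-c')  # the first -f file is taken as negative control
    for name in files:
        cmd += ['-f', name]
    for taxid in exclude:
        cmd += ['-x', str(taxid)]
    if outhtml:
        cmd += ['-o', outhtml]
    return cmd


suffix = '.nt_mhl30_k1_cf.out'
cmd = build_command(
    files=[s + suffix for s in ('CTRL', 'X1', 'X2', 'X3')],
    nodespath='/my/tax/dir',
    exclude=(7711, 12908, 28384),
    outhtml='Xsamples.rcf.html',
    control=True)

# Print the command in a form that can be pasted into a shell
print(' '.join(shlex.quote(arg) for arg in cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with unusual file names. Note that a leading ~ in the script path is not expanded unless the command goes through a shell or os.path.expanduser.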

Sorry, documentation in preparation. Expect this page to change often to accommodate new material.


If you find the code useful and use it in your research, please consider citing the preprint (https://doi.org/10.1101/190934). Thanks!