Running recentrifuge for Centrifuge

Quick start

Let's suppose you have cloned the repo into ~/recentrifuge and you would like to analyze and compare the Centrifuge output from samples S1, S2 and S3. Provided you have already used retaxdump to populate ./taxdump, the command would be:

~/recentrifuge/recentrifuge.py -f S1.out -f S2.out -f S3.out

Sometimes you have a lot of samples and you just want to "recentrifuge" all of them. If they are in the directory my_outputs_dir, you can achieve that with the following line:

~/recentrifuge/recentrifuge.py -f my_outputs_dir

Details

File format

Recentrifuge reads the direct output from Centrifuge and prints several descriptive statistics. For example, for a rare environmental sample "centrifuged" with the minimum hit length (MHL) set to 35 and "recentrifuged" with the MHL filter set to 50, the Recentrifuge console output is:

Loading output file EnvSample_mhl35_k1_cf.out... OK!
  Seqs read: 2_828_017	[698.21 Mnt]
  Seqs clas: 468_295	(83.44% unclassified)
  Seqs pass: 314_510	(32.84% rejected)
  Scores: min = 50.0, max = 207.3, avr = 105.4
  Length: min = 80 nt, max = 302 nt, avr = 258 nt
  2796 taxa with assigned reads
Building from raw data... EnvSample_mhl35_k1_cf sample OK!
Load elapsed time: 3.6 sec
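
The percentages in this output can be reproduced from the read counts. The short sketch below assumes, based on the arithmetic of the figures above, that the unclassified percentage is relative to the reads read and the rejected percentage is relative to the classified reads; the variable names are illustrative and are not taken from the Recentrifuge code.

```python
# Counts copied from the console output above
seqs_read = 2_828_017   # "Seqs read"
seqs_clas = 468_295     # "Seqs clas"
seqs_pass = 314_510     # "Seqs pass"

# Assumption: unclassified is relative to all reads read,
# rejected is relative to the classified reads.
unclassified_pct = 100 * (1 - seqs_clas / seqs_read)
rejected_pct = 100 * (1 - seqs_pass / seqs_clas)

print(f"{unclassified_pct:.2f}% unclassified")  # 83.44% unclassified
print(f"{rejected_pct:.2f}% rejected")          # 32.84% rejected
```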

Scoring schemes

There are different options to score the reads classified by Centrifuge, which can be selected with the option -s/--scoring. Recentrifuge supports the following scoring schemes for Centrifuge:

SHEL (Single Hit Equivalent Length): This is a score value roughly equivalent to the length in base pairs of a single hit to the database. It is calculated as the square root of the Centrifuge score, plus 15. This is currently the default scoring scheme for Centrifuge data in Recentrifuge.
LENGTH: The score of a read will be its length (or the combined length of mate pairs).
LOGLENGTH: Logarithm (base 10) of the length score.
NORMA: This score is the normalized score SHEL / LENGTH, so it takes into account both the assignment quality and the length of the read.

The last three scoring schemes are very useful when read lengths span several orders of magnitude, as in nanopore sequencing.

For every scoring scheme, the minscore parameter applies to the calculated SHEL score of the read. So, for example, a minscore of 35 (indicated with the -y 35 option) will filter out the same reads regardless of the scoring scheme selected.
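
As an illustration, the scoring schemes and the minscore filter described above could be sketched as follows. This is a minimal sketch based only on the definitions in this section, not the actual Recentrifuge implementation: the function names are hypothetical, and the inclusive comparison in the filter is an assumption (suggested by the minimum score of 50.0 reported in the console output above for an MHL filter of 50).

```python
import math


def shel(centrifuge_score: float) -> float:
    """SHEL: square root of the Centrifuge score, plus 15."""
    return math.sqrt(centrifuge_score) + 15


def score_read(centrifuge_score: float, length: int, scheme: str = "SHEL") -> float:
    """Score a read under one of the schemes described above (hypothetical helper)."""
    if scheme == "SHEL":
        return shel(centrifuge_score)
    if scheme == "LENGTH":
        return float(length)
    if scheme == "LOGLENGTH":
        return math.log10(length)
    if scheme == "NORMA":
        return shel(centrifuge_score) / length
    raise ValueError(f"Unknown scoring scheme: {scheme}")


def passes_minscore(centrifuge_score: float, minscore: float) -> bool:
    """The filter always looks at SHEL, whatever the scoring scheme.

    Assumption: the comparison is inclusive (score >= minscore).
    """
    return shel(centrifuge_score) >= minscore


# A read of 100 nt with a Centrifuge score of 400 has SHEL = sqrt(400) + 15 = 35
for scheme in ("SHEL", "LENGTH", "LOGLENGTH", "NORMA"):
    print(scheme, score_read(400, 100, scheme), passes_minscore(400, 35))
```

In this sketch the read passes a minscore of 35 under every scheme, because the filter depends only on its SHEL value and not on the score that is finally assigned to it.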

Advanced example

Let's review a more complex example in depth. To analyze the Centrifuge output:

from samples X1 (file X1.nt_mhl30_k1_cf.out), X2 (file X2.nt_mhl30_k1_cf.out) and X3 (file X3.nt_mhl30_k1_cf.out),
with ONE negative control (file CTRL.nt_mhl30_k1_cf.out),
but excluding taxa assigned to chordata (taxid 7711), unclassified sequences (taxid 12908) and other sequences (taxid 28384),
with the taxonomy files downloaded to /my/tax/dir,
and saving the output to Xsamples.rcf.html file (and to Xsamples.rcf.xls),

the command would be:

~/recentrifuge/recentrifuge.py -n /my/tax/dir -c 1 -f CTRL.nt_mhl30_k1_cf.out -f X1.nt_mhl30_k1_cf.out -f X2.nt_mhl30_k1_cf.out -f X3.nt_mhl30_k1_cf.out -x 7711 -x 12908 -x 28384 -o Xsamples.rcf.html
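
When the list of samples comes from a script, the same command line can be assembled programmatically. The sketch below uses only the options that appear in the command above (-n, -c, -f, -x and -o). The helper build_args and its parameters are hypothetical, and the sketch assumes, as in the command above, that the control files are listed before the other samples.

```python
import shlex


def build_args(taxdir, controls, samples, excluded_taxids, output):
    """Hypothetical helper: assemble the recentrifuge.py arguments."""
    args = ["-n", taxdir, "-c", str(len(controls))]
    # Assumption: control files go first, followed by the regular samples
    for path in [*controls, *samples]:
        args += ["-f", path]
    for taxid in excluded_taxids:
        args += ["-x", str(taxid)]
    args += ["-o", output]
    return args


args = build_args(
    taxdir="/my/tax/dir",
    controls=["CTRL.nt_mhl30_k1_cf.out"],
    samples=[f"X{i}.nt_mhl30_k1_cf.out" for i in (1, 2, 3)],
    excluded_taxids=[7711, 12908, 28384],  # chordata, unclassified, other sequences
    output="Xsamples.rcf.html",
)

# Print the command instead of running it, so it can be inspected first
print("~/recentrifuge/recentrifuge.py", shlex.join(args))
```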

The complete guide to the recentrifuge options and flags is on the Recentrifuge command line page.

If you use Recentrifuge in your research, please consider citing the paper. Thanks!

Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967
