Skip to content

Running recentrifuge for Centrifuge

Jose Manuel Martí edited this page Jul 9, 2018 · 11 revisions

Quick start

Let's suppose you have cloned the repo in ~/recentrifuge and you would like to analyze and compare the Centrifuge output from samples S1, S2 and S3, for instance. Provided you have already used retaxdump to populate ./taxdump, the command would be:

~/recentrifuge/recentrifuge.py -f S1.out -f S2.out -f S3.out

Sometimes you have a lot of samples and you just want to "recentrifuge" all of them. If they are in the directory my_outputs_dir, you achieve that with the following line:

~/recentrifuge/recentrifuge.py -f my_outputs_dir

Details

File format

Recentrifuge reads the direct output from Centrifuge and shows diverse descriptive statistics. For example, for a rare environmental sample "centrifuged" with minimum hit length (MHL) set to 35 and "recentrifuged" with MHL filtered to 50, the Recentrifuge console output is:

Loading output file EnvSample_mhl35_k1_cf.out... OK!
  Seqs read: 2_828_017	[698.21 Mnt]
  Seqs clas: 468_295	(83.44% unclassified)
  Seqs pass: 314_510	(32.84% rejected)
  Scores: min = 50.0, max = 207.3, avr = 105.4
  Length: min = 80 nt, max = 302 nt, avr = 258 nt
  2796 taxa with assigned reads
Building from raw data... EnvSample_mhl35_k1_cf sample OK!
Load elapsed time: 3.6 sec

Scoring schemes

There are different options to score the reads classified by Centrifuge, which could be selected with the option -s/--scoring. Recentrifuge supports the following scoring schemes for Centrifuge:

  • SHEL (Single Hit Equivalent Length): This is a score value roughly equivalent to the length in pair bases of a single hit to the database. It is calculated as the square root of the Centrifuge score, plus 15. This is currently the default scoring scheme for Centrifuge data in Recentrifuge.
  • LENGTH: The score of a read will be its length (or the combined length of mate pairs).
  • LOGLENGTH: Logarithm (base 10) of the length score.
  • NORMA: This score is the normalized score SHEL / LENGTH, so it takes into account both the assignment quality and the length of the read.

The last three scoring schemes are very useful when there are reads with a diverse order of magnitude in length, like in nanopore sequencing.

For every scoring scheme, the minscore parameter works for the calculated SHEL score of the read. So, for example, a minscore of 35 (indicated with the -y 35 option) will filter the same reads independently of the scoring scheme selected.

Advanced example

Let's review a more complex example in depth: to analyze the Centrifuge output:

  • from samples X1 (file X1.nt_mhl30_k1_cf.out), X2 (file X2.nt_mhl30_k1_cf.out) and X3 (file X3.nt_mhl30_k1_cf.out),
  • with ONE negative control (file CTRL.nt_mhl30_k1_cf.out),
  • but excluding taxa assigned to chordata (taxid 7711), unclassified sequences (taxid 12908) and other sequences (taxid 28384),
  • with the taxonomy files downloaded to /my/tax/dir,
  • and saving the output to Xsamples.rcf.html file (and to Xsamples.rcf.xls),

the command would be:

~/recentrifuge/recentrifuge.py -n /my/tax/dir -c 1 -f CTRL.nt_mhl30_k1_cf.out -f X1.nt_mhl30_k1_cf.out -f X2.nt_mhl30_k1_cf.out -f X3.nt_mhl30_k1_cf.out -x 7711 -x 12908 -x 28384 -o Xsamples.rcf.html

The complete guide to the recentrifuge options and flags is in the Recentrifuge command line page.