-
Notifications
You must be signed in to change notification settings - Fork 7
Running recentrifuge for Centrifuge
Let's suppose you have installed Recentrifuge with pip
or cloned the repo in ~/recentrifuge
, and you would like to analyze and compare the Centrifuge output from samples S1, S2 and S3, for instance. Provided you have already used retaxdump
to populate ./taxdump
, the command would be:
~/recentrifuge/rcf -f S1.out -f S2.out -f S3.out
where you should exclude the ~/recentrifuge/
part for pip
installations in this and future command-lines.
Sometimes you have a lot of samples and you just want to "recentrifuge" all of them. If they are in the directory my_outputs_dir
, you achieve that with the following line:
~/recentrifuge/rcf -f my_outputs_dir
Recentrifuge reads the direct raw output from Centrifuge and shows diverse descriptive statistics. For example, for a rare environmental sample "centrifuged" with minimum hit length (MHL) set to 35 and "recentrifuged" with MHL filtered to 50, the Recentrifuge console output is:
Loading output file EnvSample_mhl35_k1_cf.out... OK!
Seqs read: 2_828_017 [698.21 Mnt]
Seqs clas: 468_295 (83.44% unclassified)
Seqs pass: 314_510 (32.84% rejected)
Scores: min = 50.0, max = 207.3, avr = 105.4
Length: min = 80 nt, max = 302 nt, avr = 258 nt
2796 taxa with assigned reads
Building from raw data... EnvSample_mhl35_k1_cf sample OK!
Load elapsed time: 3.6 sec
Please do not use additional options with Centrifuge (e.g., --verbose
) that affect the output format, because they will add lines that Recentrifuge will not understand. In case you already have a Centrifuge output file generated with the --verbose
flag, you may just remove the extra lines before the header of the output table and Recentrifuge/Rextract will be able to successfully parse it.
There are different options to score the reads classified by Centrifuge, which could be selected with the option -s/--scoring. Recentrifuge supports the following scoring schemes for Centrifuge:
- SHEL (Single Hit Equivalent Length): This is a score value roughly equivalent to the length in pair bases of a single hit to the database. It is calculated as the square root of the Centrifuge score, plus 15. This is currently the default scoring scheme for Centrifuge data in Recentrifuge.
- LENGTH: The score of a read will be its length (or the combined length of mate pairs).
- LOGLENGTH: Logarithm (base 10) of the length score.
- NORMA: This score is the normalized score SHEL / LENGTH in percentage, so it takes into account both the assignment quality and the length of the read.
The last three scoring schemes are very useful when there are reads with a diverse order of magnitude in length, like in nanopore sequencing.
For every scoring scheme, the minscore
parameter works for the calculated SHEL score of the read. So, for example, a minscore
of 35 (indicated with the -y 35
option) will filter the same reads independently of the scoring scheme selected.
Let's review a more complex example in depth: to analyze the Centrifuge output:
- from samples X1 (file X1.nt_mhl30_k1_cf.out), X2 (file X2.nt_mhl30_k1_cf.out) and X3 (file X3.nt_mhl30_k1_cf.out),
- with ONE negative control (file CTRL.nt_mhl30_k1_cf.out),
- but excluding taxa assigned to chordata (taxid 7711), unclassified sequences (taxid 12908) and other sequences (taxid 28384),
- with the taxonomy files downloaded to /my/tax/dir,
- and saving the output to Xsamples.rcf.html file (and to Xsamples.rcf.xls),
the command would be:
~/recentrifuge/rcf -n /my/tax/dir -c 1 -f CTRL.nt_mhl30_k1_cf.out -f X1.nt_mhl30_k1_cf.out -f X2.nt_mhl30_k1_cf.out -f X3.nt_mhl30_k1_cf.out -x 7711 -x 12908 -x 28384 -o Xsamples.rcf.html
The complete guide to rcf
options and flags is in the Recentrifuge command line page.
If you use Recentrifuge in your research, please consider citing the paper. Thanks!
Martí JM (2019) Recentrifuge: Robust comparative analysis and contamination removal for metagenomics. PLOS Computational Biology 15(4): e1006967. https://doi.org/10.1371/journal.pcbi.1006967