Skip to content

Workflow

Robyn Wright edited this page Mar 29, 2025 · 49 revisions

Below is an overview of the PICRUSt2 workflow, which includes example commands for processing 16S sequencing data and getting E.C. number and KEGG ortholog (KO) abundances. The E.C. numbers can then be used to calculate MetaCyc pathway abundances and coverages. Note that there are other gene family databases supported which may be more informative (but which cannot be collapsed to pathways by default). See the side-bar for more details on individual commands.

Please note that as of PICRUSt2-v2.6.0 we have updated the PICRUSt2 database. We now refer to the current database as PICRUSt2-SC and the older database as PICRUSt2-oldIMG. You can see details for running PICRUSt2 v2.6.0 with the old database if you need to do that for comparability with previous studies, but we recommend using the latest database otherwise as it includes more genomes and updated functions. You can see details about the PICRUSt2-SC database here. If you are using an older version of PICRUSt2, the instructions are below.

Note that you can type the option -h to get a description of each below script.

Contents

Version 2.6.0 and later

Run the entire pipeline with the updated PICRUSt2-SC database (v2.6.0)

The entire pipeline can be run with this command:

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1

If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.

Run each step individually

1. Place sequences into reference phylogeny (v2.6.0)

More details

Place sequences into bacterial reference phylogeny:

place_seqs.py -s study_seqs.fna -o bac_placed_seqs.tre -p 1 \
                    -r bacteria \
                    --intermediate placement_working_bac

Place sequences into archaeal reference phylogeny:

place_seqs.py -s study_seqs.fna -o arc_placed_seqs.tre -p 1 \
                    -r archaea \
                    --intermediate placement_working_arc

Note that it is not necessary to run both of these if, for example, you know that your dataset should only contain bacteria.

2. Run hidden-state prediction to get 16S copy numbers and NSTI per predicted genome (v2.6.0)

More details

Note that NSTI values will be added to the 16S prediction table (since the -n option was given).

hsp.py -i 16S -r bacteria -t bac_placed_seqs.tre -o bac_marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i 16S -r archaea -t arc_placed_seqs.tre -o arc_marker_nsti_predicted.tsv.gz -p 1 -n

3. Determine the best domain for each sequence (v2.6.0)

More details)

Now we want to check whether the NSTI was lower for the bacterial or the archaeal references for each of our study sequences:

pick_best_domain.py \
                    -n1 bacteria \
                    -n2 archaea \
                    --tree_dom1 bac_placed_seqs.tre \
                    --tree_dom2 arc_placed_seqs.tre \
                    --tree_out_dom1 bac_reduced_placed_seqs.tre \
                    --tree_out_dom2 arc_reduced_placed_seqs.tre \
                    --nsti_table_dom1 bac_marker_nsti_predicted.tsv.gz \
                    --nsti_table_dom2 arc_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_dom1 bac_reduced_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_dom2 arc_reduced_marker_nsti_predicted.tsv.gz \
                    --nsti_table_out_combined combined_marker_nsti_predicted.tsv.gz

4. Run hidden-state prediction to get EC number and KO abundances (or another trait) per predicted genome (v2.6.0)

More details

hsp.py -i EC -r bacteria -t bac_reduced_placed_seqs.tre -o bac_EC_predicted.tsv.gz -p 1

hsp.py -i KO -r bacteria -t bac_reduced_placed_seqs.tre -o bac_KO_predicted.tsv.gz -p 1

hsp.py -i EC -r archaea -t arc_reduced_placed_seqs.tre -o arc_EC_predicted.tsv.gz -p 1

hsp.py -i KO -r archaea -t arc_reduced_placed_seqs.tre -o arc_KO_predicted.tsv.gz -p 1

5. Combine files from hidden-state prediction (v2.6.0)

More details

combine_domains.py --table_dom1 bac_EC_predicted.tsv.gz --table_dom2 arc_EC_predicted.tsv.gz -o combined_EC_predicted.tsv.gz

combine_domains.py --table_dom1 bac_KO_predicted.tsv.gz --table_dom2 arc_KO_predicted.tsv.gz -o combined_KO_predicted.tsv.gz

6. Predict EC and KO abundances in samples (v2.6.0)

More details (Adjusts gene family abundances by 16S sequence abundance)

At this point, we have combined the predictions that we obtained separately for the bacterial and archaeal domains and are now running this in the same way as we did previously.

metagenome_pipeline.py -i study_seqs.biom \
                       -m combined_marker_nsti_predicted.tsv.gz \
                       -f combined_EC_predicted.tsv.gz \
                       -o EC_metagenome_out

metagenome_pipeline.py -i study_seqs.biom \
                       -m combined_marker_nsti_predicted.tsv.gz \
                       -f combined_KO_predicted.tsv.gz \
                       -o KO_metagenome_out

7. Infer MetaCyc pathway abundances and coverages (based on EC numbers) (v2.6.0)

More details

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate pathways_working \
                    -p 1

8. Add descriptions as new column in gene family and pathway abundance tables (v2.6.0)

More details

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                    -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz

Run the entire pipeline with the previous PICRUSt2-oldIMG database (v2.6.0)

This is essentially using v2.6.0 to replicate the behaviour of earlier versions.

If you want to use the previous oldIMG database, you can run the entire pipeline like this:

picrust2_pipeline_oldIMG.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1

The entire pipeline can be run with this command (details):

picrust2_pipeline_oldIMG.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline_oldIMG.py and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.

Run each step individually

1. Place sequences into reference phylogeny

More details

place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 \
              --intermediate placement_working --ref_dir oldIMG

2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome

More details

Note that NSTI values will be added to the 16S prediction table (since the -n option was given).

hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n -db oldIMG

hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1 -db oldIMG

hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1 -db oldIMG

3. Predict EC and KO abundances in samples

More details

metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f EC_predicted.tsv.gz \
                       -o EC_metagenome_out


metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f KO_predicted.tsv.gz \
                       -o KO_metagenome_out

4. Infer MetaCyc pathway abundances and coverages (based on EC numbers)

More details

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate pathways_working \
                    -p 1 \
                    -db oldIMG

5. Add descriptions as new column in gene family and pathway abundance tables

More details

add_descriptions_oldIMG.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                           -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions_oldIMG.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                           -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions_oldIMG.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                           -o pathways_out/path_abun_unstrat_descrip.tsv.gz

Version 2.5.3 and earlier

Run the entire pipeline (with PICRUSt2-oldIMG)

More details

The entire pipeline can be run with this command:

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.

Run each step individually

1. Place sequences into reference phylogeny

More details

place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 \
              --intermediate placement_working

2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome

More details

Note that NSTI values will be added to the 16S prediction table (since the -n option was given).

hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1

hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1

3. Predict EC and KO abundances in samples

More details (Adjusts gene family abundances by 16S sequence abundance)

metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f EC_predicted.tsv.gz \
                       -o EC_metagenome_out


metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f KO_predicted.tsv.gz \
                       -o KO_metagenome_out

4. Infer MetaCyc pathway abundances and coverages (based on EC numbers)

More details

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate pathways_working \
                    -p 1

5. Add descriptions as new column in gene family and pathway abundance tables

More details

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                           -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                           -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                           -o pathways_out/path_abun_unstrat_descrip.tsv.gz

Shuffling predictions

An optional additional step is to shuffle the ASV labels in the genome prediction tables (i.e. the outputs of hsp.py). Any analyses based on these shuffled tables can then be compared with analyses based on the actual data to check if there is more signal in the unshuffled data. See here for more details.

Clone this wiki locally