-
Notifications
You must be signed in to change notification settings - Fork 108
Workflow
Below is an overview of the PICRUSt2 workflow, which includes example commands for processing 16S sequencing data and getting E.C. number and KEGG ortholog (KO) abundances. The E.C. numbers can then be used to calculate MetaCyc pathway abundances and coverages. Note that there are other gene family databases supported which may be more informative (but which cannot be collapsed to pathways by default). See the side-bar for more details on individual commands.
Please note that as of PICRUSt2-v2.6.0 we have updated the PICRUSt2 database. We now refer to the current database as PICRUSt2-SC and the older database as PICRUSt2-oldIMG. You can see details for running PICRUSt2 v2.6.0 with the old database if you need to do that for comparability with previous studies, but we recommend using the latest database otherwise as it includes more genomes and updated functions. You can see details about the PICRUSt2-SC database here. If you are using an older version of PICRUSt2, the instructions are below.
Note that you can type the option -h
to get a description of each below script.
- Run the entire pipeline with the updated PICRUSt2-SC database
-
Run each step individually
- 1. Place sequences into reference phylogeny
- 2. Run hidden-state prediction to get 16S copy numbers and NSTI per predicted genome
- 3. Determine the best domain for each sequence
- 4. Run hidden-state prediction to get EC number and KO abundances (or another trait) per predicted genome
- 5. Combine files from hidden-state prediction
- 6. Predict EC and KO abundances in samples
- 7. Infer MetaCyc pathway abundances and coverages (based on EC numbers)
- 8. Add descriptions as new column in gene family and pathway abundance tables
- Run the entire pipeline with the previous PICRUSt2-oldIMG database
-
Run each step individually with the previous PICRUSt2-oldIMG database
- 1. Place sequences into reference phylogeny
- 2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome
- 3. Predict EC and KO abundances in samples
- 4. Infer MetaCyc pathway abundances and coverages (based on EC numbers)
- 5. Add descriptions as new column in gene family and pathway abundance tables
- Run the entire pipeline (with PICRUSt2-oldIMG)
-
Run each step individually
- 1. Place sequences into reference phylogeny
- 2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome
- 3. Predict EC and KO abundances in samples
- 4. Infer MetaCyc pathway abundances and coverages (based on EC numbers)
- 5. Add descriptions as new column in gene family and pathway abundance tables
- Shuffling predictions
The entire pipeline can be run with this command:
picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1
If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py
and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.
Place sequences into bacterial reference phylogeny:
place_seqs.py -s study_seqs.fna -o bac_placed_seqs.tre -p 1 \
-r bacteria \
--intermediate placement_working_bac
Place sequences into archaeal reference phylogeny:
place_seqs.py -s study_seqs.fna -o arc_placed_seqs.tre -p 1 \
-r archaea \
--intermediate placement_working_arc
Note that it is not necessary to run both of these if, for example, you know that your dataset should only contain bacteria.
More details
Note that NSTI values will be added to the 16S prediction table (since the -n
option was given).
hsp.py -i 16S -r bacteria -t bac_placed_seqs.tre -o bac_marker_nsti_predicted.tsv.gz -p 1 -n
hsp.py -i 16S -r archaea -t arc_placed_seqs.tre -o arc_marker_nsti_predicted.tsv.gz -p 1 -n
Now we want to check whether the NSTI was lower for the bacterial or the archaeal references for each of our study sequences:
pick_best_domain.py \
-n1 bacteria \
-n2 archaea \
--tree_dom1 bac_placed_seqs.tre \
--tree_dom2 arc_placed_seqs.tre \
--tree_out_dom1 bac_reduced_placed_seqs.tre \
--tree_out_dom2 arc_reduced_placed_seqs.tre \
--nsti_table_dom1 bac_marker_nsti_predicted.tsv.gz \
--nsti_table_dom2 arc_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom1 bac_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_dom2 arc_reduced_marker_nsti_predicted.tsv.gz \
--nsti_table_out_combined combined_marker_nsti_predicted.tsv.gz
4. Run hidden-state prediction to get EC number and KO abundances (or another trait) per predicted genome (v2.6.0)
hsp.py -i EC -r bacteria -t bac_reduced_placed_seqs.tre -o bac_EC_predicted.tsv.gz -p 1
hsp.py -i KO -r bacteria -t bac_reduced_placed_seqs.tre -o bac_KO_predicted.tsv.gz -p 1
hsp.py -i EC -r archaea -t arc_reduced_placed_seqs.tre -o arc_EC_predicted.tsv.gz -p 1
hsp.py -i KO -r archaea -t arc_reduced_placed_seqs.tre -o arc_KO_predicted.tsv.gz -p 1
combine_domains.py --table_dom1 bac_EC_predicted.tsv.gz --table_dom2 arc_EC_predicted.tsv.gz -o combined_EC_predicted.tsv.gz
combine_domains.py --table_dom1 bac_KO_predicted.tsv.gz --table_dom2 arc_KO_predicted.tsv.gz -o combined_KO_predicted.tsv.gz
More details (Adjusts gene family abundances by 16S sequence abundance)
At this point, we have combined the predictions that we obtained separately for the bacterial and archaeal domains and are now running this in the same way as we did previously.
metagenome_pipeline.py -i study_seqs.biom \
-m combined_marker_nsti_predicted.tsv.gz \
-f combined_EC_predicted.tsv.gz \
-o EC_metagenome_out
metagenome_pipeline.py -i study_seqs.biom \
-m combined_marker_nsti_predicted.tsv.gz \
-f combined_KO_predicted.tsv.gz \
-o KO_metagenome_out
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
-o pathways_out \
--intermediate pathways_working \
-p 1
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
-o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
-o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
-o pathways_out/path_abun_unstrat_descrip.tsv.gz
This is essentially using v2.6.0 to replicate the behaviour of earlier versions.
If you want to use the previous oldIMG database, you can run the entire pipeline like this:
picrust2_pipeline_oldIMG.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline_split -p 1
The entire pipeline can be run with this command (details):
picrust2_pipeline_oldIMG.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1
If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline_oldIMG.py
and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.
place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 \
--intermediate placement_working --ref_dir oldIMG
2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome
Note that NSTI values will be added to the 16S prediction table (since the -n
option was given).
hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n -db oldIMG
hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1 -db oldIMG
hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1 -db oldIMG
metagenome_pipeline.py -i study_seqs.biom \
-m marker_nsti_predicted.tsv.gz \
-f EC_predicted.tsv.gz \
-o EC_metagenome_out
metagenome_pipeline.py -i study_seqs.biom \
-m marker_nsti_predicted.tsv.gz \
-f KO_predicted.tsv.gz \
-o KO_metagenome_out
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
-o pathways_out \
--intermediate pathways_working \
-p 1 \
-db oldIMG
add_descriptions_oldIMG.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
-o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions_oldIMG.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
-o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions_oldIMG.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
-o pathways_out/path_abun_unstrat_descrip.tsv.gz
The entire pipeline can be run with this command:
picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1
If you would like to run each step individually you can also do that using the below commands. Using these commands is useful when you're running into problems using picrust2_pipeline.py
and want to isolate an issue or if you only want to re-run part of the PICRUSt2 pipeline.
place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 \
--intermediate placement_working
2. Run hidden-state prediction to get 16S copy numbers, NSTI, EC number and KO abundances per predicted genome
Note that NSTI values will be added to the 16S prediction table (since the -n
option was given).
hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n
hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1
hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1
More details (Adjusts gene family abundances by 16S sequence abundance)
metagenome_pipeline.py -i study_seqs.biom \
-m marker_nsti_predicted.tsv.gz \
-f EC_predicted.tsv.gz \
-o EC_metagenome_out
metagenome_pipeline.py -i study_seqs.biom \
-m marker_nsti_predicted.tsv.gz \
-f KO_predicted.tsv.gz \
-o KO_metagenome_out
pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
-o pathways_out \
--intermediate pathways_working \
-p 1
add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
-o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
-o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
-o pathways_out/path_abun_unstrat_descrip.tsv.gz
An optional additional step is to shuffle the ASV labels in the genome prediction tables (i.e. the outputs of hsp.py
). Any analyses based on these shuffled tables can then be compared with analyses based on the actual data to check if there is more signal in the unshuffled data. See here for more details.
Please first check our FAQ if you have any questions about PICRUSt2.
For other general questions and comments about PICRUSt2 please search the PICRUSt google group. If the question has not been previously answered then please make a new thread.
To report a bug or to make a feature request please make a new issue at the top of this page.