This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data using the following steps:
- Repeat annotation
- Repeat masking
- RNASeq trimming, filtering and QC
- RNASeq alignment
- Annotation with BRAKER
- Annotation with Liftoff
- Annotation filtering and merging
- Functional annotation
- Orthology inference
- Final annotation files
- Annotation QC
- Pipeline information and MultiQC
Output files
- repeatmodeler/
  - *.fa: Repeat library
- edta/
  - *.EDTA.TElib.fa: Repeat library
A repeat library is created with either REPEATMODELER or EDTA. The tool is selected with the repeat_annotator parameter (default: repeatmodeler). Repeat annotation outputs are saved to the output directory only if the save_annotated_te_lib parameter is set to true (default: false).
Output files
- repeatmasker/
  - *.masked: Masked assembly
Soft masking of the repeats is performed with REPEATMASKER using the repeat library prepared in the previous step. Masking outputs are saved to the output directory only if the repeatmasker_save_outputs parameter is set to true (default: false).
Output files
- fastqc_raw/
  - *.html: HTML QC report for a sample before trimming
  - *.zip: Zipped QC files for a sample before trimming
- fastqc_trim/
  - *.html: HTML QC report for a sample after trimming
  - *.zip: Zipped QC files for a sample after trimming
- fastp/
  - html/
    - *.fastp.html: HTML trimming report for a sample
  - json/
    - *.fastp.json: Trimming statistics for a sample
  - log/
    - *.fastp.log: Trimming log for a sample
  - *_{1,2}.fail.fastq.gz: Reads which failed trimming
  - *.paired.fail.fastq.gz: Pairs of reads which failed trimming
  - *.merged.fastq.gz: Reads which passed trimming. For paired reads, reads 1 and 2 are merged into a single file
- sortmerna/
  - *.sortmerna.log: Filtering log for a sample
  - *_{1,2}.non_rRNA.fastq.gz: Filtered reads
RNASeq reads are trimmed with FASTP and quality-checked with FASTQC. Ribosomal reads are filtered out using SORTMERNA. Trimmed reads are stored in the output directory only if the save_trimmed parameter is set to true (default: false). Reads filtered by SORTMERNA are stored in the output directory only if the save_non_ribo_reads parameter is set to true (default: false).
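When trimmed outputs are saved, the fastp JSON report can be used to check how many reads survived trimming. The following is a minimal sketch, assuming the report exposes summary.before_filtering.total_reads and summary.after_filtering.total_reads as fastp reports commonly do. The function name read_retention and the sample values are illustrative and not part of the pipeline.

```python
import json


def read_retention(fastp_json_text):
    """Return (reads_before, reads_after, fraction_kept) from a fastp JSON report.

    Assumes the 'summary' section holds 'before_filtering' and
    'after_filtering' blocks, each with a 'total_reads' count.
    """
    report = json.loads(fastp_json_text)
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    fraction = after / before if before else 0.0
    return before, after, fraction


# Made-up report content standing in for fastp/json/<sample>.fastp.json
example = json.dumps({
    "summary": {
        "before_filtering": {"total_reads": 2000},
        "after_filtering": {"total_reads": 1900},
    }
})
before, after, fraction = read_retention(example)
print(f"{after} of {before} reads kept ({fraction:.1%})")
```

For a real run, pass the text of one of the *.fastp.json files to the function instead of the in-memory example.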
Output files
- star/
  - alignment/
    - X.on.Y.Aligned.sortedByCoord.out.bam: Sorted BAM file of read alignments for sample X against reference Y
    - X.on.Y.Log.final.out: STAR final log file for sample X against reference Y
  - cat_bam/
    - Y.bam: A single BAM file for reference Y created by concatenating alignments from sample-wise *.on.Y.Aligned.sortedByCoord.out.bam files
RNASeq alignment is performed with STAR. Alignment files are stored in the output directory only if the star_save_outputs parameter is set to true (default: false). Concatenated BAM files are stored in the output directory only if the save_cat_bam parameter is set to true (default: false).
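The STAR final log can be turned into a dictionary for a quick look at mapping rates. This is a rough sketch, assuming the usual Log.final.out layout in which each statistic is written as "name | value" on its own line. The function name parse_star_log and the sample text are illustrative only.

```python
def parse_star_log(text):
    """Collect 'name | value' pairs from the text of a STAR Log.final.out file."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines carry no value
        name, _, value = line.partition("|")
        stats[name.strip()] = value.strip()
    return stats


# Made-up excerpt standing in for star/alignment/X.on.Y.Log.final.out
example = (
    "                          Number of input reads |\t1000\n"
    "                                  UNIQUE READS:\n"
    "                        Uniquely mapped reads % |\t91.50%\n"
)
stats = parse_star_log(example)
print(stats["Uniquely mapped reads %"])
```

Values are kept as strings so that percentages, counts and timestamps can be handled by the caller as needed.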
Output files
- etc/braker/
  - Y/
    - braker.gff3: Gene models predicted by BRAKER in GFF3 format
    - braker.gtf: Gene models predicted by BRAKER in GTF format
    - braker.codingseq: Coding sequences for the predicted genes
    - braker.aa: Protein sequences for the predicted genes
    - braker.log: BRAKER log file
    - hintsfile.gff: Evidential hints used by BRAKER in GFF format
    - what-to-cite.txt: A list of references which must be cited when reporting outputs created by BRAKER
BRAKER is used to annotate each genome assembly using the provided protein and RNASeq evidence. Outputs from BRAKER are stored in the output directory only if the braker_save_outputs parameter is set to true (default: false).
Caution
BRAKER outputs are intermediary files, not the final outputs of the pipeline, which is why they are not stored by default.
The pipeline further processes the BRAKER predictions and stores the final validated outputs in the annotations directory. The braker_save_outputs option is only provided to allow a manual resume of the pipeline for advanced use cases. See Advanced inputs for manual resume in the usage doc.
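The save options described so far can be collected in a parameters file and passed to Nextflow with its -params-file option. The sketch below is one possible way to build such a file; the parameter names and defaults are the ones stated in this document, while the helper names build_params and write_params_file and the file name are hypothetical.

```python
import json

# Defaults as stated in this document: every save option is off by default.
DOCUMENTED_DEFAULTS = {
    "repeat_annotator": "repeatmodeler",
    "save_annotated_te_lib": False,
    "repeatmasker_save_outputs": False,
    "save_trimmed": False,
    "save_non_ribo_reads": False,
    "star_save_outputs": False,
    "save_cat_bam": False,
    "braker_save_outputs": False,
}


def build_params(**overrides):
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise KeyError(f"Not a parameter described here: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


def write_params_file(path, params):
    """Write the parameters as JSON so Nextflow can read them via -params-file."""
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)


# Example: switch to EDTA and keep the repeat library and trimmed reads.
params = build_params(
    repeat_annotator="edta",
    save_annotated_te_lib=True,
    save_trimmed=True,
)
print(json.dumps(params, indent=2))
```

Calling write_params_file("my_params.json", params) would write the file, which could then be supplied on the command line as -params-file my_params.json alongside the other required inputs of the pipeline.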
Gene models are lifted from reference assembly(ies) to the target assembly using LIFTOFF. Currently, the outputs from Liftoff are considered intermediary and an option to store them in the output directory is not available.
Annotations obtained from BRAKER and LIFTOFF are filtered with TSEBRA and merged with AGAT. Currently, the outputs from these processes are considered intermediary and an option to store them in the output directory is not available.
Output files
- annotations/
  - Y/
    - Y.emapper.annotations: TSV with the annotation results
    - Y.emapper.hits: TSV with the search results
    - Y.emapper.seed_orthologs: TSV with the results from parsing the hits, linking queries with seed orthologs
Functional annotation of the gene models from BRAKER and Liftoff is performed with EGGNOG-MAPPER.
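The annotation TSV can be loaded with a few lines of Python. This sketch assumes the layout commonly written by EGGNOG-MAPPER: comment lines starting with "##", followed by a single header line starting with "#query". The function name and the sample rows are illustrative, and the column names in the example are assumptions rather than a complete list.

```python
def read_emapper_annotations(text):
    """Parse the text of a *.emapper.annotations file into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # blank lines and run metadata
        if line.startswith("#"):
            # Header line, e.g. "#query<TAB>seed_ortholog<TAB>..."
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("Data row found before the header line")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


# Made-up excerpt standing in for annotations/Y/Y.emapper.annotations
example = (
    "## emapper run metadata\n"
    "#query\tseed_ortholog\tDescription\n"
    "g1.t1\t3702.AT1G01010.1\tNAC domain protein\n"
    "## 1 queries scanned\n"
)
for row in read_emapper_annotations(example):
    print(row["query"], row["Description"])
```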
Output files
- orthofinder/
  - Comparative_Genomics_Statistics/
  - Gene_Duplication_Events/
  - Orthogroups/
  - Phylogenetic_Hierarchical_Orthogroups/
  - Species_Tree/
If more than one genome is included in the pipeline, ORTHOFINDER is used to perform an orthology inference.
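A common first question about the orthology results is which orthogroups have members in every genome. The sketch below assumes the Orthogroups directory contains an Orthogroups.tsv table as ORTHOFINDER typically writes it: an Orthogroup column followed by one column per genome, each cell holding a comma-separated gene list. The function name and sample table are illustrative.

```python
import csv
import io


def shared_orthogroups(tsv_text):
    """Return the IDs of orthogroups that have at least one gene in every genome."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    shared = []
    for row in reader:
        orthogroup = row.pop("Orthogroup")
        # Remaining columns are genomes; an empty cell means no member genes.
        if all((genes or "").strip() for genes in row.values()):
            shared.append(orthogroup)
    return shared


# Made-up table standing in for orthofinder/Orthogroups/Orthogroups.tsv
example = (
    "Orthogroup\tgenomeA\tgenomeB\n"
    "OG0000000\ta1, a2\tb1\n"
    "OG0000001\ta3\t\n"
)
print(shared_orthogroups(example))
```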
Output files
- annotations/
  - Y/
    - Y.gt.gff3: Final annotation file for genome Y which contains gene models and their functional annotations
    - Y.pep.fasta: Protein sequences for the gene models
The final annotation files are saved in GFF3 format validated with GENOMETOOLS and FASTA format obtained with GFFREAD.
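A quick sanity check on a final annotation file is to count its features by type. The following minimal sketch relies only on the standard nine-column GFF3 layout, where the third column is the feature type; the function name and sample records are illustrative.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count GFF3 records by feature type (column 3)."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Made-up records standing in for annotations/Y/Y.gt.gff3
example = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1\n"
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "chr1\tsrc\texon\t1\t400\t.\t+\t.\tParent=g1.t1\n"
    "chr1\tsrc\texon\t600\t900\t.\t+\t.\tParent=g1.t1\n"
)
print(dict(count_feature_types(example)))
```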
Output files
- busco/
  - gff/
    - short_summary.specific.Y.eudicots_odb10.txt: BUSCO summary for annotations from genome Y against the eudicots_odb10 database
    - busco_figure: BUSCO summary figure including statistics for annotations from all the genomes
  - fasta/
    - short_summary.specific.Y.eudicots_odb10.txt: BUSCO summary for genome Y against the eudicots_odb10 database
    - busco_figure: BUSCO summary figure including statistics for all the genomes
- etc/
  - splicing_marked/
    - Y.gff3: Final annotation file for genome Y which contains gene models and their functional annotations. Additionally, the intron features are marked as canonical or non-canonical and the splice motif is added as an attribute.
The completeness of the annotations is checked with BUSCO. To provide a comparative baseline, the completeness of the genomes is also checked. Moreover, the canonical/non-canonical splicing of the introns is also assessed by the pipeline.
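To compare annotation completeness against the genome baseline, the one-line score can be pulled out of each BUSCO short summary. This sketch assumes the summary contains the familiar "C:..%[S:..%,D:..%],F:..%,M:..%,n:.." line; the function name and sample text are illustrative, and other BUSCO versions may format this line differently.

```python
import re

# Pattern for the one-line BUSCO score; percentages are floats, n is an integer.
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Return the BUSCO percentages and total count from a short summary."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("No BUSCO one-line score found")
    scores = {name: float(value) for name, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])
    return scores


# Made-up summary lines standing in for the gff/ and fasta/ summaries of genome Y
annotation = parse_busco_summary("\tC:95.0%[S:93.0%,D:2.0%],F:2.0%,M:3.0%,n:2326\n")
genome = parse_busco_summary("\tC:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:2326\n")
print(f"Complete BUSCOs: annotation {annotation['complete']}%, genome {genome['complete']}%")
```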
Output files
- multiqc_report.html: A MultiQC report which includes QC statistics, software versions and references
- genepal_report.html: A specialized pangene analysis report
- genepal_data/: Files containing parsed data from the reporting module
MultiQC and R Markdown are used to curate the results of the pipeline in two HTML reports. The MultiQC report is meant to serve as an exhaustive report of the pipeline outputs, whereas the R Markdown report (genepal_report.html) provides an overall summary along with a pangene analysis if multiple genomes are provided as input.
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and genepal_software_mqc_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Parameters used by the pipeline run: params.json.
Nextflow can generate various reports about the running and execution of the pipeline. These reports allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. MultiQC compiles an HTML report from the tools used by the pipeline.
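For troubleshooting, the execution trace can be summarised by task status. The sketch below assumes execution_trace.txt is tab-separated with a header row that includes name and status columns, which matches the default Nextflow trace fields; the function name, process names and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_trace(trace_text):
    """Return (status counts, names of failed tasks) from a Nextflow trace file."""
    rows = list(csv.DictReader(io.StringIO(trace_text), delimiter="\t"))
    status_counts = Counter(row["status"] for row in rows)
    failed = [row["name"] for row in rows if row["status"] == "FAILED"]
    return status_counts, failed


# Made-up excerpt standing in for pipeline_info/execution_trace.txt
example = (
    "task_id\tname\tstatus\texit\n"
    "1\tSTEP_A (sample1)\tCOMPLETED\t0\n"
    "2\tSTEP_B (sample1)\tFAILED\t1\n"
    "3\tSTEP_A (sample2)\tCOMPLETED\t0\n"
)
counts, failed = summarize_trace(example)
print(dict(counts), failed)
```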