This pipeline is an automatic structural annotation workflow written in Snakemake. The annotation relies on two tools that use RNA-Seq and/or protein homology information to predict coding sequences. The first is BRAKER, which uses GeneMark-EX and AUGUSTUS. The second is AUGUSTUS alone, which improves the annotation of small coding sequences with few or no introns. Before annotation, the repeat elements of the genome are masked to avoid annotation problems. In addition, this workflow can assemble Illumina reads with ABySS using different k-mer values.
# Installation
To install the annotation workflow, use this command:
git clone https://github.com/FlorianCHA/AssemblyAndAnnotation_pipeline.git
This workflow uses many tools for assembly, mapping, annotation and quality control. Two options are available for installing the software: you can install all tools manually, or you can use the Singularity launcher, which does not require installing any of the tools yourself.
If you install the software yourself, fill in the software section of the config.yaml file.
- RepeatMasker, if you want to mask the repeat elements of your genomes
- ABySS, if you want to assemble your Illumina FASTQ files
All containers for the workflow are available here. If you use 'Launcher_singularty.sh', the workflow downloads all the Singularity containers it needs. You only need to download the GeneMark-ES licence here.
To run the workflow, you must provide the paths to all input files. Fill in the config.yaml file before launching the workflow.
# To assemble your Illumina data with ABySS, fill in this part; otherwise skip it (keep every path empty: '')
FASTQ: '/path/to/fastq/directory/'
SUFFIX_FASTQ_R1 : '_R1.fastq.gz'
SUFFIX_FASTQ_R2 : '_R2.fastq.gz'
- FASTQ : Path to the directory containing all the FASTQ files to assemble. If you leave this path empty, the workflow skips the assembly and uses the FASTA files (given in the FASTA option) for the annotation step.
- SUFFIX_FASTQ_R1 : Extension of the R1 FASTQ files in the FASTQ directory (for example: '_R1.fastq.gz')
- SUFFIX_FASTQ_R2 : Extension of the R2 FASTQ files in the FASTQ directory (for example: '_R2.fastq.gz')
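To illustrate how the two suffixes identify read pairs, here is a minimal Python sketch. It is not the workflow's own code: the helper name `pair_fastq` and the sample file names are hypothetical, and it assumes that a sample name is simply the file name with its suffix removed.

```python
def pair_fastq(filenames, suffix_r1, suffix_r2):
    """Group R1/R2 files by sample name (file name minus its suffix).

    Hypothetical helper for illustration only; the workflow may pair
    files differently.
    """
    if not suffix_r1 or not suffix_r2:
        raise ValueError("Both suffixes must be non-empty.")
    if suffix_r1 == suffix_r2:
        # Identical suffixes would pair every R1 file with itself.
        raise ValueError("SUFFIX_FASTQ_R1 and SUFFIX_FASTQ_R2 must differ.")

    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        if name.endswith(suffix_r1):
            sample = name[: -len(suffix_r1)]
            mate = sample + suffix_r2
            # Keep only samples for which both mates are present.
            if mate in names:
                pairs[sample] = (name, mate)
    return pairs


# Example with made-up file names: isolateB has no R2 file and is skipped.
files = ["isolateA_R1.fastq.gz", "isolateA_R2.fastq.gz", "isolateB_R1.fastq.gz"]
print(pair_fastq(files, "_R1.fastq.gz", "_R2.fastq.gz"))
```

The check on identical suffixes is worth keeping in any real implementation: if both options carry the same value, each R1 file ends up treated as its own mate.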
ET_DB: '/path/to/repeat_element_db.fasta'
- ET_DB : Path to the repeat element database for RepeatMasker. If you leave this path empty, the workflow does not mask the repeat elements of the genome.
FASTA: '/path/to/fasta/directory/'
SUFFIX_FASTA : '.fasta'
RNAseq_DIR : 'path/to/RNA_seqfastq/directory/'
SUFFIX_RNAseq : '.fastq.gz'
ID_SPECIES: 'arabidopsis'
PROTEIN_REF: '/path/to/protein_ref.fasta'
GM_KEY : '/path/to/gm_key_64'
- FASTA : Path to the directory containing all the FASTA files to annotate. If the FASTQ option is empty, you must give a valid path here; otherwise you can leave this option empty.
- RNAseq_DIR : Path to the directory containing all the RNA-Seq data. If you leave this path empty, the pipeline runs AUGUSTUS only.
- SUFFIX_RNAseq : Extension of the FASTQ files in the RNAseq_DIR directory (for example: '.fastq.gz', '.fq.gz', '.fq', etc.)
- ID_SPECIES : Species ID used for AUGUSTUS training; please refer to the AUGUSTUS main page for the accepted values
- PROTEIN_REF : Path to the protein FASTA file. If you do not have this file, you can leave this option empty ('').
- GM_KEY : Path to the GeneMark-ES licence file (please click here to download the licence).
- OUTPUT : Output directory for all results of this pipeline
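The options above decide which parts of the workflow run. The following Python sketch summarises the rules as described in this README. It is only an illustration under stated assumptions: the function `planned_steps` and the step labels are hypothetical and are not taken from the workflow's Snakefile.

```python
def planned_steps(config):
    """Return the steps implied by a config dictionary.

    Hypothetical summary of the rules described in this README,
    not the workflow's actual implementation.
    """
    def is_set(key):
        # An option counts as set when it holds a non-blank value.
        return bool(str(config.get(key) or "").strip())

    if not is_set("FASTQ") and not is_set("FASTA"):
        raise ValueError(
            "Set FASTQ (to assemble reads) or FASTA (to annotate existing assemblies)."
        )

    steps = []
    if is_set("FASTQ"):
        steps.append("abyss_assembly")   # empty FASTQ: assembly is skipped
    if is_set("ET_DB"):
        steps.append("repeat_masking")   # empty ET_DB: no masking
    if is_set("RNAseq_DIR"):
        steps.append("braker")           # empty RNAseq_DIR: AUGUSTUS only
    steps.append("augustus")
    return steps


# Example: annotate existing assemblies with masking but without RNA-Seq data.
example = {
    "FASTQ": "",
    "FASTA": "/path/to/fasta/directory/",
    "ET_DB": "/path/to/repeat_element_db.fasta",
    "RNAseq_DIR": "",
}
print(planned_steps(example))
```

Running a check of this kind before launching can catch a config where both FASTQ and FASTA were left empty, which would leave the workflow with nothing to annotate.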