Skip to content

Latest commit

 

History

History
295 lines (247 loc) · 13.4 KB

practical_training_day_3.org

File metadata and controls

295 lines (247 loc) · 13.4 KB

Practical training - day 3

Running RepeatExplorer2 from command line

Installation of RepeatExplorer2

Installation of miniconda

Download miniconda intaller. For more information see https://docs.conda.io/en/latest/miniconda.html

# Download miniconda installer
cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Verify checksum
sha256sum Miniconda3-latest-Linux-x86_64.sh
# Checksum can be found on miniconda documentation : https://docs.conda.io/en/latest/miniconda.html

Run installer;

# Install miniconda
bash Miniconda3-latest-Linux-x86_64.sh
bash  # To activate conda base env.

On prompt:

  • Confirm license and installation location
  • Activate initialization of Miniconda3
  • After installation, close and reopen terminal or just type bash

Installation of Singularity

conda create -n singularity -c conda-forge singularity=3.6.3

Get RepeatExplorer2 singularity image:

singularity image is available in https://github.com/repeatexplorer/repex_tarean/releases

conda activate singularity
cd 
mkdir repeatexplorer && cd repeatexplorer
singularity pull https://github.com/repeatexplorer/repex_tarean/releases/download/0.3.8/repex_tarean_0.3.8.sif
singularity build  --sandbox repex_tarean repex_tarean_0.3.8.sif
# Verify build
singularity exec repex_tarean seqclust --help

Run clustering on test data:

# Download small pair-end dataset:
wget https://bitbucket.org/petrnovak/repex_tarean/raw/devel/test_data/LAS_paired_10k.fas
singularity exec -e --bind ${PWD}:/data/ repex_tarean  seqclust  -p -v /data/re_test /data/LAS_paired_10k.fas

Example of clustering including data pre-processing

Data:

Get data and check the quality of reads using FastQC program:

# Clustering example 1
cd ~/repeatexplorer
mkdir single_species
cd single_species
# Sample of T.cacao pair-end Illumina reads
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_1.fastq.gz
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_2.fastq.gz
fastqc *.fastq.gz

When QC run finishes, inspect html reports.

Quality filtering

Use Trimmomatic program to remove low quality reads and trim read ends. Trimmomatic will be used in Pair-end mode. Trimmomatic options are:

TrimmomaticPE  [-threads threads] [-phred33 | -phred64] [-trimlog logFile] \
paired_output_1 unpaired_output_1 paired_output_2 unpaired_output_2 step_1 ...

The current trimming steps are:
- ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
  - *fastaWithAdaptersEtc*: specifies the path to a fasta file containing all the
  adapters, PCR sequences etc. The naming of the various sequences within this
  file determines how they are used.
  -  *seedMismatches*: specifies the
  maximum mismatch count which will still allow a full match to be performed
  - *palindromeClipThreshold*: specifies how accurate the match between the two
  ´adapter ligated´ reads must be for PE palindrome read alignment.
  - *simpleClipThreshold*: specifies how accurate the match between any adapter etc.
  sequence must be against a read.
- SLIDINGWINDOW:<windowSize>:<requiredQuality>
                   windowSize: specifies the number of bases to average across
                   requiredQuality: specifies the average quality required.

- LEADING:<quality>
                   quality: Specifies the minimum quality required to keep a base.
- TRAILING:<quality>
                   quality: Specifies the minimum quality required to keep a base.
- CROP:<length>
                   length: The number of bases to keep, from the start of the read.
- HEADCROP:<length>
                   length: The number of bases to remove from the start of the read.
- MINLEN:<length>
                   length: Specifies the minimum length of reads to be kept.
 Trimming occurs in the order which the steps are specified on the command line. It  is
           recommended  in  most  cases  that  adapter clipping, if required, is done as early as
           possible.
# Adapter sequences are located in /usr/share/trimmomatic/
cp  /usr/share/trimmomatic/*.fa .

# Remove first 10 nt, min length must be 90
TrimmomaticPE -phred33 SRR089356_1.fastq.gz SRR089356_2.fastq.gz \
 SRR089356_1_clean.fastq.gz SRR089356_1_unpaired.fastq.gz \
 SRR089356_2_clean.fastq.gz SRR089356_2_unpaired.fastq.gz \
 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90

# Check statistics of fastq files:
seqkit stats *fastq.gz
# Run fastqc on clean data:
fastqc *clean*.fastq.gz

Sample to required coverage:

# Paired end read sampling:
seqtk sample -s 10  SRR089356_1_clean.fastq.gz 5000 >  SRR089356_1_clean_sample.fastq
seqtk sample -s 10  SRR089356_2_clean.fastq.gz 5000 >  SRR089356_2_clean_sample.fastq

Interleaved pairs into single file:

# Make interleaved FASTQ
seqtk mergepe SRR089356_1_clean_sample.fastq SRR089356_2_clean_sample.fastq seqtk  > SRR089356_clean_sample_merged.fastq
# Convert to FASTA
seqtk seq -A SRR089356_clean_sample_merged.fastq > SRR089356_clean_sample_merged.fasta

run RepeatExplorer with default settings:

# Run clustering with default settings
cd ~/repeatexplorer
singularity exec -e --bind ${PWD}:/data/ repex_tarean  seqclust  -p -v /data/re_output_run1 /data/single_species/SRR089356_clean_sample_merged.fasta

NOTE : current directory ($PWD) is /data directory in singularity container.

Command line options:

seqclust  [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}] [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}] [-c CPU]
                [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR] [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k] [-a {2,3,4,5}]
                [-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}]
                [-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}]
                [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                sequences

RepeatExplorer:
    Repetitive sequence discovery and clasification from NGS data

  

positional arguments:
  sequences

optional arguments:
  -h, --help            show this help message and exit
  -p, --paired
  -A, --automatic_filtering
  -t, --tarean_mode     analyze only tandem reapeats without additional classification
  -l LOGFILE, --logfile LOGFILE
                        log file, logging goes to stdout if not defines
  -m {float range 0.0..100.0}, --mincl {float range 0.0..100.0}
  -M {0,float range 0.1..1}, --merge_threshold {0,float range 0.1..1}
                        threshold for mate-pair based cluster merging, default 0 - no merging
  -o {float range 30.0..80.0}, --min_lcov {float range 30.0..80.0}
                        minimal overlap coverage - relative to longer sequence length, default 55
  -c CPU, --cpu CPU     number of cpu to use, if 0 use max available
  -s SAMPLE, --sample SAMPLE
                        use only sample of input data[by default max reads is used
  -P PREFIX_LENGTH, --prefix_length PREFIX_LENGTH
                        If you wish to keep part of the sequences name,
                         enter the number of characters which should be 
                        kept (1-10) instead of zero. Use this setting if
                         you are doing comparative analysis
  -v OUTPUT_DIR, --output_dir OUTPUT_DIR
  -r MAX_MEMORY, --max_memory MAX_MEMORY
                        Maximal amount of available RAM in kB if not set
                        clustering tries to use whole available RAM
  -d DATABASE DATABASE, --database DATABASE DATABASE
                        fasta file with database for annotation and name of database
  -C, --cleanup         remove unncessary large files from working directory
  -k, --keep_names      keep sequence names, by default sequences are renamed
  -a {2,3,4,5}, --assembly_min {2,3,4,5}
                        Assembly is performed on individual clusters, by default 
                        clusters with size less then 5 are not assembled. If you 
                        want need assembly of smaller cluster set *assmbly_min* 
                        accordingly
  -tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}, --taxon {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}
                        Select taxon and protein database version
  -opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}, --options {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}
                        ILLUMINA : standard option, all-to-all similarity search is
                        performed using mgblast, threshold for hits is 90 percent identity over
                        55 percent of the sequence length, word size is 18
                      
                        ILLUMINA_SENSITIVE_MGBLAST : all-to-all search is performed using mgblast,
                        with  word size 8 and threshold for hits is 80 percent identity over 55 percent of the sequence length
                      
                        ILLUMINA_SENSITIVE_BLASTPLUS : all-to-all search is performed using blastn,
                        with  word size 6 and threshold for hits is 80 percent identity over 55 percent of the sequence length
                      
                        OXFORD_NANOPORE: experimental option, all-to-all search is performed using lastal program
  -D {BLASTX_W2,BLASTX_W3,DIAMOND}, --domain_search {BLASTX_W2,BLASTX_W3,DIAMOND}
                        Detection of protein domains can be performed by either blastx or
                         diamond" program. options are:
                          BLASTX_W2 - blastx with word size 2 (slowest, the most sesitive)
                          BLASTX_W3 - blastx with word size 3 (default)
                          DIAMOND   - diamond program (significantly faster, less sensitive)
                        To use this option diamond program must be installed in your PATH

Running comparative analysis

data:

# Get data from comparative analysis
cd ~/repeatexplorer
mkdir comparative
cd comparative
wget  https://github.com/kavonrtep/example_data/raw/master/SRR9938304_1.fastq.gz
wget  https://github.com/kavonrtep/example_data/raw/master/SRR9938304_2.fastq.gz
wget  https://github.com/kavonrtep/example_data/raw/master/SRR089356_1.fastq.gz
wget  https://github.com/kavonrtep/example_data/raw/master/SRR089356_2.fastq.gz
seqkit stats *.fastq.gz

Quality control and filtering:

fastqc *.fastq.gz
cp  /usr/share/trimmomatic/*.fa .
TrimmomaticPE -phred33 SRR089356_1.fastq.gz SRR089356_2.fastq.gz \
 SRR089356_1_clean.fastq.gz SRR089356_1_unpaired.fastq.gz \
 SRR089356_2_clean.fastq.gz SRR089356_2_unpaired.fastq.gz \
 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90

TrimmomaticPE -phred33 SRR9938304_1.fastq.gz SRR9938304_2.fastq.gz \
 SRR9938304_1_clean.fastq.gz SRR9938304_1_unpaired.fastq.gz \
 SRR9938304_2_clean.fastq.gz SRR9938304_2_unpaired.fastq.gz \
 ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90

Sample to required coverage:

seqtk sample -s 10  SRR089356_1_clean.fastq.gz 5000 >  SRR089356_1_clean_sample.fastq
seqtk sample -s 10  SRR089356_2_clean.fastq.gz 5000 >  SRR089356_2_clean_sample.fastq

seqtk sample -s 10  SRR9938304_1_clean.fastq.gz 5000 >  SRR9938304_1_clean_sample.fastq
seqtk sample -s 10  SRR9938304_2_clean.fastq.gz 5000 >  SRR9938304_2_clean_sample.fastq

Interleave:

seqtk mergepe SRR089356_1_clean_sample.fastq SRR089356_2_clean_sample.fastq seqtk  > SRR089356_clean_sample_merged.fastq
seqtk mergepe SRR9938304_1_clean_sample.fastq SRR9938304_2_clean_sample.fastq seqtk  > SRR9938304_clean_sample_merged.fastq
# Convert to FASTA
seqtk seq -A SRR089356_clean_sample_merged.fastq > SRR089356_clean_sample_merged.fasta
seqtk seq -A SRR9938304_clean_sample_merged.fastq > SRR9938304_clean_sample_merged.fasta

Add prefix and concatenate :

# Add prefixes CA, CB
seqtk rename SRR089356_clean_sample_merged.fasta CA > prefix_SRR089356_clean_sample_merged.fasta
seqtk rename SRR9938304_clean_sample_merged.fasta CB > prefix_SRR9938304_clean_sample_merged.fasta
cat prefix* > CA_CB_final.fasta

Comparative clustering:

cd ~/repeatexplorer
singularity exec -e --bind ${PWD}:/data/ repex_tarean  seqclust  --paired --prefix_length 2  -v /data/re_output_comparative /data/comparative/CA_CB_final.fasta