Download miniconda intaller. For more information see https://docs.conda.io/en/latest/miniconda.html
# Download miniconda installer
cd ~/Downloads
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Verify checksum
sha256sum Miniconda3-latest-Linux-x86_64.sh
# Checksum can be found on miniconda documentation : https://docs.conda.io/en/latest/miniconda.html
Run installer;
# Install miniconda
bash Miniconda3-latest-Linux-x86_64.sh
bash # To activate conda base env.
On prompt:
- Confirm license and installation location
- Activate initialization of Miniconda3
- After installation, close and reopen terminal or just type
bash
conda create -n singularity -c conda-forge singularity=3.6.3
singularity image is available in https://github.com/repeatexplorer/repex_tarean/releases
conda activate singularity
cd
mkdir repeatexplorer && cd repeatexplorer
singularity pull https://github.com/repeatexplorer/repex_tarean/releases/download/0.3.8/repex_tarean_0.3.8.sif
singularity build --sandbox repex_tarean repex_tarean_0.3.8.sif
# Verify build
singularity exec repex_tarean seqclust --help
# Download small pair-end dataset:
wget https://bitbucket.org/petrnovak/repex_tarean/raw/devel/test_data/LAS_paired_10k.fas
singularity exec -e --bind ${PWD}:/data/ repex_tarean seqclust -p -v /data/re_test /data/LAS_paired_10k.fas
Get data and check the quality of reads using FastQC program:
# Clustering example 1
cd ~/repeatexplorer
mkdir single_species
cd single_species
# Sample of T.cacao pair-end Illumina reads
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_1.fastq.gz
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_2.fastq.gz
fastqc *.fastq.gz
When QC run finishes, inspect html reports.
Use Trimmomatic program to remove low quality reads and trim read ends. Trimmomatic will be used in Pair-end mode. Trimmomatic options are:
TrimmomaticPE [-threads threads] [-phred33 | -phred64] [-trimlog logFile] \
paired_output_1 unpaired_output_1 paired_output_2 unpaired_output_2 step_1 ...
The current trimming steps are:
- ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
- *fastaWithAdaptersEtc*: specifies the path to a fasta file containing all the
adapters, PCR sequences etc. The naming of the various sequences within this
file determines how they are used.
- *seedMismatches*: specifies the
maximum mismatch count which will still allow a full match to be performed
- *palindromeClipThreshold*: specifies how accurate the match between the two
´adapter ligated´ reads must be for PE palindrome read alignment.
- *simpleClipThreshold*: specifies how accurate the match between any adapter etc.
sequence must be against a read.
- SLIDINGWINDOW:<windowSize>:<requiredQuality>
windowSize: specifies the number of bases to average across
requiredQuality: specifies the average quality required.
- LEADING:<quality>
quality: Specifies the minimum quality required to keep a base.
- TRAILING:<quality>
quality: Specifies the minimum quality required to keep a base.
- CROP:<length>
length: The number of bases to keep, from the start of the read.
- HEADCROP:<length>
length: The number of bases to remove from the start of the read.
- MINLEN:<length>
length: Specifies the minimum length of reads to be kept.
Trimming occurs in the order which the steps are specified on the command line. It is
recommended in most cases that adapter clipping, if required, is done as early as
possible.
# Adapter sequences are located in /usr/share/trimmomatic/
cp /usr/share/trimmomatic/*.fa .
# Remove first 10 nt, min length must be 90
TrimmomaticPE -phred33 SRR089356_1.fastq.gz SRR089356_2.fastq.gz \
SRR089356_1_clean.fastq.gz SRR089356_1_unpaired.fastq.gz \
SRR089356_2_clean.fastq.gz SRR089356_2_unpaired.fastq.gz \
ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90
# Check statistics of fastq files:
seqkit stats *fastq.gz
# Run fastqc on clean data:
fastqc *clean*.fastq.gz
# Paired end read sampling:
seqtk sample -s 10 SRR089356_1_clean.fastq.gz 5000 > SRR089356_1_clean_sample.fastq
seqtk sample -s 10 SRR089356_2_clean.fastq.gz 5000 > SRR089356_2_clean_sample.fastq
# Make interleaved FASTQ
seqtk mergepe SRR089356_1_clean_sample.fastq SRR089356_2_clean_sample.fastq seqtk > SRR089356_clean_sample_merged.fastq
# Convert to FASTA
seqtk seq -A SRR089356_clean_sample_merged.fastq > SRR089356_clean_sample_merged.fasta
# Run clustering with default settings
cd ~/repeatexplorer
singularity exec -e --bind ${PWD}:/data/ repex_tarean seqclust -p -v /data/re_output_run1 /data/single_species/SRR089356_clean_sample_merged.fasta
NOTE : current directory ($PWD
) is /data directory in singularity container.
seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}] [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}] [-c CPU]
[-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR] [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k] [-a {2,3,4,5}]
[-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}]
[-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}]
[-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
sequences
RepeatExplorer:
Repetitive sequence discovery and clasification from NGS data
positional arguments:
sequences
optional arguments:
-h, --help show this help message and exit
-p, --paired
-A, --automatic_filtering
-t, --tarean_mode analyze only tandem reapeats without additional classification
-l LOGFILE, --logfile LOGFILE
log file, logging goes to stdout if not defines
-m {float range 0.0..100.0}, --mincl {float range 0.0..100.0}
-M {0,float range 0.1..1}, --merge_threshold {0,float range 0.1..1}
threshold for mate-pair based cluster merging, default 0 - no merging
-o {float range 30.0..80.0}, --min_lcov {float range 30.0..80.0}
minimal overlap coverage - relative to longer sequence length, default 55
-c CPU, --cpu CPU number of cpu to use, if 0 use max available
-s SAMPLE, --sample SAMPLE
use only sample of input data[by default max reads is used
-P PREFIX_LENGTH, --prefix_length PREFIX_LENGTH
If you wish to keep part of the sequences name,
enter the number of characters which should be
kept (1-10) instead of zero. Use this setting if
you are doing comparative analysis
-v OUTPUT_DIR, --output_dir OUTPUT_DIR
-r MAX_MEMORY, --max_memory MAX_MEMORY
Maximal amount of available RAM in kB if not set
clustering tries to use whole available RAM
-d DATABASE DATABASE, --database DATABASE DATABASE
fasta file with database for annotation and name of database
-C, --cleanup remove unncessary large files from working directory
-k, --keep_names keep sequence names, by default sequences are renamed
-a {2,3,4,5}, --assembly_min {2,3,4,5}
Assembly is performed on individual clusters, by default
clusters with size less then 5 are not assembled. If you
want need assembly of smaller cluster set *assmbly_min*
accordingly
-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}, --taxon {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}
Select taxon and protein database version
-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}, --options {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SENSITIVE_MGBLAST,ILLUMINA_SENSITIVE_BLASTPLUS,OXFORD_NANOPORE}
ILLUMINA : standard option, all-to-all similarity search is
performed using mgblast, threshold for hits is 90 percent identity over
55 percent of the sequence length, word size is 18
ILLUMINA_SENSITIVE_MGBLAST : all-to-all search is performed using mgblast,
with word size 8 and threshold for hits is 80 percent identity over 55 percent of the sequence length
ILLUMINA_SENSITIVE_BLASTPLUS : all-to-all search is performed using blastn,
with word size 6 and threshold for hits is 80 percent identity over 55 percent of the sequence length
OXFORD_NANOPORE: experimental option, all-to-all search is performed using lastal program
-D {BLASTX_W2,BLASTX_W3,DIAMOND}, --domain_search {BLASTX_W2,BLASTX_W3,DIAMOND}
Detection of protein domains can be performed by either blastx or
diamond" program. options are:
BLASTX_W2 - blastx with word size 2 (slowest, the most sesitive)
BLASTX_W3 - blastx with word size 3 (default)
DIAMOND - diamond program (significantly faster, less sensitive)
To use this option diamond program must be installed in your PATH
# Get data from comparative analysis
cd ~/repeatexplorer
mkdir comparative
cd comparative
wget https://github.com/kavonrtep/example_data/raw/master/SRR9938304_1.fastq.gz
wget https://github.com/kavonrtep/example_data/raw/master/SRR9938304_2.fastq.gz
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_1.fastq.gz
wget https://github.com/kavonrtep/example_data/raw/master/SRR089356_2.fastq.gz
seqkit stats *.fastq.gz
fastqc *.fastq.gz
cp /usr/share/trimmomatic/*.fa .
TrimmomaticPE -phred33 SRR089356_1.fastq.gz SRR089356_2.fastq.gz \
SRR089356_1_clean.fastq.gz SRR089356_1_unpaired.fastq.gz \
SRR089356_2_clean.fastq.gz SRR089356_2_unpaired.fastq.gz \
ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90
TrimmomaticPE -phred33 SRR9938304_1.fastq.gz SRR9938304_2.fastq.gz \
SRR9938304_1_clean.fastq.gz SRR9938304_1_unpaired.fastq.gz \
SRR9938304_2_clean.fastq.gz SRR9938304_2_unpaired.fastq.gz \
ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:10 CROP:100 HEADCROP:10 MINLEN:90
seqtk sample -s 10 SRR089356_1_clean.fastq.gz 5000 > SRR089356_1_clean_sample.fastq
seqtk sample -s 10 SRR089356_2_clean.fastq.gz 5000 > SRR089356_2_clean_sample.fastq
seqtk sample -s 10 SRR9938304_1_clean.fastq.gz 5000 > SRR9938304_1_clean_sample.fastq
seqtk sample -s 10 SRR9938304_2_clean.fastq.gz 5000 > SRR9938304_2_clean_sample.fastq
seqtk mergepe SRR089356_1_clean_sample.fastq SRR089356_2_clean_sample.fastq seqtk > SRR089356_clean_sample_merged.fastq
seqtk mergepe SRR9938304_1_clean_sample.fastq SRR9938304_2_clean_sample.fastq seqtk > SRR9938304_clean_sample_merged.fastq
# Convert to FASTA
seqtk seq -A SRR089356_clean_sample_merged.fastq > SRR089356_clean_sample_merged.fasta
seqtk seq -A SRR9938304_clean_sample_merged.fastq > SRR9938304_clean_sample_merged.fasta
# Add prefixes CA, CB
seqtk rename SRR089356_clean_sample_merged.fasta CA > prefix_SRR089356_clean_sample_merged.fasta
seqtk rename SRR9938304_clean_sample_merged.fasta CB > prefix_SRR9938304_clean_sample_merged.fasta
cat prefix* > CA_CB_final.fasta
cd ~/repeatexplorer
singularity exec -e --bind ${PWD}:/data/ repex_tarean seqclust --paired --prefix_length 2 -v /data/re_output_comparative /data/comparative/CA_CB_final.fasta