diff --git a/tools/repeatexplorer2/.shed.yml b/tools/repeatexplorer2/.shed.yml
new file mode 100644
index 0000000..65722eb
--- /dev/null
+++ b/tools/repeatexplorer2/.shed.yml
@@ -0,0 +1,20 @@
+---
+auto_tool_repositories:
+ name_template: "{{ tool_id }}"
+ description_template: "{{ tool_name }} from the repeatexplorer2 suite"
+categories:
+ - Genome annotation
+description: Tool for annotation of repeats from unassembled shotgun reads.
+homepage_url: https://github.com/repeatexplorer/repex_tarean
+long_description: |
+ Tool for annotation of repeats from unassembled shotgun reads.
+name: repeatexplorer2
+owner: gga
+remote_repository_url: https://github.com/galaxy-genome-annotation/galaxy-tools/tree/master/tools/repeatexplorer2
+suite:
+ name: suite_repeatexplorer2
+ description: >
+ Tool for annotation of repeats from unassembled shotgun reads.
+ long_description: >
+ Tool for annotation of repeats from unassembled shotgun reads.
+type: unrestricted
diff --git a/tools/repeatexplorer2/macros.xml b/tools/repeatexplorer2/macros.xml
new file mode 100644
index 0000000..495039c
--- /dev/null
+++ b/tools/repeatexplorer2/macros.xml
@@ -0,0 +1,29 @@
+
+ 2.3.8
+ 0
+ 23.0
+
+
+ kavonrtep/repeatexplorer:@TOOL_VERSION@
+
+
+
+
+ @software{repeatexplorer2,
+ author = {repeatexplorer},
+ year = {2023},
+ title = {repeatexplorer2},
+ publisher = {GitHub},
+ url = {https://github.com/repeatexplorer/repex_tarean}
+ }
+
+
+
+
+
+
+
+
+
+
+
diff --git a/tools/repeatexplorer2/repex_full_clustering.xml b/tools/repeatexplorer2/repex_full_clustering.xml
new file mode 100644
index 0000000..af390ed
--- /dev/null
+++ b/tools/repeatexplorer2/repex_full_clustering.xml
@@ -0,0 +1,240 @@
+
+ repeat discovery and characterization using graph-based sequence clustering
+
+ macros.xml
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ =10 over 95% of bases
+ and no Ns allowed) and only **complete read pairs** should be submitted for
+ analysis. When paired reads are used, the input must be a FASTA file in
+ **interlaced** format.
+
+ Example of interlaced input format::
+
+ >0001_f
+ CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
+ >0001_r
+ GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
+ >0002_f
+ ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
+ >0002_r
+ TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
+ >0003_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >0003_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+ ...
+
+
+ **Comparative analysis**
+
+ For comparative analysis, sequence names must contain a code (prefix) identifying each group.
+ The prefix must have the same fixed length in all sequence names.
+
+ Example of labeling two groups, AA and BB, where the **group code length** is 2::
+
+ >AA0001_f
+ CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
+ >AA0001_r
+ GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
+ >AA0002_f
+ ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
+ >AA0002_r
+ TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
+ >BB0001_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >BB0001_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+ >BB0002_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >BB0002_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+
+
+ To prepare a quality-filtered, interlaced FASTA input file from FASTQ
+ files, use the `Preprocessing of paired-reads`__ tool.
+
+ .. __: tool_runner?tool_id=paired_fastq_filtering
+
+
+ **Additional parameters**
+
+ **Sample size** defines how many reads are used in the calculation.
+ The default setting of 500,000 reads enables detection of high-copy
+ repeats within several hours of computation time. For higher
+ sensitivity, the sample size can be increased. Because sample size affects
+ memory usage, this parameter may be automatically lowered
+ during the run. The maximum sample size that can be processed depends on
+ the repetitiveness of the analyzed genome.
+
+
+ **Select taxon and protein domain database version (REXdb)**. Classification
+ of transposable elements is based on similarity to our reference database
+ of transposable element protein domains (**REXdb**). A standalone database for Viridiplantae species
+ can be obtained from `repeatexplorer.org`__. The classification
+ system used in REXdb is described in the article `Systematic survey of plant
+ LTR-retrotransposons elucidates phylogenetic relationships of their
+ polyprotein domains and provides a reference for element classification`__.
+ The database for Metazoa species is still under development, so use it with caution.
+
+ .. __: http://repeatexplorer.org
+ .. __: https://doi.org/10.1186/s13100-018-0144-1
+
+ **Select parameters for protein domain search**. REXdb is compared with
+ sequence clusters using either the blastx or the diamond aligner. Diamond
+ is about three times faster than blastx with word size 3.
+
+ **Similarity search options**. By default, sequence reads are compared using
+ the mgblast program. The default threshold is explicitly set to 90% sequence
+ similarity spanning at least 55% of the read length (for reads
+ differing in length, this applies to the longer one). Additionally, the sequence
+ overlap must be at least 55 nt. If you select the option for reads shorter
+ than 100 nt, the minimum overlap of 55 nt is not required.
+
+ By default,
+ the mgblast search uses the DUST program to filter out
+ low-complexity sequences. To increase the sensitivity
+ of detection of satellites with shorter monomers,
+ use the option '*no masking of low complexity repeats*'. Note that omitting
+ DUST filtering will significantly increase running times.
+
+
+ **Automatic filtering of abundant satellite repeats** performs clustering on
+ a smaller dataset of sequence reads to detect abundant, high-confidence
+ satellite repeats. If such satellites are detected, sequence reads derived
+ from them are depleted from the input dataset. This step enables more
+ sensitive detection of less abundant repeats, as more reads can be used
+ in the clustering step.
+
+ **Use custom repeat database**. This option allows users to compare
+ identified repeats against their own custom databases. The repeat class must
+ be encoded in the FASTA headers of database entries to allow correct
+ parsing of similarity hits. The required format for custom database sequence names is::
+
+ >repeatname#class/subclass
+
+
+ **Output**
+
+ List of clusters identified as putative satellite repeats, their genomic
+ abundance and various cluster characteristics.
+
+ The output includes an **HTML summary** with a table listing all analyzed
+ clusters. More detailed information about clusters is provided in
+ additional files and directories. All results are also provided as a
+ downloadable **zip archive**. Additionally, a **log file** reporting
+ the progress of the computational pipeline is provided.
+
+ ]]>
+
+
diff --git a/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz b/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz
new file mode 100644
index 0000000..b70be43
Binary files /dev/null and b/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz differ
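
The help text in repex_full_clustering.xml asks for interlaced FASTA input, with a fixed-length group prefix for comparative analysis. As a rough illustration of that layout, the sketch below interlaces two read lists and optionally prepends a group code. It is not the `Preprocessing of paired-reads` tool referenced in the help and does no quality filtering. The function names `read_fasta` and `interlace`, the `prefix` parameter, and the renaming of reads to a four-digit counter with `_f`/`_r` suffixes are assumptions modeled on the example headers in the help.

```python
from itertools import zip_longest


def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def interlace(forward_lines, reverse_lines, prefix=""):
    """Return interlaced FASTA text built from forward and reverse reads.

    Reads are paired by position and renamed (hypothetical naming scheme)
    to <prefix><counter>_f / <prefix><counter>_r. A ValueError is raised
    when the inputs differ in read count, because only complete pairs
    should be submitted.
    """
    records = []
    pairs = zip_longest(read_fasta(forward_lines), read_fasta(reverse_lines))
    for index, (fwd, rev) in enumerate(pairs, start=1):
        if fwd is None or rev is None:
            raise ValueError("inputs do not contain the same number of reads")
        records.append(f">{prefix}{index:04d}_f\n{fwd[1]}\n")
        records.append(f">{prefix}{index:04d}_r\n{rev[1]}\n")
    return "".join(records)
```

Calling `interlace(open("sample_1.fasta"), open("sample_2.fasta"), prefix="AA")` (file names hypothetical) would give headers such as `>AA0001_f` and `>AA0001_r`, which corresponds to a group code length of 2.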