diff --git a/tools/repeatexplorer2/.shed.yml b/tools/repeatexplorer2/.shed.yml new file mode 100644 index 0000000..65722eb --- /dev/null +++ b/tools/repeatexplorer2/.shed.yml @@ -0,0 +1,20 @@ +--- +auto_tool_repositories: + name_template: "{{ tool_id }}" + description_template: "{{ tool_name }} from the cactus suite" +categories: + - Genome annotation +description: Tool for annotation of repeats from unassembled shotgun reads. +homepage_url: https://github.com/repeatexplorer/repex_tarean +long_description: | + Tool for annotation of repeats from unassembled shotgun reads. +name: repeatexplorer2 +owner: gga +remote_repository_url: https://github.com/galaxy-genome-annotation/galaxy-tools/tree/master/tools/repeatexplorer2 +suite: + name: suite_repeatexplorer2 + description: > + Tool for annotation of repeats from unassembled shotgun reads. + long_description: > + Tool for annotation of repeats from unassembled shotgun reads. +type: unrestricted diff --git a/tools/repeatexplorer2/macros.xml b/tools/repeatexplorer2/macros.xml new file mode 100644 index 0000000..495039c --- /dev/null +++ b/tools/repeatexplorer2/macros.xml @@ -0,0 +1,29 @@ + + 2.3.8 + 0 + 23.0 + + + kavonrtep/repeatexplorer:@TOOL_VERSION@ + + + + + @software{repeatexplorer2, + author = {repeatexplorer}, + year = {2023}, + title = {repeatexplorer2}, + publisher = {GitHub}, + url = {https://github.com/repeatexplorer/repex_tarean} + } + + + + + + + + + + + diff --git a/tools/repeatexplorer2/repex_full_clustering.xml b/tools/repeatexplorer2/repex_full_clustering.xml new file mode 100644 index 0000000..af390ed --- /dev/null +++ b/tools/repeatexplorer2/repex_full_clustering.xml @@ -0,0 +1,240 @@ + + repeat discovery and characterization using graph-based sequence clustering + + macros.xml + + + + + + + + + + + + + + +
+ + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + =10 over 95% of bases + and no Ns allowed) and only **complete read pairs** should be submitted for + analysis. When paired reads are used, input data must be **interlaced** format + as fasta file: + + example of interlaced input format:: + + >0001_f + CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG + >0001_r + GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT + >0002_f + ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG + >0002_r + TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC + >0003_f + TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT + >0003_r + TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT + ... + + + **Comparative analysis** + + For comparative analysis sequence names must contain code (prefix) for each group. + Prefix in sequences names must be of fixed length. + + Example of labeling two groups with where **group code length** is 2 and is used to distinguish groups - AA and BB :: + + >AA0001_f + CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG + >AA0001_r + GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT + >AA0002_f + ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG + >AA0002_r + TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC + >BB0001_f + TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT + >BB0001_r + TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT + >BB0002_f + TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT + >BB0002_r + TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT + + + To prepare quality filtered and interlaced input fasta file from fastq + files, use `Preprocessing of paired-reads`__ tool. + + .. __: tool_runner?tool_id=paired_fastq_filtering + + + **Additional parameters** + + **Sample size** defines how many reads should be used in calculation. 
+ The default setting of 500,000 reads enables detection of high-copy + repeats within several hours of computation time. For higher + sensitivity, the sample size can be increased. Since sample size affects + memory usage, this parameter may be automatically adjusted to a lower + value during the run. The maximum sample size that can be processed depends on + the repetitiveness of the analyzed genome. + + + **Select taxon and protein domain database version (REXdb)**. Classification + of transposable elements is based on similarity to our reference database + of transposable element protein domains (**REXdb**). A standalone database for Viridiplantae species + can be obtained from `repeatexplorer.org`__. The classification + system used in REXdb is described in the article `Systematic survey of plant + LTR-retrotransposons elucidates phylogenetic relationships of their + polyprotein domains and provides a reference for element classification`__. + The database for Metazoa species is still under development, so use it with caution. + + .. __: http://repeatexplorer.org + .. __: https://doi.org/10.1186/s13100-018-0144-1 + + **Select parameters for protein domain search** REXdb is compared with + sequence clusters using either the blastx or the diamond aligner. Diamond + is about three times faster than blastx with word size 3. + + **Similarity search options** By default, sequence reads are compared using + the mgblast program. The default threshold is explicitly set to 90% sequence + similarity spanning at least 55% of the read length (for reads + differing in length, it applies to the longer one). Additionally, the sequence + overlap must be at least 55 nt. If you select the option for reads shorter + than 100 nt, the minimum overlap of 55 nt is not required. + + By default, + the mgblast search uses the DUST program to filter out + low-complexity sequences. If you want + to increase the sensitivity of detection of satellites with shorter monomers, + use the option '*no masking of low complexity repeats*'.
Note that omitting + DUST filtering will significantly increase running times. + + + **Automatic filtering of abundant satellite repeats** performs clustering on + a smaller dataset of sequence reads to detect abundant high-confidence + satellite repeats. If such satellites are detected, sequence reads derived + from these satellites are depleted from the input dataset. This step enables more + sensitive detection of less abundant repeats, as more reads can be used + in the clustering step. + + **Use custom repeat database**. This option allows users to perform similarity + comparison of identified repeats to their custom databases. The repeat class must + be encoded in the FASTA headers of database entries to allow correct + parsing of similarity hits. The required format for a custom database sequence name is: :: + + >repeatname#class/subclass + + + **Output** + + List of clusters identified as putative satellite repeats, their genomic + abundance and various cluster characteristics. + + Output includes an **HTML summary** with a table listing all analyzed + clusters. More detailed information about clusters is provided in + additional files and directories. All results are also provided as + a downloadable **zip archive**. Additionally, a **log file** reporting + the progress of the computational pipeline is provided. + + ]]> + +
diff --git a/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz b/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz new file mode 100644 index 0000000..b70be43 Binary files /dev/null and b/tools/repeatexplorer2/test-data/LAS_paired_10k.fa.gz differ