- Last modified: fre dec 01, 2023 02:43
- Sign: JN
Genomic data sets from the publication [Jarvis et al. (2014)] [1], was downloaded and polished into two sets; Introns and Exons.
These two data sets are available for download here:
-
Exons, compressed folder "fasta_files.tgz", 70 MB: 8,253 fasta files with 42--48 sequences per file, and seq length in the range of 99--15,777 bp.
$ wget -O fasta_files.tgz "https://nrmcloud.nrm.se/s/dbiPtxJnasb4Yze/download"
-
Introns, compressed folder "fasta_files.tgz" 164 MB: 11,013 fasta files with 38--48 sequences per file, and seq length in the range of 58--38,848 bp.
$ wget -O fasta_files.tgz "https://nrmcloud.nrm.se/s/TW3XK4BKA5HtHQX/download"
Description of the original data can be found here http://gigadb.org/dataset/101041, and here ftp://parrot.genomics.cn/gigadb/pub/10.5524/101001_102000/101041/readme.txt.
A taxon-translation file for the acronyms used in the fasta headers can be found here: species_list.csv. The source of this file was Table S6 from Jarvis et al., 2014 (Science 346, 1320).
Two additional subsets of the exon and intron data, including only files containing sequences of lengths between 200 and 5,000 bp (see below), are also available here:
-
Length-filtered Exons, compressed folder "fasta_files.tgz", 60.3 MB: 7,979 fasta files with 42--48 sequences per file, and seq length in the range of 201--4,989 bp.
$ wget -O fasta_files.tgz "https://nrmcloud.nrm.se/s/nyoFF5ooZRoGsRA/download"
-
Length-filtered Introns, compressed folder "fasta_files.tgz" 100.7 MB: 9,362 fasta files with 38--48 sequences per file, and seq length in the range of 200--4,998 bp.
$ wget -O fasta_files.tgz "https://nrmcloud.nrm.se/s/SjH32GsX3222Nzp/download"
To save some time in the Birdscanner pipeline, the HMM's and summary fasta
file are also provided. If these data sets are to be used in
Birdscanner, it is
recommended to download and uncompress the data (selected.tgz
) directly
inside the birdscanner/data/reference/
folder and proceed with the SLURM
script init_and_plast.slurm.sh
. These data can be downloaded here:
-
Length-filtered Exons with HMM's. Compressed folder "selected.tgz", 233.4 MB.
$ wget -O selected.tgz "https://nrmcloud.nrm.se/s/QWqB2xASW4XDTFN/download"
-
Length-filtered Introns with HMM's. Compressed folder "selected.tgz" 389.5 MB.
$ wget -O selected.tgz "https://nrmcloud.nrm.se/s/WjqyWJYiT6GJdeG/download"
When using sequences in searches using HMMer, it might be beneficial to filter
the files based on sequence length. This can be done in many ways. Here are
some examples using the combination of
get_fasta_info.pl
,
grepfasta.pl
, degap_fasta.pl
,
parallel
, and awk
:
Here we copy alignments of min. length 200, and max length 5,000 to a new folder.
$ mkdir -p selected/one
$ cp $(get_fasta_info.pl fasta_files/*.fas 2>/dev/null | \
awk '$2 >= 200 && $2 <= 5000 {print $NF}') selected/one
If you wish to select a subset of the taxa in the alignments, one may do so by using, for example:
First, find all fasta headers
$ grep -h '>' fasta_files/*.fas | sort -u | paste - - - -
>ACACH >ANAPL >APAVI >APTFO
>BALRE >BUCRH >CALAN >CAPCA
>CARCR >CATAU >CHAPE >CHAVO
>CHLUN >COLLI >COLST >CORBR
>CUCCA >EGRGA >EURHE >FALPE
>FULGL >GALGA >GAVST >GEOFO
>HALAL >HALLE >LEPDI >MANVI
>MELGA >MELUN >MERNU >MESUN
>NESNO >NIPNI >OPHHO >PELCR
>PHACA >PHALE >PHORU >PICPU
>PODCR >PTEGU >PYGAD >STRCA
>TAEGU >TAUER >TINMA >TYTAL
Create a file with the headers you wish to extract:
$ echo -e "FULGL\nGALGA\nGAVST\nGEOFO" > searchfile.txt
Create output folder:
$ mkdir -p selected/two
Get the selected sequences from the files:
$ find fasta_files -name '*.fas' | \
parallel "grepfasta.pl -f=searchfile.txt {} > selected/two/{/}"
Note that not all original files contains all headers!
$ get_fasta_info.pl selected/two/*.fas
Keep only files with four sequences:
$ rm -v $(get_fasta_info.pl selected/two/*.fas 2>/dev/null | \
awk '$1 != 4 {print $NF}')
Note that many/most of the alignments now contain "all-gap" columns! Those can be removed using, for example:
$ find selected/two -name '*.fas' | \
parallel "degap_fasta.pl -o={.}.degap {}"
$ find selected/two -name '*.degap' | \
parallel "mv {} {.}.fas"
$ mkdir -p selected/fas
$ echo -e "FULGL\nGALGA\nGAVST\nGEOFO" > searchfile.txt
$ find fasta_files -name '*.fas' | \
parallel "grepfasta.pl -f=searchfile.txt {} > selected/fas/{/}"
$ find selected/fas -name '*.fas' | \
parallel "degap_fasta.pl -o={.}.degap {}"
$ find selected/fas -name '*.degap' | \
parallel "mv {} {.}.fas"
$ rm -v $(get_fasta_info.pl selected/fas/*.fas 2>/dev/null | \
awk '$1 != 4 || $2 < 200 || $2 > 5000 {print $NF}')
Custom scripts are from https://github.com/Naturhistoriska/birdscanner.
SCRDIR=/path/to/birdscanner/src
$ wget ftp://parrot.genomics.cn/gigadb/pub/10.5524/101001_102000/101041/FASTA_files_of_loci_datasets/Filtered_sequence_alignments/8295_Exons/pep2cds-filtered-sate-alignments-noout.tar.gz
$ tar -I pigz -xvf pep2cds-filtered-sate-alignments-noout.tar.gz
$ cd 8000orthologs
$ mkdir -p fasta_files
$ find * -name 'sate.default.pep2cds.removed.noout.aligned' | \
parallel "cp {} fasta_files/{//}.{/}"
$ find fasta_files -type f -name '*.aligned' | \
parallel "sed -i 's/?/N/g' {}"
$ find fasta_files -type f -name '*.aligned' | \
parallel "${SRCDIR}/remove_gapped_seqs_in_fasta.pl -N {} ; rm {}"
$ find fasta_files -type f -name '*.degapped.fas' | \
parallel "${SRCDIR}/degap_fasta.pl -o={.} {} ; rm {}"
$ find fasta_files -type f -name '*.degapped' | \
parallel "sed -i '/^$/d' {}"
$ find fasta_files -type f -name '*.degapped' | \
parallel "mv {} {.}.fas"
$ wget ftp://parrot.genomics.cn/gigadb/pub/10.5524/101001_102000/101041/FASTA_files_of_loci_datasets/Filtered_sequence_alignments/2516_Introns/introns-filtered-sate-alignments-with-and-without-outgroups.tar.gz
$ tar -I pigz -xvf introns-filtered-sate-alignments-with-and-without-outgroups.tar.gz
$ cd 2500orthologs
# Intron files comes in one fasta file ("sate.removed.intron.noout.aligned-allgap.filtered"),
# with partitions defined in another ("sate.removed.intron.noout-allgap.filtered.part").
# Here we split the fasta file in parts. Output are named <part_id>.<intron_id>.fas
$ perl ${SRCDIR}/extract_part_genes.pl
$ find "fasta_files" -type f -name '*.fas' | \
parallel "sed -i 's/?/N/g' {}"
$ find fasta_files -type f -name '*.fas' | \
parallel "${SRCDIR}/remove_gapped_seqs_in_fasta.pl -N {} ; rm {}"
$ find fasta_files -type f -name '*.degapped.fas' | \
parallel "${SRCDIR}/degap_fasta.pl -o={.} {} ; rm {}"
$ find fasta_files -type f -name '*.degapped' | \
parallel "sed -i '/^$/d' {}"
$ find fasta_files -type f -name '*.degapped' | \
parallel "mv {} {.}.fas"
[1]: Jarvis ED; Mirarab S; Aberer AJ; Houde P; Li C; Ho SYW; Faircloth BC; Nabholz B; Howard JT; Suh A; Weber CC; da Fonseca RR; Alfaro-Nunez A; Narula N; Liu L; Burt DW; Ellegren H; Edwards SV; Stamatakis A; Mindell DP; Cracraft J; Braun EL; Warnow T; Wang J; Gilbert MTP; Zhang G (2014): Phylogenomic analyses data of the avian phylogenomics project. GigaScience Database. http://dx.doi.org/10.5524/101041 [2]: ftp://parrot.genomics.cn/gigadb/pub/10.5524/101001_102000/101041/FASTA_files_of_loci_datasets/Filtered_sequence_alignments/8295_Exons/pep2cds-filtered-sate-alignments-noout.tar.gz [3]: ftp://parrot.genomics.cn/gigadb/pub/10.5524/101001_102000/101041/FASTA_files_of_loci_datasets/Filtered_sequence_alignments/2516_Introns/introns-filtered-sate-alignments-with-and-without-outgroups.tar.gz