CDS or protein FASTA files derived from annotations often contain multiple gene model versions for the same locus. Many applications need only a single gene model per locus. MultiFastaReduceR subsets an input FASTA file and keeps only the longest annotated sequence for each locus.
# In R
install.packages("tidyverse")
install.packages("seqinr")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")
git clone https://github.com/mwylerCH/MultiFastaDemultiplexer
chmod +x MultiFastaDemultiplexer/MultiFastaReduceR
echo 'export PATH="$HOME/MultiFastaDemultiplexer:$PATH"' >> ~/.bashrc
MultiFastaReduceR [MOLECULE] [OverAnnotatedFASTA.fa]
MOLECULE is a string: DNA for nucleotide sequences or AA for protein sequences.
OverAnnotatedFASTA.fa is the original multi-FASTA file, either plain or gzip-compressed. FASTA headers may contain a gene description or other information, but the different gene model versions of a locus must be named as follows: "NC_000019.1", "NC_000019.2", "NC_000019.3", ...
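To illustrate what the naming convention above implies, here is a minimal shell/awk sketch of the "keep the longest model per locus" idea. It is not the code used by MultiFastaReduceR (which runs in R); the function name `keep_longest` is hypothetical, and the sketch assumes that the locus is the first word of the header with the trailing ".N" version removed, that length is the plain character count of the sequence, and that ties go to the first model seen.

```shell
#!/bin/sh
# Sketch only: keep the longest record per locus from a plain or gzipped FASTA.
# Assumes IDs such as "NC_000019.1": locus name, a dot, a numeric version.
keep_longest() {
    # Decompress on the fly when the file name ends in .gz
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        # Save the record just read if it beats the best one for its locus
        function store_record() {
            if (header == "") return
            if (!(locus in best_len)) order[++n] = locus   # remember first-seen order
            if (!(locus in best_len) || len > best_len[locus]) {
                best_len[locus] = len
                best_rec[locus] = header "\n" seq
            }
        }
        /^>/ {
            store_record()
            header = $0
            locus = substr($1, 2)           # first word of the header, without ">"
            sub(/\.[0-9]+$/, "", locus)     # drop the trailing ".N" version
            seq = ""; len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")             # ignore whitespace inside sequence lines
            if ($0 != "") {
                seq = (seq == "" ? $0 : seq "\n" $0)
                len += length($0)
            }
        }
        END {
            store_record()
            for (i = 1; i <= n; i++) print best_rec[order[i]]
        }
    '
}

# Small demonstration on a made-up file with two models of one locus
demo=$(mktemp)
printf '%s\n' \
    '>NC_000019.1 example gene, model 1' 'ATGAAATAA' \
    '>NC_000019.2 example gene, model 2' 'ATGAAACCCGGGTAA' \
    '>NC_000020.1 another gene' 'ATGTAA' > "$demo"
keep_longest "$demo"
rm -f "$demo"
```

In the demonstration, NC_000019.2 (15 bases) is kept over NC_000019.1 (9 bases), and NC_000020.1 is kept because it is the only model of its locus. Headers that do not end their ID in ".N" are treated as their own locus in this sketch.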
Developed with R version 3.5.1, Biostrings 2.50.2, seqinr 3.6-1 and tidyverse 1.3.0 on an Ubuntu 16.06 LTS machine.
22nd August 2020, Giubiasco, Switzerland.