
Filter multi fasta files to keep only the longest gene model


mwylerCH/MultiFastaReduceR


MultiFastaDemultiplexer

CDS or protein FASTA files derived from genome annotations often contain multiple gene model versions for the same locus. Many applications need only a single gene model per locus. MultiFastaReduceR subsets an input FASTA file and keeps only the longest annotated sequence for each locus.

Usage

Dependencies

# In R
install.packages("tidyverse")
install.packages("seqinr")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

Download

git clone https://github.com/mwylerCH/MultiFastaDemultiplexer
chmod +x MultiFastaDemultiplexer/MultiFastaReduceR
echo 'export PATH="$HOME/MultiFastaDemultiplexer:$PATH"' >> ~/.bashrc

Subset fasta

MultiFastaReduceR [MOLECULE] [OverAnnotatedFASTA.fa]

MOLECULE is a string: DNA for nucleotide sequences or AA for protein sequences.
OverAnnotatedFASTA.fa is the original multi-FASTA file. It may be gzip-compressed or uncompressed. FASTA headers may contain gene descriptions or other information, but the different versions of a gene model must be named as follows: "NC_000019.1", "NC_000019.2", "NC_000019.3",...
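
The naming convention above is what makes the reduction possible: models that share an ID once the trailing ".N" is removed belong to the same locus. The following is a minimal awk sketch of that idea, not the R implementation shipped in this repository. The file names toy.fa and longest.fa and the helper keep_if_longest are hypothetical, and the sketch assumes the model ID is the first word of each header and that ties are resolved by keeping the first model seen.

```shell
# Toy input (hypothetical): two models of one locus, one model of another.
cat > toy.fa <<'EOF'
>NC_000019.1 hypothetical protein
ATGAAA
>NC_000019.2 hypothetical protein
ATGAAACCC
GGG
>NC_000020.1 another gene
ATGC
EOF

# Keep the longest record per locus (ID with the trailing ".N" removed).
awk '
  # Store the current record if it is the longest seen so far for its locus.
  function keep_if_longest() {
    if (header == "") return                      # nothing read yet
    if (!(locus in best_len)) order[++n] = locus  # remember first-seen order
    if (!(locus in best_len) || length(seq) > best_len[locus]) {
      best_len[locus] = length(seq)
      best_rec[locus] = header "\n" seq
    }
  }
  /^>/ {
    keep_if_longest()                # finish the previous record
    header = $0; seq = ""
    locus = substr($1, 2)            # first word of the header, without ">"
    sub(/\.[0-9]+$/, "", locus)      # drop the version suffix, e.g. ".2"
    next
  }
  { seq = seq $0 }                   # join wrapped sequence lines
  END {
    keep_if_longest()                # finish the last record
    for (i = 1; i <= n; i++) print best_rec[order[i]]
  }
' toy.fa > longest.fa

cat longest.fa
```

In this sketch each kept sequence is written on a single line, and headers are passed through unchanged. For a gzip-compressed input, the file could be decompressed with `gzip -dc` and piped into the same awk program.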

Technical

Developed with R version 3.5.1, Biostrings 2.50.2, seqinr 3.6-1 and tidyverse 1.3.0 on an Ubuntu 16.04 LTS machine.
22nd August 2020, Giubiasco, Switzerland.
