CDSKIT (/sidieskit/) is a Python program that manipulates protein-coding nucleotide sequences. It handles DNA sequences using codons (sets of three nucleotides) as the unit and therefore edits coding sequences without causing frameshifts. All sequence formats supported by Biopython are available for both input and output.
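To illustrate what codon-unit editing means, here is a minimal pure-Python sketch. It is not CDSKIT's implementation, and the function names `to_codons` and `drop_codons` are hypothetical. Because only whole codons are removed, the remaining length stays a multiple of three and the downstream reading frame is unchanged.

```python
def to_codons(seq):
    """Split an in-frame nucleotide sequence into a list of 3-letter codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def drop_codons(seq, codon_indices):
    """Remove whole codons (0-based codon indices) so the frame is preserved."""
    skip = set(codon_indices)
    kept = [c for i, c in enumerate(to_codons(seq)) if i not in skip]
    return "".join(kept)


cds = "ATGGCCAAATGA"            # codons: ATG GCC AAA TGA
edited = drop_codons(cds, [1])  # remove the second codon (GCC)
print(to_codons(cds))
print(edited, len(edited) % 3 == 0)
```

Deleting a single nucleotide instead of a whole codon would shift every downstream codon, which is the failure mode that codon-unit processing avoids.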
# Installation with pip
pip install git+https://github.com/kfuku52/cdskit
# This should show complete options if installation is successful
cdskit -h
See Wiki for detailed descriptions.
- accession2fasta: Retrieving FASTA sequences from a list of GenBank accessions
- aggregate: Extracting the longest sequences combined with a sequence name regex
- backtrim: Back-translating a trimmed protein alignment
- hammer: Removing less-occupied codon columns from a gappy alignment
- intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a GFF file
- label: Modifying sequence labels
- mask: Masking ambiguous and/or stop codons
- pad: Making nucleotide sequences in-frame by head and tail paddings
- parsegb: Converting the GenBank format
- printseq: Printing a subset of sequences with a regex
- rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
- split: Splitting 1st, 2nd, and 3rd codon positions
- stats: Printing sequence statistics
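As a rough illustration of the kind of operations listed above, the following pure-Python sketch mimics three of them on a single sequence: padding to a multiple of three, masking stop codons, and splitting by codon position. It is a toy under stated assumptions, not CDSKIT's code. The helper names, the use of `N` as the padding and masking character, the tail-only padding, and the restriction to the standard genetic code are all illustrative choices; the actual subcommands have their own options and behavior, described in the Wiki.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code only


def pad_tail(seq, pad_char="N"):
    """Append pad characters until the length is a multiple of three."""
    remainder = len(seq) % 3
    return seq if remainder == 0 else seq + pad_char * (3 - remainder)


def mask_stop_codons(seq, mask_char="N"):
    """Replace every in-frame stop codon with three mask characters."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(
        mask_char * 3 if codon.upper() in STOP_CODONS else codon
        for codon in codons
    )


def split_codon_positions(seq):
    """Return the 1st, 2nd, and 3rd codon positions as three strings."""
    return seq[0::3], seq[1::3], seq[2::3]


raw = "ATGTAACCGT"                  # 10 nt, not a multiple of three
padded = pad_tail(raw)              # length becomes 12
masked = mask_stop_codons(padded)   # the in-frame TAA is masked
first, second, third = split_codon_positions(masked)
print(padded, masked, first, second, third)
```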
CDSKIT is designed for data flow through standard input and output. Its subcommands can therefore be chained with each other and with other sequence processing tools, such as SeqKit, using pipes (|).
# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
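A custom step can take part in this kind of pipeline as long as it reads sequences from one stream and writes them to another. The sketch below is a hypothetical filter, not part of CDSKIT or SeqKit: the names `read_fasta` and `keep_in_frame` are made up for illustration, and it assumes plain FASTA text. It keeps only records whose length is a multiple of three, and is demonstrated here on in-memory streams.

```python
import io


def read_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA-formatted text stream."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def keep_in_frame(in_handle, out_handle):
    """Copy only the records whose sequence length is a multiple of three."""
    for label, seq in read_fasta(in_handle):
        if len(seq) % 3 == 0:
            out_handle.write(f">{label}\n{seq}\n")


# Demonstration with in-memory streams standing in for stdin and stdout.
demo_in = io.StringIO(">a\nATGAAA\n>b\nATGA\n")
demo_out = io.StringIO()
keep_in_frame(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Calling `keep_in_frame(sys.stdin, sys.stdout)` (after `import sys`) instead of using the in-memory streams would let such a script sit between two commands in a pipe.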
There is no published paper on CDSKIT itself, but we have used and cited CDSKIT in several papers, including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).
This program is BSD-licensed (3 clause). See LICENSE for details.