kfuku52/cdskit

Overview

CDSKIT (/sidieskit/) is a Python program for manipulating protein-coding nucleotide sequences. It treats codons (sets of three nucleotides) as the unit of operation, so coding sequences are edited without introducing frameshifts. Any sequence format supported by Biopython can be used for both input and output.
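To illustrate what editing in codon units means, here is a minimal pure-Python sketch. It is not CDSKIT's implementation; the function names `to_codons` and `drop_codons` are hypothetical and exist only for this example. Because whole codons are removed, the remaining sequence stays in frame.

```python
def to_codons(seq):
    """Split an in-frame nucleotide sequence into a list of codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def drop_codons(seq, positions):
    """Remove whole codons (0-based codon indices), keeping the reading frame."""
    return "".join(c for i, c in enumerate(to_codons(seq)) if i not in positions)


cds = "ATGGCCAAATGA"
print(to_codons(cds))         # ['ATG', 'GCC', 'AAA', 'TGA']
print(drop_codons(cds, {1}))  # ATGAAATGA
```

Deleting a single nucleotide instead would shift every downstream codon, which is the frameshift that codon-wise editing avoids.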

Installation

# Installation with pip
pip install git+https://github.com/kfuku52/cdskit

# This should show complete options if installation is successful
cdskit -h 

Subcommands

See Wiki for detailed descriptions.

  • accession2fasta: Retrieving FASTA sequences from a list of GenBank accessions

  • aggregate: Extracting the longest sequences combined with a sequence name regex

  • backtrim: Back-translating a trimmed protein alignment

  • hammer: Removing less-occupied codon columns from a gappy alignment

  • intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a GFF file

  • label: Modifying sequence labels

  • mask: Masking ambiguous and/or stop codons

  • pad: Making nucleotide sequences in-frame by padding the head and tail

  • parsegb: Converting the GenBank format

  • printseq: Printing a subset of sequences with a regex

  • rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters

  • split: Splitting 1st, 2nd, and 3rd codon positions

  • stats: Printing sequence statistics
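As a rough illustration of two of the ideas listed above (masking codons and splitting codon positions), the following pure-Python sketch shows how such operations can work on an in-frame sequence. It does not reproduce the behavior or options of `cdskit mask` or `cdskit split`. The function names, the `NNN` placeholder, the use of the standard genetic code, and the choice to leave gap characters untouched are all assumptions of this example.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Characters treated as unambiguous; gaps are left untouched here.
UNAMBIGUOUS = set("ACGT-")


def mask_codons(seq, mask_stop=True, mask_ambiguous=True, placeholder="NNN"):
    """Replace stop and/or ambiguous codons with a placeholder codon."""
    out = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3].upper()
        is_stop = codon in STOP_CODONS
        is_ambiguous = any(base not in UNAMBIGUOUS for base in codon)
        if (mask_stop and is_stop) or (mask_ambiguous and is_ambiguous):
            out.append(placeholder)
        else:
            out.append(codon)
    return "".join(out)


def split_codon_positions(seq):
    """Return the 1st, 2nd, and 3rd codon positions as three strings."""
    return seq[0::3], seq[1::3], seq[2::3]


print(mask_codons("ATGTAARCCGGG"))         # ATGNNNNNNGGG
print(split_codon_positions("ATGGCCAAA"))  # ('AGA', 'TCA', 'GCA')
```

Both functions assume the input is already in frame, which is the situation the pad subcommand is described as producing.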

Streamlined analysis

CDSKIT is designed for data flow through standard input and output. Its subcommands can therefore be chained with pipes (|) and combined with other sequence-processing tools such as SeqKit.

# Example 
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*"  > output.fasta
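The same standard-input/standard-output convention makes it easy to slot a custom step into such a pipeline. The sketch below is a hypothetical FASTA filter, not part of CDSKIT or SeqKit: it copies only records whose length is a multiple of three. The names `read_fasta` and `keep_in_frame` are invented for this example, and the parser handles plain FASTA text only.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep_in_frame(inp, out):
    """Copy records whose length is a multiple of three; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(inp):
        if seq and len(seq) % 3 == 0:
            out.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# Demonstration on in-memory streams so the example runs without any files.
demo_in = io.StringIO(">a\nATGGCC\n>b\nATGG\n>c\nATG\nTGA\n")
demo_out = io.StringIO()
print(keep_in_frame(demo_in, demo_out))  # 2
print(demo_out.getvalue(), end="")       # records a and c only
```

To use it between two piped commands, the function would be called as `keep_in_frame(sys.stdin, sys.stdout)` after `import sys`, so that it reads from the previous tool and writes to the next one.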

Citation

There is no published paper on CDSKIT itself, but we have used and cited CDSKIT in several papers, including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).

Licensing

This program is BSD-licensed (3 clause). See LICENSE for details.
