Welcome to the LOng-read General GitHub repo!
This is a GitHub repo of (mostly independent) Python/R scripts that I developed to analyse data from long-read sequencing experiments. The scripts' purposes range from generating txt files for running community tools (example pipelines) and generating plots post-SQANTI to running differential expression analyses and more custom applications.
A pipeline for processing raw ONT reads from transcriptome cDNA sequencing, using research community tools (i.e. Porechop, Minimap2, SQANTI3) and my own custom scripts.
The features listed below can be explored from the <sample>_classification.txt file generated by SQANTI:
- number of isoforms by structural category
- correlate exon number and gene length with isoform number
- identify long non-coding RNA isoforms
- plot and test the number of isoforms with/without certain features (i.e. within/without 50 bp of CAGE peak/TSS/TTS)
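As an illustration of the first feature, the sketch below counts isoforms per structural category using only the Python standard library. It is not one of the repo's scripts: the function name count_by_structural_category is hypothetical, the input is assumed to be a tab-separated table with a structural_category column (as in SQANTI3 classification output), and the three rows are made-up stand-ins for a real <sample>_classification.txt.

```python
import csv
import io
from collections import Counter


def count_by_structural_category(handle):
    """Count isoforms per structural category in a tab-separated table.

    Assumes a header row containing a 'structural_category' column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["structural_category"] for row in reader)


# Made-up rows standing in for <sample>_classification.txt
example = io.StringIO(
    "isoform\tstructural_category\tassociated_gene\n"
    "PB.1.1\tfull-splice_match\tGeneA\n"
    "PB.1.2\tnovel_in_catalog\tGeneA\n"
    "PB.2.1\tfull-splice_match\tGeneB\n"
)

counts = count_by_structural_category(example)
for category, n in counts.most_common():
    print(category, n)
```

For a real file, open("<sample>_classification.txt", newline="") can be passed in place of the in-memory example.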
To run the functions, read in the <sample>_classification.txt file using:
- SQANTI_class_preparation(<sample>_classification.txt, standard) if expression columns are included in the file (after running --FL_count in SQANTI)
- SQANTI_class_preparation(<sample>_classification.txt, nstandard) if expression is not included
- subset_targetgenes_classfiles.py: Subset SQANTI classification file based on genes and reads
- colour_transcripts_by_countandpotential.py: Colour bed file by abundance and coding potential
- extract_fasta_bestorf.py: Create a fasta file based on best ORF defined from CPAT
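To give a feel for what subsetting a classification file by gene involves, here is a minimal sketch. It does not reproduce the logic of subset_targetgenes_classfiles.py: the function subset_by_genes is hypothetical, the associated_gene column name is assumed from SQANTI3 classification output, and the rows are invented for the example.

```python
import csv
import io


def subset_by_genes(in_handle, out_handle, target_genes):
    """Copy rows whose 'associated_gene' is in target_genes; return rows kept."""
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["associated_gene"] in target_genes:
            writer.writerow(row)
            kept += 1
    return kept


# Made-up rows standing in for <sample>_classification.txt
classification = io.StringIO(
    "isoform\tstructural_category\tassociated_gene\n"
    "PB.1.1\tfull-splice_match\tGeneA\n"
    "PB.2.1\tnovel_in_catalog\tGeneB\n"
    "PB.3.1\tfull-splice_match\tGeneC\n"
)
subset = io.StringIO()
n_kept = subset_by_genes(classification, subset, {"GeneA", "GeneC"})
print(n_kept)
print(subset.getvalue(), end="")
```

Streaming row by row keeps memory use flat, which helps when the classification file holds many thousands of isoforms.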
Current script dump, still to be maintained: scripts for importing results after running tappAS, running linear regression, etc.
- replace_filenames_with_csv.py: Replace multiple file names in a directory using reference csv file
- search_fasta_by_sequence.py: Subset fasta based on sequence
- subset_fasta_gtf.py: Subset gtf, fasta and bed files based on list of transcript IDs
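Subsetting a fasta file by transcript ID can be sketched in a few lines of standard-library Python. This is an assumption-laden illustration rather than the code in subset_fasta_gtf.py: subset_fasta is a hypothetical helper, the record ID is taken to be the first whitespace-delimited word after ">", and the sequences are invented.

```python
import io


def subset_fasta(in_handle, out_handle, keep_ids):
    """Copy FASTA records whose ID is in keep_ids; return the number written.

    The ID is assumed to be the first word on the header line after '>'.
    """
    writing = False
    n_written = 0
    for line in in_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            writing = record_id in keep_ids
            if writing:
                n_written += 1
        if writing:
            # Header and all following sequence lines of a kept record
            out_handle.write(line)
    return n_written


# Made-up records for illustration
fasta = io.StringIO(
    ">PB.1.1 gene=GeneA\nACGT\nTTGA\n"
    ">PB.1.2\nGGCC\n"
    ">PB.2.1\nAATT\n"
)
out = io.StringIO()
n = subset_fasta(fasta, out, {"PB.1.1", "PB.2.1"})
print(n)
print(out.getvalue(), end="")
```

Multi-line sequences are handled because the keep/skip decision is made once per header and applied to every line until the next header.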