shiquan edited this page Nov 9, 2021 · 42 revisions

PISA is a suite of programs for processing and interacting with single-cell or single-molecule high-throughput sequencing FASTQ and BAM files. The idea behind PISA is to integrate cell barcodes and molecular barcodes (UMIs) into plain FASTQ files, so that users can perform quality control, alignment, or assembly on those FASTQs with current state-of-the-art software, and then use PISA to parse the barcodes into tags in the BAM file. PISA can also select, correct, and summarize tags in BAM files. PISA is therefore flexible and NOT designed for a specific library or platform.

INSTALL

$ git clone https://github.com/shiquan/PISA
$ cd PISA
$ make

SYNOPSIS

Parse the reads and put all barcode information into the read names of the FASTQ+ file. FASTQ+ is a variant of the standard FASTQ format and can be used like FASTQ.

PISA parse -config read_structure.json -1 reads.fq -report fastq_report.csv reads_1.fq.gz reads_2.fq.gz

Convert the barcodes in the read names to SAM tags and export to BAM

PISA sam2bam -report alignment_report.csv in.sam -o out.bam
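The two steps above form a round trip: barcodes are carried in the read name through alignment, then turned back into tags. The sketch below illustrates that idea only. The barcode layout (a 16-base cell barcode followed by a 12-base UMI at the start of read 1), the `|||TAG:Z:VALUE` name suffix, and both function names are assumptions made for illustration; check the PISA documentation for the actual FASTQ+ layout.

```python
def tag_read_name(name, read1_seq, cb_len=16, umi_len=12):
    """Append the cell barcode and UMI taken from read 1 to a read name.

    The barcode positions and the '|||TAG:Z:VALUE' suffix are illustrative
    assumptions, not PISA's specification.
    """
    cell_barcode = read1_seq[:cb_len]
    umi = read1_seq[cb_len:cb_len + umi_len]
    return f"{name}|||CR:Z:{cell_barcode}|||UR:Z:{umi}"


def split_tagged_name(tagged):
    """Recover the original read name and a dict of tags from a tagged name."""
    name, *fields = tagged.split("|||")
    tags = {}
    for field in fields:
        # Each field looks like TAG:TYPE:VALUE; only TAG and VALUE are kept.
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return name, tags


# Hypothetical read 1: 16-base barcode + 12-base UMI + trailing bases.
tagged = tag_read_name("@read1", "AAGCATCCACACAGAG" + "TCTTGTGCATGA" + "TTTT")
print(tagged)
print(split_tagged_name(tagged))
```

Because the barcodes travel inside the read name, any aligner that preserves read names can be used between the two steps.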

Annotate reads and add annotation tags to the BAM

PISA anno -gtf refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf -o anno.bam -@ 5 -report anno_report.csv aln.bam

Correct barcodes within each block of reads, for example reads that share the same cell barcode and gene tags

PISA corr -tag UR -new-tag UB -tags-block CB,GN -cr -o final.bam -@ 5 anno.bam
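To make the "block" idea concrete, here is a minimal sketch of block-wise UMI correction. It groups reads by (CB, GN) and, within each block, merges a UMI into a more frequent UMI that differs by exactly one base. This merge rule is a common heuristic chosen for illustration; it is not taken from the PISA source, and `correct_umis` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_umis(reads):
    """Map each (CB, GN, UR) triple to a corrected UMI.

    reads: iterable of dicts with 'CB', 'GN' and 'UR' keys.
    Assumed rule (not PISA's documented behavior): within a block, a UMI is
    merged into an already accepted, more frequent UMI at Hamming distance 1.
    """
    blocks = defaultdict(Counter)
    for read in reads:
        blocks[(read["CB"], read["GN"])][read["UR"]] += 1

    corrected = {}
    for block, counts in blocks.items():
        accepted = []
        # Most frequent UMIs first; ties are broken alphabetically.
        for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            parent = next(
                (a for a in accepted
                 if len(a) == len(umi) and hamming(a, umi) == 1),
                None,
            )
            if parent is None:
                accepted.append(umi)
                parent = umi
            corrected[block + (umi,)] = parent
    return corrected


reads = (
    [{"CB": "C1", "GN": "G1", "UR": "AAAA"}] * 3
    + [{"CB": "C1", "GN": "G1", "UR": "AAAT"}]
    + [{"CB": "C1", "GN": "G1", "UR": "CCCC"}]
    + [{"CB": "C2", "GN": "G1", "UR": "AAAT"}]
)
result = correct_umis(reads)
print(result)
```

In this toy input, `AAAT` is merged into `AAAA` only inside the (C1, G1) block; the same UMI in cell C2 is left alone because correction never crosses block boundaries.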

Count gene expression based on cell barcode, gene and UMI tags

PISA count -tag CB -anno-tag GN -umi UB -outdir raw_gene_expression -@ 5 final.bam
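Conceptually, this kind of counting reduces to the number of distinct UMIs seen for each (cell barcode, gene) pair. The sketch below shows that reduction on in-memory records. `count_umis` is a hypothetical helper, and the rule of skipping reads that lack any of the three tags is an assumption rather than documented PISA behavior.

```python
from collections import defaultdict


def count_umis(reads, cell_tag="CB", gene_tag="GN", umi_tag="UB"):
    """Count distinct UMIs per (cell, gene) pair.

    reads: iterable of dicts mapping tag names to values.
    Reads missing any of the three tags are skipped (assumed behavior).
    """
    seen = defaultdict(set)
    for read in reads:
        if cell_tag in read and gene_tag in read and umi_tag in read:
            seen[(read[cell_tag], read[gene_tag])].add(read[umi_tag])
    return {key: len(umis) for key, umis in seen.items()}


reads = [
    {"CB": "C1", "GN": "G1", "UB": "AAAA"},
    {"CB": "C1", "GN": "G1", "UB": "AAAA"},  # PCR duplicate, counted once
    {"CB": "C1", "GN": "G1", "UB": "CCCC"},
    {"CB": "C2", "GN": "G1", "UB": "AAAA"},
    {"CB": "C2", "UB": "GGGG"},              # no gene tag, skipped
]
matrix = count_umis(reads)
print(matrix)
```

The resulting dictionary is a sparse cell-by-gene matrix, which is the same information a MEX-style output stores on disk.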

TODO list

  • Support loom output
  • Export unspliced matrix for velocity
  • Upgrade PISA parse to process FASTQs faster.

CHANGELOG

  • v0.10a 2021/11/06

    • PISA count supports counting spliced and unspliced reads with the -ttype option.
    • PISA count supports counting from multiple BAM files.
  • v0.9 2021/10/14

    • Rewrite rmdup. Paired reads are not supported for now.
  • v0.8 2021/07/20

    • Reduce memory usage of count
    • Fix region query bug of anno -bed
    • Add anno -vcf method
  • v0.7 2020/11/20

    • Introduce rmdup, a PCR deduplication method.
    • Mask read and qual field as * instead of sequence for secondary alignments in the BAM file.
  • v0.6 2020/10/29

    • PISA attrcnt: skip secondary alignments before counting reads.
    • PISA anno: fix segmentation fault when loading a malformed GTF.
  • v0.5 2020/08/27

    • Add PISA bam2frag function (experimental).
    • PISA anno: skip secondary alignments when counting total reads.
  • v0.4 2020/07/14

    • PISA sam2bam: add a mapping quality adjustment method.
    • Rewrite the UMI correction index structure to reduce memory use.
    • Fix bugs.
  • v0.4alpha 2020/05/2

    • PISA anno: use the UCSC binning scheme instead of linear search when querying gene regions for reads. Fix misannotation of antisense reads.
    • PISA count: use MEX output instead of a plain cell-by-gene table.
  • v0.3 2020/03/26

    • Fix bugs and improve performance.
  • 0.0.0.9999 2019/05/19

    • Init.