shiquan edited this page Mar 13, 2022 · 42 revisions

PISA is a suite of programs for processing and interacting with single-cell/single-molecule high-throughput sequencing FASTQ and BAM files. The idea behind PISA is to integrate cell barcodes and molecular barcodes (UMIs) into plain FASTQ files, so that users can perform quality control, alignment, or assembly on them with current state-of-the-art software. PISA then parses the barcodes into tag information in the BAM file. Users can also use PISA to select, correct, and summarize tags in BAM files. PISA is therefore flexible and NOT designed for a specific library or platform.

INSTALL

Prerequisite: from v0.11a, OpenMP must be installed on your system.

$ git clone https://github.com/shiquan/PISA
$ cd PISA
$ make

How to cite

The PISA paper is still under review. For now, you can cite the website (https://github.com/shiquan/PISA/) in your manuscript.

SYNOPSIS

Parse the barcodes and put all barcode information into the read names of the FASTQ+ files. FASTQ+ is a variant of the standard FASTQ format and can be used like FASTQ.

PISA parse -config read_structure.json -1 reads.fq -report fastq_report.csv reads_1.fq.gz reads_2.fq.gz
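
To illustrate the idea of carrying barcodes in the read name, here is a minimal Python sketch. It is not PISA's implementation: the `|||TAG:Z:value` name layout, the CR/UR tag names for raw barcodes, the barcode offsets in read 1, and the helper `to_fastq_plus` are all assumptions made for this example. In PISA the real read layout comes from the file given to -config.

```python
# Sketch only: the "|||TAG:Z:value" layout and the offsets below are assumptions.
CB_SLICE = slice(0, 16)    # hypothetical cell-barcode position in read 1
UMI_SLICE = slice(16, 26)  # hypothetical UMI position in read 1


def to_fastq_plus(name, read1_seq, read2_seq, read2_qual):
    """Return one FASTQ record whose name carries the raw barcodes."""
    cell_barcode = read1_seq[CB_SLICE]
    umi = read1_seq[UMI_SLICE]
    tagged_name = f"{name}|||CR:Z:{cell_barcode}|||UR:Z:{umi}"
    return f"@{tagged_name}\n{read2_seq}\n+\n{read2_qual}\n"


record = to_fastq_plus(
    "read1",
    "AAGCATCCACACAGAGTCCTGTGAGC",  # 16 nt barcode followed by 10 nt UMI
    "ACGTACGT",
    "FFFFFFFF",
)
print(record, end="")
```

Because the record is still ordinary four-line FASTQ, aligners and QC tools that keep the read name intact pass the barcodes through unchanged.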

Convert the barcodes in the read names to tags in the SAM file and export to BAM.

PISA sam2bam -report alignment_report.csv in.sam -o out.bam
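
The conversion can be pictured with a small Python sketch that moves barcodes from the read name into optional SAM fields. This is an illustration under assumptions, not PISA's code: the `|||TAG:TYPE:value` name layout and the helper names `split_fastq_plus_name` and `move_barcodes_to_tags` are hypothetical, and writing BAM is left out.

```python
def split_fastq_plus_name(qname):
    """Split 'name|||TAG:TYPE:value|||...' into (name, [(tag, type, value), ...])."""
    name, *fields = qname.split("|||")
    tags = []
    for field in fields:
        # Split on the first two colons only, so values may contain ':'.
        tag, type_code, value = field.split(":", 2)
        tags.append((tag, type_code, value))
    return name, tags


def move_barcodes_to_tags(sam_line):
    """Strip barcodes from the QNAME column and append them as optional fields."""
    cols = sam_line.rstrip("\n").split("\t")
    name, tags = split_fastq_plus_name(cols[0])
    cols[0] = name
    cols.extend(f"{tag}:{type_code}:{value}" for tag, type_code, value in tags)
    return "\t".join(cols)


sam_line = "read1|||CR:Z:AAGC|||UR:Z:TCCT\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF"
converted = move_barcodes_to_tags(sam_line)
print(converted)
```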

Annotate reads and add annotation tags in the BAM

PISA anno -gtf genes.gtf -o anno.bam -@ 5 -report anno_report.csv aln.bam

Correct barcodes in each block of reads. In this example, a block is defined as the reads that share the same cell barcode and gene tags.

PISA corr -tag UR -new-tag UB -tags-block CB,GN -cr -o final.bam -@ 5 anno.bam
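
The block-wise idea can be sketched in Python as follows. This is a simplified stand-in, not PISA's algorithm: it assumes reads are available as dictionaries with CB, GN and UR keys, and it uses a common heuristic (merge a UMI into a more abundant UMI of the same block at Hamming distance 1). The function `correct_umis` and the merging rule are assumptions for illustration.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_umis(reads):
    """Return copies of the reads with a corrected UB tag added.

    UMIs are only compared within a block, i.e. among reads that share
    the same cell barcode (CB) and gene (GN).
    """
    blocks = defaultdict(Counter)
    for read in reads:
        blocks[(read["CB"], read["GN"])][read["UR"]] += 1

    corrected = {}
    for block, counts in blocks.items():
        # Most abundant UMI first; ties are broken alphabetically.
        ranked = sorted(counts, key=lambda umi: (-counts[umi], umi))
        kept = []
        for umi in ranked:
            parent = next(
                (k for k in kept if len(k) == len(umi) and hamming(k, umi) == 1),
                None,
            )
            if parent is None:
                kept.append(umi)
                corrected[(block, umi)] = umi
            else:
                corrected[(block, umi)] = parent

    return [
        dict(read, UB=corrected[((read["CB"], read["GN"]), read["UR"])])
        for read in reads
    ]


reads = [
    {"CB": "cell1", "GN": "GeneA", "UR": "AAAA"},
    {"CB": "cell1", "GN": "GeneA", "UR": "AAAA"},
    {"CB": "cell1", "GN": "GeneA", "UR": "AAAT"},
    {"CB": "cell1", "GN": "GeneB", "UR": "AAAT"},
]
for read in correct_umis(reads):
    print(read["CB"], read["GN"], read["UR"], "->", read["UB"])
```

In this toy input, AAAT is merged into AAAA for GeneA, while the same UMI in the GeneB block is left alone because the blocks are corrected independently.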

Count gene expression based on cell barcode, gene and UMI tags.

PISA count -tag CB -anno-tag GN -umi UB -outdir raw_gene_expression -@ 5 final.bam
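
Conceptually, counting expression from these three tags means counting distinct UMIs per gene and cell. The Python sketch below shows that reduction under assumptions: reads are given as dictionaries with CB, GN and UB keys, and `count_umis` is a hypothetical helper that returns a plain dictionary rather than the files PISA writes to -outdir.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs (UB) for every (gene, cell barcode) pair."""
    seen = defaultdict(set)
    for read in reads:
        seen[(read["GN"], read["CB"])].add(read["UB"])
    return {key: len(umis) for key, umis in seen.items()}


reads = [
    {"CB": "cell1", "GN": "GeneA", "UB": "AAAA"},
    {"CB": "cell1", "GN": "GeneA", "UB": "AAAA"},  # same molecule, counted once
    {"CB": "cell1", "GN": "GeneA", "UB": "CCCC"},
    {"CB": "cell2", "GN": "GeneA", "UB": "AAAA"},
    {"CB": "cell2", "GN": "GeneB", "UB": "GGGG"},
]
matrix = count_umis(reads)
for (gene, cell), n in sorted(matrix.items()):
    print(gene, cell, n)
```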

TODO list

  • Implement parse strategy for cell hashing and CITE-seq;
  • Assemble reads originating from one molecule;
  • Implement a newly designed, more user-friendly parse;
  • Support loom output (frozen);
  • Export unspliced matrix for velocity;
  • Upgrade PISA parse to process FASTQ files faster.

CHANGELOG

v0.11a 2022/03/13
  • Add counts2, which counts a peaks × cells matrix from the fragment file.
v0.10 2022/01/06
  • Update bam2frag to export a fragment file compatible with 10X cellranger-ATAC.
v0.10b 2021/12/09
  • PISA count now has a -velo option to export unspliced and spliced matrices together. For velocity analysis, remember to use -intron to annotate reads.
  • PISA parse supports multithreading.
v0.10a 2021/11/06
  • PISA count supports counting spliced and unspliced reads.
  • PISA count supports counting from multiple BAM files.
v0.9 2021/10/14
  • Rewrite rmdup. Paired reads are not supported for now.
v0.8 2021/07/20
  • Reduce memory usage of count
  • Fix region query bug of anno -bed
  • Add anno -vcf method
v0.7 2020/11/20
  • Introduce the PCR deduplication method rmdup.
  • Mask the read and qual fields as * instead of storing the sequence for secondary alignments in the BAM file.
v0.6 2020/10/29
  • PISA attrcnt: skip secondary alignments before counting reads.
  • PISA anno: fix segmentation fault bugs when loading a malformed GTF.
v0.5 2020/08/27
  • Add PISA bam2frag function (experimental).
  • PISA anno: skip secondary alignments when counting total reads.
v0.4 2020/07/14
  • PISA sam2bam add mapping quality adjustment method;
  • Rewrite UMI correction index structure to reduce memory use;
  • Fix bugs.
v0.4alpha 2020/05/2
  • PISA anno uses the UCSC binning scheme instead of a linear search when querying gene regions for reads. Fix the bug of misannotated antisense reads.
  • PISA count uses MEX output instead of a plain cell vs gene table.
v0.3 2020/03/26 Fix bugs and improve performance.
0.0.0.9999 2019/05/19 Initial version.