Merge pull request #6 from PacificBiosciences/update_v0.2.2

Update v0.2.2
PacificBiosciences · Jun 30, 2023 · ffead68 · ffead68
2 parents b43a537 + 59ca2e1
commit ffead68
Show file tree

Hide file tree

Showing 2 changed files with 116 additions and 110 deletions.
diff --git a/README.md b/README.md
@@ -15,6 +15,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
 3. [Output](#output)
 4. [Examples](#examples)
 5. [Accessory scripts](#accessory)
+6. [What's up with all the immune genes?](#immune)
 
 
 ## Install <a name="install"></a>
@@ -27,13 +28,13 @@ Binaries are also availible in the github releases.
 
 ## Usage <a name="usage"></a>
 
-`pbfusion` has two primary executables: `pbfusion` and `gffcache`.
+`pbfusion` has two subcommands: `pbfusion discover` and `pbfusion gff-cache`.
 
-`gffcache` is not required, but recommended when running `pbfusion` multiple times. `gffcache` will serialize the input GTF/GFF file and preprocess into exonic intervals ahead of time, which is fairly slow to do on the fly.
+`pbfusion gff-cache` is not required, but recommended when running `pbfusion` multiple times. `pbfusion gff-cache` will serialize the input gtf/gff file and preprocess into exonic intervals ahead of time, which is fairly slow to do on the fly.
 You can find GENCODE annotation files [here](https://www.gencodegenes.org/human/release_38.html).
 
 ```
-Usage: gffcache [OPTIONS] --gtf <ReferenceAnnotation>
+Usage: pbfusion gff-cache [OPTIONS] --gtf <ReferenceAnnotation>
 
 Options:
   -g, --gtf <ReferenceAnnotation>            Input GTF file
@@ -50,74 +51,67 @@ Options:
 3. The output prefix. `pbfusion` writes multiple files, prefixed with the user specified string.
 
 ```
-A tool for detecting fusion genes in aligned PacBio RNA data
+Identify fusion genes in aligned PacBio Iso-Seq data
 
-Usage: pbfusion [OPTIONS] --bam <FILE> --reference-gtf <REF> --output-prefix <OUTPUT>
+Usage: pbfusion discover [OPTIONS] --bam <FILE> --gtf <REF> --output-prefix <OUTPUT>
 
 Options:
   -b, --bam <FILE>
-          Aligned FLNC or transcript BAM file
-  -g, --reference-gtf <REF>
-          Input annotations in gtf format
+          Aligned Iso-Seq data in BAM format
+  -g, --gtf <REF>
+          Reference gene annotations in GTF format
   -o, --output-prefix <OUTPUT>
-          Output bed file prefix [default: none]
-  -d, --breakpoint-group-distance <MAX_CLUST_DIST>
-          Maximum allowed distance (in bp) for clustering breakpoints [default: 500]
+          Output prefix [default: none]
   -t, --threads <THREADS>
           Number of threads [default: 1]
-  -m, --min-coverage <MIN_COVERAGE>
-          Filter out breakpoint groups with coverage lower than this number of reads (exclusive) [default: 2]
+  -c, --min-coverage <MIN_COVERAGE>
+          Assigns "low confidence" to fusion calls with read coverage below the minimum coverage threshold [default: 2]
+  -i, --min-mean-identity <MIN_MEAN_IDENTITY>
+          Assigns "low confidence" to fusion calls where the mean alignment identity iew below the threshold [default: 0.85]
+  -p, --min-mean-mapq <MIN_MEAN_MAPQ>
+          Assigns "low confidence" to fusion calls where the mean mapq is below the threshold [default: 0.]
+  -M, --min-fusion-read-fraction <MIN_FUSION_READ_FRACTION>
+          Remove breakpoint pairs from groups if they have gene alignments which fewer than <arg> reads in group have [default: 0.25]
+  -s, --max-variability <MAX_VARIABILITY>
+          Assigns "low confidence" to fusion calls with the mean breakpoint distance is above the threshold [default: 1000]
+  -a, --max-readthrough <MAX_READTHROUGH>
+          Assigns "low confidence" to fusion calls spanning two genes below the readthrough threshold [default: 100000]
+  -r, --real-cell-filtering
+          Real-cell filtering for single-cell data. Iso-Seq reads annotated with zero "rc" tag value will be filtered
+  -l, --min-segment-length <MIN_SEGLEN>
+          Minimum alignment length, alignments below the threshold will be filtered before clustering [default: 25]
+  -Q, --min-fusion-quality <MIN_FUSION_QUALITY>
+          Determine the minimum fusion quality to emit. Choices: must be either LOW or MEDIUM [default: MEDIUM]
   -v, --verbose...
-          Verbose output
-          Apply once for info-level messages and additional output information
-          Apply twice or more for debug logs
-          If --log-level is set, this option takes precedence.
+          Enable verbose output. Enable verbose output
       --log-level <LOG_LEVEL>
           Alternative to repeated -v/--verbose: set log level via key.
           Values: "error", "warn" (default), "info", "debug", "trace".
           Enabling any level higher than "warn" also emits verbose output, including extra output files.
           If -v/--verbose is set, this option is ignored.
           Equivalence to -v/--verbose:
-                => "warn"
-             -v => "info"
-            -vv => "debug"
-           -vvv => "trace" [default: error]
-  -T, --fusion-readthrough-threshold <fusion_readthrough_threshold>
-          Distance to use to determine if a transcript aligned to two nearby genes on the same strand
-          is a read-through event or a canonical fusion. [default: 10000]
-  -F, --filter-groups-failing-any
-          Filter out groups with breakpoint pairs failing any category.
-          Standard behavior is to filter out groups with pairs failing all categories, which is more permissive.
-  -R, --emit-readthrough
-          To emit ReadThrough events. These are not emitted by default
-  -S, --emit-sense-antisense
-          To emit SenseAntisense events. These are not emitted by default
-  -A, --emit-unannotated-exons
-          To emit unannotated exon events. These are not emitted by default
-  -O, --emit-overlapping
-          To emit overlapping gene events. These are not emitted by default
-  -a, --emit-all-breakpoints
-          Emit all breakpoints. Enables --emit-unannotated-exons, --emit-sense-antisense, --emit-overlapping, and --emit-readthrough
-  -n, --novel-exon-discovery
-          Whether to discover "novel exons"
+                => "WARN"
+             -v => "INFO"
+            -vv => "DEBUG"
+           -vvv => "TRACE" [default: error]
   -h, --help
-          Print help information (use `--help` for more detail)
+          Print help (see more with '--help')
   -V, --version
-          Print version information
+          Print version
 
 Copyright (C) 2004-2023     Pacific Biosciences of California, Inc.
 This program comes with ABSOLUTELY NO WARRANTY; it is intended for
 Research Use Only and not for use in diagnostic procedures.
 ```
 
-By default, readthrough and sense-antisense chimeras, and overlapping gene events, and events involving unannotated exons are filtered out.
+By default, readthrough and sense-antisense chimeras, and overlapping gene events, and events involving unannotated exons are marked as low-quality.
 
-They can be emitted by enabling relevant `--emit-{category}` flags, or all can be enabled with `-a/--emit-all-breakpoints`.
+They can be unfiltered by enabling `--emit-all-breakpoints.`
 
 
 ## Output <a name="output"></a>
 
-`pbfusion` produces **one output file designed for end users**: `{prefix}.breakpoints.groups.bed`. All other files are auxiliary and usually used for diagnostic purposes.
+`pbfusion discover` produces **one output file designed for end users**: `{prefix}.breakpoints.groups.bed`. All other files are auxiliary and usually used for diagnostic purposes.
 
 If verbose output is enabled (`-v`), five output files sharing the same prefix are emitted.
 
@@ -136,32 +130,42 @@ If verbose output is enabled (`-v`), five output files sharing the same prefix a
 The clustered breakpoint call file is [BEDPE](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format) file formatted including header lines.
 
 Notes:
-  - The cluster score is set to a constant "MEDIUM"
+  - The cluster score is set to "MEDIUM" unless the group is considered low-quality, in which case it is "LOW"
+    - Scores may be set to "LOW" depending on:
+      - Low coverage
+      - Low alignment identity
+      - High breakpoint variability
   - You can `grep` the output for specific gene names or gene IDs if you're fishing for a specific fusion gene pair
 
 
 ### Filtering options
 
-The primary filtering options to reduce false positives are `--min-coverage` (`-m`) and `--fusion-readthrough-threshold` (`-T`).
+Filtering is now primarily done through adjusting the allowed score [default "MEDIUM"].
+The primary filtering options to reduce false positives are `--min-coverage` (`-c`), `--max-readthrough` (`-a`), `--min-mean-identity` (`-i`), `--min-mean-mapq` (`-p`), `--min-fusion-read-fraction` (`-M`), and `--max-variability` (`-s`).
 `--min-coverage` just filters out breakpoints based on their read support, where the default value is 2 to filter out singletons.
-`--fusion-readthrough-threshold` is used to discard reads that align to two genes next to each other in the genome.
-By default, this is set to a modest 10kb, but for more stringent fusion gene calling, we recommend setting this option to 100kb.
-Another option to use for more stringent fusion gene calling is `--filter-groups-failing-any`.
-This means that breakpoint pairs that fail any of our internal checks (readthrough, strand switch, unannotated exons, or overlapping genes), that pair will not get emitted.
+`--max-readthrough` is used to discard reads that align to two genes next to each other in the genome [default 100kb].
+`--min-mean-identity` will assign low confidence to fusions with mean mapping identity lower than the threshold [0.85]. Sometimes in hard to map regions of the genome, the aligner will force an alignment through a region and incur a high edit distance.
+`--min-mean-mapq` default is set to `0.` because even with high alignment identity, some genes have close relatives where the aligner will assign a MAPQ of 0.
+`--min-fusion-read-fraction` is used to filter long chains of genes.
+An example of this would be an IG alignment, where maybe 100 reads align to various annotated regions (eg. IGHA1, IGHV3-23, IGHV3-7, IGHG3, IGHJ5, IGHJ4, IGHJ3, IGHGP, IGHJ2, IGHJ6).
+Given a read coverage of these genes like [100, 100, 90, 90, 20, 10, 10, 5, 5, 5], we would by default filter genes with coverage lower than 25% of the total read count for this fusion.
+With that filter, you're left with this coverage: [100, 100, 90, 90], which with the read filtering brings it down to [80, 80, 70, 70].
+Because this fusion still has >3 genes in it, it would get filtered out.
+`--max-variability` allows you to filter based on breakpoint variability [default 1000].
 
 
 ## Examples <a name="examples"></a>
 
 Serializing the input gtf file:
 ```
-gffcache \
+pbfusion gff-cache \
     --gtf gencode.v38.annotation.gtf \
     --gtf-out gencode.v38.annotation.gtf.bin
 ```
 
 Running `pbfusion` on aligned reads:
 ```
-pbfusion \
+pbfusion discover \
     --bam isoseq.aligned.bam \
     --reference-gtf gencode.v38.annotation.gtf.bin \
     --output-prefix isoseq \
@@ -192,6 +196,7 @@ Visualization dependencies:
 
 
 ### Extract tag
+**This script is deprecated now that cell barcodes are emitted automatically**
 The main use case for this script is to output associated cell barcodes for fusion gene reads.
 `extract_tag.py` will take a BAM file, BAM tag, and a list of read names (can be taken from the output BEDPE and edited for one readname per line).
 The output is tab delimited with the read name and its associated cell barcode.
@@ -206,10 +211,46 @@ python3 extract_tag.py \
     >cell_associations.tsv
 ```
 
-This script will be deprecated once cell barcodes are emitted automatically.
 
-## Changelog 
-* v0.1.0 - initial release.
+## What's up with all the immune genes? <a name="immune"></a>
+If you see a lot of immune genes in your output (eg. IGH, IGL, IGK, TRA, TRB), this is completely normal.
+These genes are recombined at the genomic level (VDJ or VJ recombination), so aside from filtering the gene names, these will be detected as fusions.
+
+
+### Changelog
+Changelog - PacBio Fusion Detection - pbfusion
+
+## v0.2.2 6/29/23
+### Changes
+- Command-line interface, filtering, and formatting changes.
+- Option to filter by average mapping quality on either side of a breakpoint.
+- Option to filter reads from groups based on gene coverage (as a %-age of total reads for that group).
+- Filters out entries where the number of genes is 1 or >3.
+- We now mark events by quality, and then filter by quality levels instead of directly filtering events.
+  These events can be emitted with `--min-fusion-quality LOW` instead of the default `MEDIUM`.
+- Simplified output. Fewer categories, simpler identifiers, and a new header format that is easier to parse.
+- Number of reads filter now checks the `im` bam tag for clustered input. Now, the number of original reads is used for filtering instead of the number of clusters.
+
+## v0.2.1: 6/8/23
+### Changes
+- Add options to filter segments by length and error rate.
+
+## v0.2.0: 5/22/23
+### Changes
+- Join pbfusion and pbgffcache into one executable `pbfusion` with subcommands `pbfusion discover` and `pbfusion gff-cache` corresponding to the previous.
+
+## v0.1.2: 5/19/23
+### Changes
+- Add edit distance per segment as a feature of Interval for filtering.
+
+## v0.1.1: 4/26/23
+### Changes
+- Add cell barcodes if found, and original read names if reads are clustered to output file. The first tag set in `CB`, `XC`, and `CR` is used.
+- Add `--max-mean-breakpoint-distance` flag, which marks breakpoint groups as low quality if the mean distance of breakpoints in a cluster exceeds this threshold.
+
+## v0.1.0
+### Changes
+- Initial release
 
 ## Disclaimer
 

diff --git a/scripts/visualize_fusion.py b/scripts/visualize_fusion.py
@@ -56,25 +56,21 @@ def read_bedpe(bedpe):
     '''
     with open(bedpe) as f:
         for line in f:
-            if line.startswith('#') or not line.rstrip():
+            if line.startswith('#'):
                 continue
             sl = line.rstrip().split('\t')
             breakpoints = ((sl[0], sl[1]), (sl[3], sl[4]))
             reads = set(sl[11][6:].split(','))
             aux = sl[10].split(';')
             for field in aux:
-                if field.startswith('GENE_IDS='):
+                if field.startswith('GI='):
                     gene_ids = field[9:].split(',')
-                if field.startswith('GENE_NAMES='):
-                    gene_names = field[11:].split(',')
-    gene_ids = list(zip(gene_ids, gene_names))
     return reads, gene_ids, breakpoints
 
 
 def read_gtf(gtf, gene_ids):
-    gids = set([x[0] for x in gene_ids])
-    id_to_name = {x[0]: x[1] for x in gene_ids}
-    gtf_dict, transcript_list, tid_to_gname = {}, [], {}
+    gtf_dict = {}
+    transcript_list = []
     with open(gtf) as f:
         for line in f:
             if line.startswith('#'):
@@ -87,9 +83,8 @@ def read_gtf(gtf, gene_ids):
                 end = int(a[4])
                 transcript = a[8].split(' transcript_id "')[1].split('"')[0]
                 gene = a[8].split('gene_id "')[1].split('"')[0]
-                if gene not in gids:
+                if gene not in gene_ids:
                     continue
-                tid_to_gname[transcript] = id_to_name[gene]
                 if transcript not in gtf_dict:
                     gtf_dict[transcript] = []
                 gtf_dict[transcript].append([chromosome, start, end, type1])
@@ -105,8 +100,7 @@ def read_gtf(gtf, gene_ids):
             types.append(part[3])
         transcript_list.append([
             chromosome, min(starts), max(ends),
-            blockstarts, blockwidths, False, types, tid_to_gname[transcript]]
-        )
+            blockstarts, blockwidths, False, types])
     return transcript_list
 
 
@@ -129,14 +123,10 @@ def read_bam(bam, reads):
 
 
 def plot_fusion(ref_genes, alignments, breakpoints, output):
-    alignments = {k: sorted(v) for (k, v) in alignments.items() if len(v) == 2}
-    if breakpoints[0][0] == breakpoints[1][0]:
-        same_chr = True
-    else:
-        same_chr = False
-
+    alignments = {k: v for (k, v) in alignments.items() if len(v) == 2}
+
     gene1_start = min([x[0][1][0][0] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])
-    gene2_end = max([x[1][-1][-1][0] + x[1][-1][-1][1] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])
+    gene2_end = max([x[2] for x in ref_genes if x[0] == breakpoints[1][0]])
     gene1_span = abs(int(breakpoints[0][1]) - gene1_start)
     gene2_span = abs(int(breakpoints[1][1]) - gene2_end)
     total_span = gene1_span + gene2_span
@@ -155,20 +145,7 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
     # plot the gene annotations
     g1_y, g2_y = 0, 0
     for transcript in ref_genes:
-        if same_chr:
-            bp1_dist = abs(int(breakpoints[0][1]) - transcript[1])
-            bp2_dist = abs(int(breakpoints[1][1]) - transcript[1])
-            if bp1_dist < bp2_dist:
-                panel = gene1
-                g1_y += 1
-                ypos = g1_y
-                color = orange
-            else:
-                panel = gene2
-                g2_y += 1
-                ypos = g2_y
-                color = blue
-        elif transcript[0] == breakpoints[0][0]:
+        if transcript[0] == breakpoints[0][0]:
             panel = gene1
             g1_y += 1
             ypos = g1_y
@@ -178,7 +155,6 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
             g2_y += 1
             ypos = g2_y
             color = blue
-        panel.set_ylabel(transcript[7], fontsize=4)
         start, stop = transcript[1], transcript[2]
         panel.plot([start, stop], [ypos]*2, lw=0.2, c=color, zorder=10)
         for i in range(len(transcript[3])):
@@ -194,41 +170,30 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
     for read in list(alignments.values()):
         y += 1
         for locus in read:
-            if same_chr:
-                bp1_dist = abs(int(breakpoints[0][1]) - locus[1][0][0])
-                bp2_dist = abs(int(breakpoints[1][1]) - locus[1][0][0])
-                if bp1_dist < bp2_dist:
-                    panel = reads1
-                    color = orange
-                    start, stop = locus[1][0][0], int(breakpoints[0][1])
-                else:
-                    panel = reads2
-                    color = blue
-                    start, stop = int(breakpoints[1][1]), locus[1][-1][0] + locus[1][-1][1]
-            elif locus[0] == breakpoints[0][0]:
+            if locus[0] == breakpoints[0][0]:
                 panel = reads1
                 color = orange
                 start, stop = locus[1][0][0], int(breakpoints[0][1])
             elif locus[0] == breakpoints[1][0]:
                 panel = reads2
                 color = blue
-                start, stop = int(breakpoints[1][1]), locus[1][-1][0] + locus[1][-1][1]
+                start, stop = locus[1][0][0], locus[1][-1][0] + locus[1][-1][1]
             for block in locus[1]:
                 left, width = block
                 exon = Rect((left, y-0.25), width, 0.5, lw=0, fc=color, zorder=12)
                 panel.add_patch(exon)
             panel.plot([start, stop], [y]*2, lw=0.2, c=color, zorder=10)
 
-    padding = 500
-    gene1.set_xlim(gene1_start-padding, int(breakpoints[0][1])+gene2_span)
-    gene2.set_xlim(int(breakpoints[1][1])-gene1_span, gene2_end+padding)
-    gene1.set_ylim(0, g1_y+1)
-    gene2.set_ylim(0, g2_y+1)
+    gene1.set_ylabel('ASPSCR1', fontsize=5)
+    gene2.set_ylabel('TFE3', fontsize=5)
+    reads1.set_ylabel('Reads')
+
+    gene1.set_xlim(gene1_start, int(breakpoints[0][1])+gene2_span)
+    gene2.set_xlim(int(breakpoints[1][1])-gene1_span, gene2_end)
     reads1.set_ylim(0, len(alignments)+1)
     reads2.set_ylim(0, len(alignments)+1)
-    reads1.set_xlim(gene1_start-padding, int(breakpoints[0][1]))
-    reads2.set_xlim(int(breakpoints[1][1]), gene2_end+padding)
-    reads1.set_ylabel('Reads')
+    reads1.set_xlim(gene1_start, int(breakpoints[0][1]))
+    reads2.set_xlim(int(breakpoints[1][1]), gene2_end)
 
     reads1.spines['right'].set_visible(False)
     reads2.spines['left'].set_visible(False)