Skip to content

Commit

Permalink
Merge pull request #6 from PacificBiosciences/update_v0.2.2
Browse files Browse the repository at this point in the history
Update v0.2.2
  • Loading branch information
velociroger-pb authored Jun 30, 2023
2 parents b43a537 + 59ca2e1 commit ffead68
Show file tree
Hide file tree
Showing 2 changed files with 116 additions and 110 deletions.
151 changes: 96 additions & 55 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
3. [Output](#output)
4. [Examples](#examples)
5. [Accessory scripts](#accessory)
6. [What's up with all the immune genes?](#immune)


## Install <a name="install"></a>
Expand All @@ -27,13 +28,13 @@ Binaries are also availible in the github releases.

## Usage <a name="usage"></a>

`pbfusion` has two primary executables: `pbfusion` and `gffcache`.
`pbfusion` has two subcommands: `pbfusion discover` and `pbfusion gff-cache`.

`gffcache` is not required, but recommended when running `pbfusion` multiple times. `gffcache` will serialize the input GTF/GFF file and preprocess into exonic intervals ahead of time, which is fairly slow to do on the fly.
`pbfusion gff-cache` is not required, but recommended when running `pbfusion` multiple times. `pbfusion gff-cache` will serialize the input gtf/gff file and preprocess into exonic intervals ahead of time, which is fairly slow to do on the fly.
You can find GENCODE annotation files [here](https://www.gencodegenes.org/human/release_38.html).

```
Usage: gffcache [OPTIONS] --gtf <ReferenceAnnotation>
Usage: pbfusion gff-cache [OPTIONS] --gtf <ReferenceAnnotation>
Options:
-g, --gtf <ReferenceAnnotation> Input GTF file
Expand All @@ -50,74 +51,67 @@ Options:
3. The output prefix. `pbfusion` writes multiple files, prefixed with the user specified string.

```
A tool for detecting fusion genes in aligned PacBio RNA data
Identify fusion genes in aligned PacBio Iso-Seq data
Usage: pbfusion [OPTIONS] --bam <FILE> --reference-gtf <REF> --output-prefix <OUTPUT>
Usage: pbfusion discover [OPTIONS] --bam <FILE> --gtf <REF> --output-prefix <OUTPUT>
Options:
-b, --bam <FILE>
Aligned FLNC or transcript BAM file
-g, --reference-gtf <REF>
Input annotations in gtf format
Aligned Iso-Seq data in BAM format
-g, --gtf <REF>
Reference gene annotations in GTF format
-o, --output-prefix <OUTPUT>
Output bed file prefix [default: none]
-d, --breakpoint-group-distance <MAX_CLUST_DIST>
Maximum allowed distance (in bp) for clustering breakpoints [default: 500]
Output prefix [default: none]
-t, --threads <THREADS>
Number of threads [default: 1]
-m, --min-coverage <MIN_COVERAGE>
Filter out breakpoint groups with coverage lower than this number of reads (exclusive) [default: 2]
-c, --min-coverage <MIN_COVERAGE>
Assigns "low confidence" to fusion calls with read coverage below the minimum coverage threshold [default: 2]
-i, --min-mean-identity <MIN_MEAN_IDENTITY>
Assigns "low confidence" to fusion calls where the mean alignment identity iew below the threshold [default: 0.85]
-p, --min-mean-mapq <MIN_MEAN_MAPQ>
Assigns "low confidence" to fusion calls where the mean mapq is below the threshold [default: 0.]
-M, --min-fusion-read-fraction <MIN_FUSION_READ_FRACTION>
Remove breakpoint pairs from groups if they have gene alignments which fewer than <arg> reads in group have [default: 0.25]
-s, --max-variability <MAX_VARIABILITY>
Assigns "low confidence" to fusion calls with the mean breakpoint distance is above the threshold [default: 1000]
-a, --max-readthrough <MAX_READTHROUGH>
Assigns "low confidence" to fusion calls spanning two genes below the readthrough threshold [default: 100000]
-r, --real-cell-filtering
Real-cell filtering for single-cell data. Iso-Seq reads annotated with zero "rc" tag value will be filtered
-l, --min-segment-length <MIN_SEGLEN>
Minimum alignment length, alignments below the threshold will be filtered before clustering [default: 25]
-Q, --min-fusion-quality <MIN_FUSION_QUALITY>
Determine the minimum fusion quality to emit. Choices: must be either LOW or MEDIUM [default: MEDIUM]
-v, --verbose...
Verbose output
Apply once for info-level messages and additional output information
Apply twice or more for debug logs
If --log-level is set, this option takes precedence.
Enable verbose output. Enable verbose output
--log-level <LOG_LEVEL>
Alternative to repeated -v/--verbose: set log level via key.
Values: "error", "warn" (default), "info", "debug", "trace".
Enabling any level higher than "warn" also emits verbose output, including extra output files.
If -v/--verbose is set, this option is ignored.
Equivalence to -v/--verbose:
=> "warn"
-v => "info"
-vv => "debug"
-vvv => "trace" [default: error]
-T, --fusion-readthrough-threshold <fusion_readthrough_threshold>
Distance to use to determine if a transcript aligned to two nearby genes on the same strand
is a read-through event or a canonical fusion. [default: 10000]
-F, --filter-groups-failing-any
Filter out groups with breakpoint pairs failing any category.
Standard behavior is to filter out groups with pairs failing all categories, which is more permissive.
-R, --emit-readthrough
To emit ReadThrough events. These are not emitted by default
-S, --emit-sense-antisense
To emit SenseAntisense events. These are not emitted by default
-A, --emit-unannotated-exons
To emit unannotated exon events. These are not emitted by default
-O, --emit-overlapping
To emit overlapping gene events. These are not emitted by default
-a, --emit-all-breakpoints
Emit all breakpoints. Enables --emit-unannotated-exons, --emit-sense-antisense, --emit-overlapping, and --emit-readthrough
-n, --novel-exon-discovery
Whether to discover "novel exons"
=> "WARN"
-v => "INFO"
-vv => "DEBUG"
-vvv => "TRACE" [default: error]
-h, --help
Print help information (use `--help` for more detail)
Print help (see more with '--help')
-V, --version
Print version information
Print version
Copyright (C) 2004-2023 Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.
```

By default, readthrough and sense-antisense chimeras, and overlapping gene events, and events involving unannotated exons are filtered out.
By default, readthrough and sense-antisense chimeras, and overlapping gene events, and events involving unannotated exons are marked as low-quality.

They can be emitted by enabling relevant `--emit-{category}` flags, or all can be enabled with `-a/--emit-all-breakpoints`.
They can be unfiltered by enabling `--emit-all-breakpoints.`


## Output <a name="output"></a>

`pbfusion` produces **one output file designed for end users**: `{prefix}.breakpoints.groups.bed`. All other files are auxiliary and usually used for diagnostic purposes.
`pbfusion discover` produces **one output file designed for end users**: `{prefix}.breakpoints.groups.bed`. All other files are auxiliary and usually used for diagnostic purposes.

If verbose output is enabled (`-v`), five output files sharing the same prefix are emitted.

Expand All @@ -136,32 +130,42 @@ If verbose output is enabled (`-v`), five output files sharing the same prefix a
The clustered breakpoint call file is [BEDPE](https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format) file formatted including header lines.

Notes:
- The cluster score is set to a constant "MEDIUM"
- The cluster score is set to "MEDIUM" unless the group is considered low-quality, in which case it is "LOW"
- Scores may be set to "LOW" depending on:
- Low coverage
- Low alignment identity
- High breakpoint variability
- You can `grep` the output for specific gene names or gene IDs if you're fishing for a specific fusion gene pair


### Filtering options

The primary filtering options to reduce false positives are `--min-coverage` (`-m`) and `--fusion-readthrough-threshold` (`-T`).
Filtering is now primarily done through adjusting the allowed score [default "MEDIUM"].
The primary filtering options to reduce false positives are `--min-coverage` (`-c`), `--max-readthrough` (`-a`), `--min-mean-identity` (`-i`), `--min-mean-mapq` (`-p`), `--min-fusion-read-fraction` (`-M`), and `--max-variability` (`-s`).
`--min-coverage` just filters out breakpoints based on their read support, where the default value is 2 to filter out singletons.
`--fusion-readthrough-threshold` is used to discard reads that align to two genes next to each other in the genome.
By default, this is set to a modest 10kb, but for more stringent fusion gene calling, we recommend setting this option to 100kb.
Another option to use for more stringent fusion gene calling is `--filter-groups-failing-any`.
This means that breakpoint pairs that fail any of our internal checks (readthrough, strand switch, unannotated exons, or overlapping genes), that pair will not get emitted.
`--max-readthrough` is used to discard reads that align to two genes next to each other in the genome [default 100kb].
`--min-mean-identity` will assign low confidence to fusions with mean mapping identity lower than the threshold [0.85]. Sometimes in hard to map regions of the genome, the aligner will force an alignment through a region and incur a high edit distance.
`--min-mean-mapq` default is set to `0.` because even with high alignment identity, some genes have close relatives where the aligner will assign a MAPQ of 0.
`--min-fusion-read-fraction` is used to filter long chains of genes.
An example of this would be an IG alignment, where maybe 100 reads align to various annotated regions (eg. IGHA1, IGHV3-23, IGHV3-7, IGHG3, IGHJ5, IGHJ4, IGHJ3, IGHGP, IGHJ2, IGHJ6).
Given a read coverage of these genes like [100, 100, 90, 90, 20, 10, 10, 5, 5, 5], we would by default filter genes with coverage lower than 25% of the total read count for this fusion.
With that filter, you're left with this coverage: [100, 100, 90, 90], which with the read filtering brings it down to [80, 80, 70, 70].
Because this fusion still has >3 genes in it, it would get filtered out.
`--max-variability` allows you to filter based on breakpoint variability [default 1000].


## Examples <a name="examples"></a>

Serializing the input gtf file:
```
gffcache \
pbfusion gff-cache \
--gtf gencode.v38.annotation.gtf \
--gtf-out gencode.v38.annotation.gtf.bin
```

Running `pbfusion` on aligned reads:
```
pbfusion \
pbfusion discover \
--bam isoseq.aligned.bam \
--reference-gtf gencode.v38.annotation.gtf.bin \
--output-prefix isoseq \
Expand Down Expand Up @@ -192,6 +196,7 @@ Visualization dependencies:


### Extract tag
**This script is deprecated now that cell barcodes are emitted automatically**
The main use case for this script is to output associated cell barcodes for fusion gene reads.
`extract_tag.py` will take a BAM file, BAM tag, and a list of read names (can be taken from the output BEDPE and edited for one readname per line).
The output is tab delimited with the read name and its associated cell barcode.
Expand All @@ -206,10 +211,46 @@ python3 extract_tag.py \
>cell_associations.tsv
```

This script will be deprecated once cell barcodes are emitted automatically.

## Changelog
* v0.1.0 - initial release.
## What's up with all the immune genes? <a name="immune"></a>
If you see a lot of immune genes in your output (eg. IGH, IGL, IGK, TRA, TRB), this is completely normal.
These genes are recombined at the genomic level (VDJ or VJ recombination), so aside from filtering the gene names, these will be detected as fusions.


### Changelog
Changelog - PacBio Fusion Detection - pbfusion

## v0.2.2 6/29/23
### Changes
- Command-line interface, filtering, and formatting changes.
- Option to filter by average mapping quality on either side of a breakpoint.
- Option to filter reads from groups based on gene coverage (as a %-age of total reads for that group).
- Filters out entries where the number of genes is 1 or >3.
- We now mark events by quality, and then filter by quality levels instead of directly filtering events.
These events can be emitted with `--min-fusion-quality LOW` instead of the default `MEDIUM`.
- Simplified output. Fewer categories, simpler identifiers, and a new header format that is easier to parse.
- Number of reads filter now checks the `im` bam tag for clustered input. Now, the number of original reads is used for filtering instead of the number of clusters.

## v0.2.1: 6/8/23
### Changes
- Add options to filter segments by length and error rate.

## v0.2.0: 5/22/23
### Changes
- Join pbfusion and pbgffcache into one executable `pbfusion` with subcommands `pbfusion discover` and `pbfusion gff-cache` corresponding to the previous.

## v0.1.2: 5/19/23
### Changes
- Add edit distance per segment as a feature of Interval for filtering.

## v0.1.1: 4/26/23
### Changes
- Add cell barcodes if found, and original read names if reads are clustered to output file. The first tag set in `CB`, `XC`, and `CR` is used.
- Add `--max-mean-breakpoint-distance` flag, which marks breakpoint groups as low quality if the mean distance of breakpoints in a cluster exceeds this threshold.

## v0.1.0
### Changes
- Initial release

## Disclaimer

Expand Down
75 changes: 20 additions & 55 deletions scripts/visualize_fusion.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,25 +56,21 @@ def read_bedpe(bedpe):
'''
with open(bedpe) as f:
for line in f:
if line.startswith('#') or not line.rstrip():
if line.startswith('#'):
continue
sl = line.rstrip().split('\t')
breakpoints = ((sl[0], sl[1]), (sl[3], sl[4]))
reads = set(sl[11][6:].split(','))
aux = sl[10].split(';')
for field in aux:
if field.startswith('GENE_IDS='):
if field.startswith('GI='):
gene_ids = field[9:].split(',')
if field.startswith('GENE_NAMES='):
gene_names = field[11:].split(',')
gene_ids = list(zip(gene_ids, gene_names))
return reads, gene_ids, breakpoints


def read_gtf(gtf, gene_ids):
gids = set([x[0] for x in gene_ids])
id_to_name = {x[0]: x[1] for x in gene_ids}
gtf_dict, transcript_list, tid_to_gname = {}, [], {}
gtf_dict = {}
transcript_list = []
with open(gtf) as f:
for line in f:
if line.startswith('#'):
Expand All @@ -87,9 +83,8 @@ def read_gtf(gtf, gene_ids):
end = int(a[4])
transcript = a[8].split(' transcript_id "')[1].split('"')[0]
gene = a[8].split('gene_id "')[1].split('"')[0]
if gene not in gids:
if gene not in gene_ids:
continue
tid_to_gname[transcript] = id_to_name[gene]
if transcript not in gtf_dict:
gtf_dict[transcript] = []
gtf_dict[transcript].append([chromosome, start, end, type1])
Expand All @@ -105,8 +100,7 @@ def read_gtf(gtf, gene_ids):
types.append(part[3])
transcript_list.append([
chromosome, min(starts), max(ends),
blockstarts, blockwidths, False, types, tid_to_gname[transcript]]
)
blockstarts, blockwidths, False, types])
return transcript_list


Expand All @@ -129,14 +123,10 @@ def read_bam(bam, reads):


def plot_fusion(ref_genes, alignments, breakpoints, output):
alignments = {k: sorted(v) for (k, v) in alignments.items() if len(v) == 2}
if breakpoints[0][0] == breakpoints[1][0]:
same_chr = True
else:
same_chr = False

alignments = {k: v for (k, v) in alignments.items() if len(v) == 2}

gene1_start = min([x[0][1][0][0] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])
gene2_end = max([x[1][-1][-1][0] + x[1][-1][-1][1] for x in list(alignments.values()) if x[0][0] == breakpoints[0][0]])
gene2_end = max([x[2] for x in ref_genes if x[0] == breakpoints[1][0]])
gene1_span = abs(int(breakpoints[0][1]) - gene1_start)
gene2_span = abs(int(breakpoints[1][1]) - gene2_end)
total_span = gene1_span + gene2_span
Expand All @@ -155,20 +145,7 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
# plot the gene annotations
g1_y, g2_y = 0, 0
for transcript in ref_genes:
if same_chr:
bp1_dist = abs(int(breakpoints[0][1]) - transcript[1])
bp2_dist = abs(int(breakpoints[1][1]) - transcript[1])
if bp1_dist < bp2_dist:
panel = gene1
g1_y += 1
ypos = g1_y
color = orange
else:
panel = gene2
g2_y += 1
ypos = g2_y
color = blue
elif transcript[0] == breakpoints[0][0]:
if transcript[0] == breakpoints[0][0]:
panel = gene1
g1_y += 1
ypos = g1_y
Expand All @@ -178,7 +155,6 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
g2_y += 1
ypos = g2_y
color = blue
panel.set_ylabel(transcript[7], fontsize=4)
start, stop = transcript[1], transcript[2]
panel.plot([start, stop], [ypos]*2, lw=0.2, c=color, zorder=10)
for i in range(len(transcript[3])):
Expand All @@ -194,41 +170,30 @@ def plot_fusion(ref_genes, alignments, breakpoints, output):
for read in list(alignments.values()):
y += 1
for locus in read:
if same_chr:
bp1_dist = abs(int(breakpoints[0][1]) - locus[1][0][0])
bp2_dist = abs(int(breakpoints[1][1]) - locus[1][0][0])
if bp1_dist < bp2_dist:
panel = reads1
color = orange
start, stop = locus[1][0][0], int(breakpoints[0][1])
else:
panel = reads2
color = blue
start, stop = int(breakpoints[1][1]), locus[1][-1][0] + locus[1][-1][1]
elif locus[0] == breakpoints[0][0]:
if locus[0] == breakpoints[0][0]:
panel = reads1
color = orange
start, stop = locus[1][0][0], int(breakpoints[0][1])
elif locus[0] == breakpoints[1][0]:
panel = reads2
color = blue
start, stop = int(breakpoints[1][1]), locus[1][-1][0] + locus[1][-1][1]
start, stop = locus[1][0][0], locus[1][-1][0] + locus[1][-1][1]
for block in locus[1]:
left, width = block
exon = Rect((left, y-0.25), width, 0.5, lw=0, fc=color, zorder=12)
panel.add_patch(exon)
panel.plot([start, stop], [y]*2, lw=0.2, c=color, zorder=10)

padding = 500
gene1.set_xlim(gene1_start-padding, int(breakpoints[0][1])+gene2_span)
gene2.set_xlim(int(breakpoints[1][1])-gene1_span, gene2_end+padding)
gene1.set_ylim(0, g1_y+1)
gene2.set_ylim(0, g2_y+1)
gene1.set_ylabel('ASPSCR1', fontsize=5)
gene2.set_ylabel('TFE3', fontsize=5)
reads1.set_ylabel('Reads')

gene1.set_xlim(gene1_start, int(breakpoints[0][1])+gene2_span)
gene2.set_xlim(int(breakpoints[1][1])-gene1_span, gene2_end)
reads1.set_ylim(0, len(alignments)+1)
reads2.set_ylim(0, len(alignments)+1)
reads1.set_xlim(gene1_start-padding, int(breakpoints[0][1]))
reads2.set_xlim(int(breakpoints[1][1]), gene2_end+padding)
reads1.set_ylabel('Reads')
reads1.set_xlim(gene1_start, int(breakpoints[0][1]))
reads2.set_xlim(int(breakpoints[1][1]), gene2_end)

reads1.spines['right'].set_visible(False)
reads2.spines['left'].set_visible(False)
Expand Down

0 comments on commit ffead68

Please sign in to comment.