
Converting Data to BEDPE Format

Will Casazza edited this page Oct 22, 2017 · 6 revisions

BEDPE Notes

Overview

Notes from our discussions on the BEDPE format we're aiming for as a common representation. To present paired-coordinate observations (such as gene fusions) in a standard way, we chose the BEDPE format as described by bedtools.

Certain fields that map to BEDPE are not reported consistently across formats, or are reported in a way that does not map cleanly onto the BEDPE categories. We have therefore agreed on conventions for the following grey areas and exceptions:

  • If a tool defines a score or quality-metric, that should be included as-is in the score field
  • If there is no 'score' defined, fill with '0' for 'missing'
  • Similarly, if there is an obvious 'name', use that. If not, either leave blank or fill with a simple fusion string: "GENE1-GENE2"
  • If tools only define a single breakpoint for each side of the fusion, the coordinates used in the BEDPE file should indicate that breakpoint using (preferably) a single base where the breakpoint occurs - see PAVfinder output for a representative example
  • For each feature associated with the first or second encountered gene (such as chrom1/chrom2 or strand1/strand2), if a fusion-finding output file does not specify which event is first and which is second, we assign the first reported gene to 1 and the second reported gene to 2.
  • For all additional columns not mapping to a standard BEDPE category, the columns should be appended to our BEDPE file in the order in which they appear in the original format from left to right
  • Unless otherwise specified by the input format, all start coordinates from the input format are assumed to be 0-based (i.e. the first base is at position 0) and all end coordinates are assumed to be 1-based, which matches the BED convention
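
The conventions above can be sketched in a few lines of Python. This is only an illustration, not the code used by our converters: the function name `to_bedpe_row`, its parameters, and the example coordinates are all made up for this sketch. It assumes each side of the fusion is reported as a single breakpoint position.

```python
def to_bedpe_row(chrom1, pos1, chrom2, pos2, gene1, gene2,
                 strand1=".", strand2=".", name=None, score=None,
                 one_based=True, extra=()):
    """Build one BEDPE row (list of strings) from two single-base breakpoints.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    def interval(pos):
        # BEDPE interval for a single base: 0-based start, 1-based end.
        start = pos - 1 if one_based else pos
        return start, start + 1

    start1, end1 = interval(pos1)
    start2, end2 = interval(pos2)
    row = [
        chrom1, start1, end1,
        chrom2, start2, end2,
        name if name else f"{gene1}-{gene2}",   # fall back to "GENE1-GENE2"
        0 if score is None else score,          # '0' stands for 'missing'
        strand1, strand2,
    ]
    row.extend(extra)  # non-BEDPE columns, kept in their original order
    return [str(field) for field in row]


# Made-up coordinates, reported 1-based by an imaginary tool.
row = to_bedpe_row("chr1", 100, "chr2", 200, "GENE1", "GENE2",
                   strand1="+", strand2="-", extra=["12"])
print("\t".join(row))
```

Passing `one_based=False` treats the reported position as already 0-based, which is the default assumption when the input format says nothing about indexing.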

Assumptions and issues specific to fusion detection tools

FusionCatcher

A description of FusionCatcher's report of gene fusions can be found at the FusionCatcher GitHub page. Gene fusions are reported in a tab-delimited file named final-list_candidate_fusion_genes.txt. No scoring information is provided. Both start and end coordinates of gene fusions are 1-based.
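
Because FusionCatcher positions are 1-based, each one has to be shifted before it goes into the BEDPE start column. The sketch below assumes the fusion point is given as a `chromosome:position:strand` string; that layout, and the helper name `parse_fusion_point`, are assumptions for illustration and should be checked against the actual output file.

```python
def parse_fusion_point(fusion_point):
    """Split an assumed 'chrom:position:strand' string into BEDPE-style fields.

    Returns (chrom, start, end, strand), where start is 0-based and
    end is 1-based, both describing the single breakpoint base.
    """
    chrom, pos, strand = fusion_point.split(":")
    pos = int(pos)          # 1-based position as reported
    return chrom, pos - 1, pos, strand


# Made-up example value.
print(parse_fusion_point("17:1000:+"))
```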

EricScript

A description of EricScript's output can be found on the EricScript Google Site. No indexing information for the start and end coordinates is provided there. The score reported is an aggregate of 3 measures devised by the authors of EricScript (Benelli et al., 2012), produced by an AdaBoost classifier trained on synthetic data. More information can be found in the associated publication here.

SOAPfuse

A description of SOAPfuse's output can be found on the SOAPfuse SourceForge Wiki. No indexing information for the start and end coordinates is provided there. There is no score measure, but the read counts reported for spanning and junction reads could possibly be used to develop a score in the future. For more information see the SOAPfuse section of the SOAP website.

FusionMap

A description of FusionMap's output can be found on the FusionMap Array Suite Wiki. No indexing information for the start and end coordinates is provided there. No score is specified, but a score could possibly be derived from count information provided in the output. For more information see the associated publication here.

deFuse

A description of deFuse's output can be found on the deFuse SourceForge Wiki. The indexing information is not explicitly stated, yet the format appears to be 0-based for start coordinates and 1-based for end coordinates. No singular score is provided, but several different P values are reported, so in the future these could either be combined or one value could be selected as a measure of confidence of the detection. For more information see deFuse's associated publication here.

JAFFA

A description of JAFFA's output can be found on the JAFFA GitHub Wiki. No indexing information for the start and end coordinates is provided there. There is no numerical score reported but the tool does report low, medium, and high confidence categories, which one could feasibly map to a set value using the standard for confidence reported by JAFFA's authors. For more information see the publication associated with JAFFA here.
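
One possible way to turn the confidence categories into a BEDPE score is a simple lookup table. The numeric values below are placeholders chosen for this sketch, not a standard defined by JAFFA's authors, and the exact label spellings in JAFFA's output are an assumption; anything unrecognised falls back to '0' for 'missing'.

```python
# Hypothetical mapping: the numbers are placeholders, not JAFFA-defined values.
CONFIDENCE_SCORES = {"low": 1, "medium": 2, "high": 3}


def confidence_to_score(label, missing=0):
    """Map a confidence label such as 'HighConfidence' to a numeric score."""
    key = label.lower().replace("confidence", "").strip()
    return CONFIDENCE_SCORES.get(key, missing)


for label in ["HighConfidence", "medium", "Low Confidence", "SomethingElse"]:
    print(label, confidence_to_score(label))
```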

STAR-Fusion

A description of STAR-Fusion's output can be found on the STAR-Fusion GitHub Wiki. No indexing information for the start and end coordinates is provided there, but according to their source code it appears to be 1-based, so that is how our parser treats the data. No score is specified, but the FFPM (fusion fragments per million total reads) value could be used as a significance/quality cutoff for found fusions. The format is tabular and newline terminated, and it may make sense to include it.

InFusion

A description of InFusion's output can be found on the InFusion Bitbucket Wiki. No indexing information for the start and end coordinates is explicitly stated there. No single score is provided, but a properly aligned pairs metric is included. For more information see the publication associated with InFusion here.

PRADA

A PDF containing a description of PRADA's output can be found on the PRADA webpage. Both start and end coordinates appear to be 0-based. No single score is provided. For more information see the publication associated with PRADA here. Some 'Junction' cells contain multiple gene fusion events for the same gene pair. In these cases, each distinct event was split out into its own entry with separate coordinate information.
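
Splitting a multi-event 'Junction' cell might look like the sketch below. The cell layout used here (`GENE1:chrom:pos_GENE2:chrom:pos,count` entries separated by `|`) is an assumption made for illustration and should be verified against real PRADA output; the function name and the example values are likewise made up. The sketch would also need adjusting for gene symbols that contain an underscore.

```python
def split_junctions(cell):
    """Split an assumed PRADA 'Junction' cell into one record per event."""
    entries = []
    for junction in cell.split("|"):
        # Assumed layout: "GENE1:chrom:pos_GENE2:chrom:pos,count"
        coords, _, support = junction.partition(",")
        side1, side2 = coords.split("_")
        gene1, chrom1, pos1 = side1.split(":")
        gene2, chrom2, pos2 = side2.split(":")
        entries.append({
            "gene1": gene1, "chrom1": chrom1, "pos1": int(pos1),
            "gene2": gene2, "chrom2": chrom2, "pos2": int(pos2),
            "support": int(support) if support else None,
        })
    return entries


# Made-up cell with two events for the same gene pair.
events = split_junctions("GENEA:1:100_GENEB:2:200,5|GENEA:1:150_GENEB:2:250,3")
for event in events:
    print(event)
```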

Assumptions and issues specific to database reports of gene fusions

COSMIC

Some issues:

COSMIC flat file conversion

  • No genomic co-ordinate information is provided, but about a third of the 80,000 entries had HGVS notation for the fusion
  • Implemented in R; we ended up having to loop per row and query the Ensembl API to convert HGVS RNA notation into genomic co-ordinates.
  • Transcripts are not versioned, so the API returned a genomic co-ordinate per transcript version. At a quick look the first one returned seems to be the most recent version, but this hasn't been explicitly confirmed.
  • Ended up being very slow because a vectorised approach wasn't discovered.
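
One way to reduce the cost of the per-row loop, short of a truly vectorised approach, is to query each distinct HGVS string only once and reuse the result. The sketch below is in Python rather than R and is not the conversion code described above: `lookup` stands in for whatever function performs the actual Ensembl API request, and both it and `convert_unique` are hypothetical names. It only helps if the same HGVS string occurs on more than one row.

```python
def convert_unique(hgvs_values, lookup):
    """Apply `lookup` once per distinct HGVS string, preserving row order.

    `lookup` is a placeholder for the function that queries the API.
    """
    cache = {}
    results = []
    for hgvs in hgvs_values:
        if hgvs not in cache:
            cache[hgvs] = lookup(hgvs)   # only the first occurrence hits the API
        results.append(cache[hgvs])
    return results


# Stand-in lookup that records how often it is called (no network access).
calls = []

def fake_lookup(hgvs):
    calls.append(hgvs)
    return "coords-for-" + hgvs

converted = convert_unique(["t1", "t2", "t1", "t1"], fake_lookup)
print(converted, "lookups made:", len(calls))
```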

CIViC

ChimerDB

Downloads for ChimerDB, ChimerSeq and ChimerPub were available on ChimerDB. The fields in the downloaded files are similar to, but not the same as, the fields described for the GUI on the Help page. ChimerPub has a confidence score based on top text-mined gene fusion sentences. ChimerPub lacks many of the BEDPE fields and instead provides annotation descriptors based on text-mining PubMed.

Pancreatic Cancer Action Network (PanCan)

No formal description of the database format was available on the Tumor Fusion Gene Data Portal website. However, there is some information on scoring used, and the centrality score, a measure of functional damage of a given gene fusion, was reported in our converted BEDPE score column. For more information on the centrality score see TARGETgene software framework and the Tumor Fusion Gene Data Portal help section.

Some issues:

  • Data is tabular but lines are terminated with carriage returns rather than newlines, which is an issue for parsing downstream.
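
A minimal sketch of one way to handle the carriage returns before parsing, assuming the file is tab-delimited and small enough to hold in memory (the function name `read_rows` is made up for this example):

```python
def read_rows(text):
    """Normalise '\\r\\n' and bare '\\r' line endings to '\\n', then split into rows."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Skip empty lines, e.g. the one left after a trailing line terminator.
    return [line.split("\t") for line in normalized.split("\n") if line]


# Made-up content terminated with carriage returns only.
rows = read_rows("geneA\tgeneB\rgeneC\tgeneD\r")
print(rows)
```

When reading from disk, Python's default text mode already translates carriage returns into newlines, so the explicit replacement is mainly needed when the text arrives by some other route.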