1.3 Prepare fusion pipeline reference data

Yaobo Xu edited this page Aug 20, 2019 · 1 revision

Overview

For each reference-gene build, cgpRna expects reference data to be present in a particular structure. Pre-generated reference data is available for the following reference/gene builds:

  • GRCh38 Ensembl release 77 (full name of reference build is GRCh38_full_analysis_set_plus_decoy_hla)
  • GRCh37 Ensembl release 75 (full name of reference build is GRCH37d5)

If you require reference data for other combinations of reference and gene builds, you will need to build it by following the instructions below.

Pre-generated data

The pre-generated reference data for cgpRna can be downloaded from ftp://ftp.sanger.ac.uk/pub/cancer/support-files/cgpRna/

To decompress run command: tar -zxvf <ref>.tar.gz -C /path/to/decompress/to

You will need approx. 200GB space for each reference set of data.

N.B. The /path/to/decompress/to will then become /ref/data/root or the -r parameter when running the pipeline (see usage for running the mapping and qc pipeline as an example).
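Since each reference set needs roughly 200GB, it can be worth checking the free space on the target filesystem before decompressing. The following is a minimal sketch only: has_space is a hypothetical helper (not part of cgpRna), and it assumes a POSIX df that supports the -P and -k options.

```shell
# Hypothetical helper: succeed if <dir> has at least <need_gb> GB available.
# Assumes POSIX "df -Pk" output, where column 4 is the available space in KB.
has_space() {
    dir=$1
    need_gb=$2
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example use (paths are placeholders):
#   has_space /path/to/decompress/to 200 && tar -zxvf <ref>.tar.gz -C /path/to/decompress/to
```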

Building reference data from scratch

N.B. Do this after running the setup.sh script to install cgpRna software

The reference data needs to have the following basic structure:
<ref-data-root> / <species> / <reference-genome-build> / <algorithm> / <gene-build> e.g.
../ref-data/human/hg38/star/77

  • Reference fasta (.fa) genome files and accompanying .fa.fai files should be placed under the reference-genome-build folder e.g. genome.fa and genome.fa.fai are under <ref-data-root>/human/GRCh38 in the pre-generated data set
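The directories referred to in the sections below can be created in one go. This is a sketch under the assumption that you want all of the sub-folders mentioned on this page; make_ref_skeleton is a hypothetical helper name, not a cgpRna script.

```shell
# Hypothetical helper: create the folder skeleton described on this page.
# Arguments: <ref-data-root> <species> <reference-genome-build> <gene-build>
make_ref_skeleton() {
    root=$1
    species=$2
    build=$3
    gene_build=$4
    base="$root/$species/$build"
    mkdir -p \
        "$base/cgpRna/$gene_build" \
        "$base/star/$gene_build" \
        "$base/defuse/$gene_build/defuse-index" \
        "$base/tophat/$gene_build" \
        "$base/tophat/blast" \
        "$base/vagrent/$gene_build"
}

# Example use (the root path is a placeholder):
#   make_ref_skeleton /path/to/ref-data human GRCh38 77
```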

Shared reference files

  1. Create the following directory structure: /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build> e.g. <ref-data-root>/human/GRCh38/cgpRna/77

  2. A list of normal fusions is needed to filter out false positive calls by the script filter_fusions.pl. The normal-fusion file is a single column of fusion breakpoints in the format: chr1:pos1-chr2:pos2 e.g. 10:100000027-12:93371979. Place the normal fusions file in this location: /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/normal-fusions

  3. Download the Ensembl GTF file for the gene annotation version you are using and place it in the following directory with the name ensembl.gtf: /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build>. For example, for e77 (which is compatible with GRCh38) the gtf file was downloaded from http://ftp.ensembl.org/pub/release-77/gtf/homo_sapiens/, then decompressed and renamed to ensembl.gtf. N.B. If the reference fasta being used contains "chr" in the chromosome names, "chr" will also need to be added to the chromosome names in the ensembl.gtf file.
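If "chr" does need to be added to the GTF, an awk one-liner is usually enough. The sketch below is an assumption about how you might do it rather than the method used for the pre-generated data: add_chr_prefix is a hypothetical helper, it leaves header lines starting with # untouched, and it prefixes every sequence name as-is, so names that differ in other ways between the two conventions (for example MT versus chrM) would still need handling separately.

```shell
# Hypothetical helper: print a GTF with "chr" prepended to column 1.
# Comment/header lines (starting with "#") are passed through unchanged.
add_chr_prefix() {
    awk 'BEGIN {FS = OFS = "\t"}
         /^#/ {print; next}
         {$1 = "chr" $1; print}' "$1"
}

# Example use (file names are placeholders):
#   add_chr_prefix Homo_sapiens.GRCh38.77.gtf > ensembl.gtf
```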

Next, index files for the three fusion algorithms need to be built as follows.

Building star reference files

This part can be skipped if you have already followed the instructions on 2. Mapping and QC reference data

To generate the star reference data:

  1. Create the following directory structure: /<ref-data-root>/<species>/<reference-genome-build>/star/<gene-build> e.g. <ref-data-root>/human/GRCh38/star/77

  2. Run the following command to generate the index files:

<installation-directory>/bin/STAR --runMode genomeGenerate --genomeDir <ref-data-root>/human/<reference-genome-build>/star --genomeFastaFiles <ref-data-root>/human/<reference-genome-build>/genome.fa --sjdbOverhang 99

where installation-directory = path_to_install_to when the setup.sh script was run

N.B. The --sjdbOverhang parameter should be set to the read length minus 1. The RNA-seq data the pipeline was developed for contained libraries of 2x75bp and 2x100bp. Where there is a mixture of read lengths like this, it is recommended to base the parameter on the longer length, so we used 99.

It is also possible to add annotated transcript information to the genome index at this stage, which greatly improves mapping. From STAR version 2.4.1a onwards this can instead be included on the fly during the mapping step, which is what the cgpRna star_fusion pipeline does with the parameter --sjdbGTFfile /human/GRCh38/77/ensembl.gtf, so please ensure you have installed a version of STAR compatible with this functionality.

  3. Create a soft link to the Ensembl GTF file: cd /<ref-data-root>/<species>/<reference-genome-build>/star/<gene-build>; ln -s /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build>/ensembl.gtf
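The --sjdbOverhang rule from the N.B. above (longest read length minus 1) can be written as a small calculation. This is purely illustrative; sjdb_overhang is a hypothetical helper, not a STAR or cgpRna command.

```shell
# Hypothetical helper: given one or more read lengths, print the value to use
# for --sjdbOverhang, i.e. the longest read length minus 1.
sjdb_overhang() {
    max=0
    for len in "$@"; do
        if [ "$len" -gt "$max" ]; then
            max=$len
        fi
    done
    echo $((max - 1))
}

# Example use: libraries of 2x75bp and 2x100bp
#   sjdb_overhang 75 100    # prints 99
```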

Building defuse reference files

  1. Create the following directory structure /<ref-data-root>/<species>/<reference-genome-build>/defuse/<gene-build>/defuse-index e.g. /<ref-data-root>/human/GRCh38/defuse/77/defuse-index

  2. Copy the following file in the defuse installation directory <installation-directory>/bin/defuse_install/scripts/config.txt to /<ref-data-root>/<species>/<reference-genome-build>/defuse/<gene-build>/defuse-config.txt
    where installation-directory = path_to_install_to when the setup.sh script was run

  3. Edit the config file and update the following values. You may already have updated the source_directory value and the Paths to external tools section as instructed when the setup.sh script was run. Ensure there are no spaces at the end of each parameter value.

  • ensembl_version = { e.g. 77 }
  • ensembl_genome_version = { e.g. GRCh38 }
  • ucsc_genome_version = { e.g. hg38 }
  • source_directory = { <installation-directory>/bin/defuse_install/ }
  • dataset_directory = { the full path of the defuse-index directory created in step 1. }
  • Under the section titled # Paths to external tools provide full paths (including the name of the executable file itself) to the following software that you should have installed when running the cgpRna setup.sh script; samtools, bowtie, bowtie_build, blat, faToTwoBit, R, Rscript, gmap and gmap_build
  4. Run the following script:

<installation-directory>/bin/defuse_install/scripts/create_reference_dataset.pl -c defuse-config.txt

This will take a long time to run (approx 12 hours) but assuming the values in the updated config file (created in steps 2 and 3) are correct, the defuse index files will be generated under
/<ref-data-root>/<species>/<reference-genome-build>/defuse/<gene-build>/defuse-index
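Because the index build takes so long, it is worth checking the edited config for trailing spaces (see step 3) before starting it. The following is a sketch using standard grep and sed; the two function names are hypothetical and the output file name in the example is only a suggestion.

```shell
# Hypothetical helper: list (with line numbers) any config lines that end in
# whitespace. Prints nothing and returns non-zero if there are none.
find_trailing_space() {
    grep -n '[[:space:]]$' "$1"
}

# Hypothetical helper: print the config with trailing whitespace removed.
strip_trailing_space() {
    sed -E 's/[[:space:]]+$//' "$1"
}

# Example use:
#   find_trailing_space defuse-config.txt
#   strip_trailing_space defuse-config.txt > defuse-config.clean.txt
```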

Building tophat reference files

TopHat uses bowtie as its aligner, so bowtie index files need to be generated first. TopHat2 (which is the version installed by setup.sh) is able to use bowtie1 or bowtie2 indexes; the index files can be distinguished by the file extension, which is .ebwt for bowtie1 and .bt2 for bowtie2. As tophat is being used in this pipeline for fusion calling (rather than mapping) and tophat-fusion-post currently only works with bowtie1 index files, bowtie1 indexes are required.

In addition, tophat-fusion-post is very particular about requiring "chr" in the chromosome names, so if you are using a reference fasta without "chr" you will need to prepare two sets of bowtie index files: 1 - based on your reference fasta and associated transcriptome for the initial alignment, and 2 - bowtie1 genome index files with "chr" added.
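If you already have index files and are unsure which bowtie version built them, the file extensions described above can be checked directly. A minimal sketch, assuming only the .ebwt and .bt2 extensions mentioned on this page; bowtie_index_type is a hypothetical helper.

```shell
# Hypothetical helper: report whether <index_base_name> is a bowtie1 or
# bowtie2 index, based on the extension of its first index file.
bowtie_index_type() {
    if [ -e "$1.1.ebwt" ]; then
        echo bowtie1
    elif [ -e "$1.1.bt2" ]; then
        echo bowtie2
    else
        echo none
        return 1
    fi
}

# Example use (the path is a placeholder):
#   bowtie_index_type /path/to/ref-data/human/GRCh38/genome
```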

In order to build the index files:

  1. Create the following directory structure: /<ref-data-root>/<species>/<reference-genome-build>/tophat/<gene-build> e.g. <ref-data-root>/human/GRCh38/tophat/77

  2. Run the following command to generate the index files: <installation-directory>/bin/bowtie-build <reference_fasta> <index_base_name> e.g.

<installation-directory>/bin/bowtie-build /<ref-data-root>/<species>/<reference-genome-build>/genome.fa <ref-data-root>/human/GRCh38/genome

  3. Create a soft link to the Ensembl GTF file in the gene build directory: cd /<ref-data-root>/<species>/<reference-genome-build>/tophat/<gene-build>; ln -s /<ref-data-root>/<species>/<reference-genome-build>/cgpRna/<gene-build>/ensembl.gtf

  4. Next, generate the transcriptome index files using tophat: <installation-directory>/bin/tophat --bowtie1 -G <gtf-file> --transcriptome-index <index_base_name> <index_base_name_from_step2> N.B. Do not run this command from the --transcriptome-index location, as tophat seems to exit with an error message. Running the command in the directory /human/GRCh38/tophat worked in testing:

<installation-directory>/bin/tophat --bowtie1 -G ensembl.gtf --transcriptome-index <ref-data-root>/human/GRCh38/tophat/77/transcriptome <ref-data-root>/human/GRCh38/genome

  5. Index the transcriptome.fa file that should have been generated in /human/GRCh38/tophat/77/ using samtools, which will be installed if setup.sh has been run successfully:

<installation-directory>/bin/samtools faidx <ref-data-root>/human/GRCh38/tophat/77/transcriptome.fa

  6. If your fasta file does not contain "chr" in the chromosome names, obtain the equivalent chr fasta, from NCBI for example, and place that and the fasta index (.fai) in the following directory with the name tophatpost.genome.fa (and tophatpost.genome.fa.fai): /<ref-data-root>/<species>/<reference-genome-build>/tophat. Then run the bowtie-build command from step 2 again, but use the index_base_name as follows...

<installation-directory>/bin/bowtie-build tophatpost.genome.fa <ref-data-root>/human/GRCh38/tophat/tophatpost.genome

  7. If your genome reference does contain "chr" names, then create soft links for all the genome bowtie index files in the tophat folder, along with the fasta and fasta.fai files, i.e.

cd <ref-data-root>/human/GRCh38/tophat; ln -s ../genome.1.ebwt tophatpost.genome.1.ebwt

Repeat for all the *ebwt files and also create the soft links for tophatpost.genome.fa and tophatpost.genome.fa.fai (so there should be 8 soft links in total).

  8. Next, download refGene.txt and ensGene.txt for the species and genome build you are using and place them in the following locations: /<ref-data-root>/<species>/<reference-genome-build>/tophat/refGene.txt and /<ref-data-root>/<species>/<reference-genome-build>/tophat/<gene-build>/ensGene.txt* Note: the refGene.txt and ensGene.txt files can be downloaded from:

http://hgdownload.soe.ucsc.edu/goldenPath/<genome_version>/database/ e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/

  9. Download human_genomic*, other_genomic*, and nt* from the BLAST database and put them in a directory called blast in this location: /<ref-data-root>/<species>/<reference-genome-build>/tophat/blast
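For a reference that already contains "chr" names, the tophatpost soft links described above can be created in a loop rather than one at a time. This is a sketch that assumes the genome fasta and bowtie1 index files sit in the reference-genome-build folder with the base name genome, as in the examples on this page; link_tophatpost_index is a hypothetical helper.

```shell
# Hypothetical helper: inside <tophat_dir>, create tophatpost.* soft links to
# the genome bowtie1 index, fasta and fasta index in the parent folder.
link_tophatpost_index() {
    tophat_dir=$1
    (
        cd "$tophat_dir" || exit 1
        for f in ../genome.*.ebwt ../genome.fa ../genome.fa.fai; do
            # Skip patterns that matched nothing
            [ -e "$f" ] || continue
            ln -s "$f" "tophatpost.${f#../}"
        done
    )
}

# Example use (the path is a placeholder):
#   link_tophatpost_index /path/to/ref-data/human/GRCh38/tophat
```

With the six .ebwt files that bowtie-build normally writes for a bowtie1 index, plus the fasta and its .fai, this should give the 8 soft links mentioned above.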

* The tophat fusion website doesn't give any indication of which Ensembl build the ensGene.txt file corresponds to. It is possible to create your own version of the ensGene.txt file from an input Ensembl gtf file e.g. Homo_sapiens.GRCh38.77.gtf.

The tool that is used to create ensGene.txt is called gtfToGenePred and can be downloaded (for Linux) from here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/. The tool is run in the following way:

gtfToGenePred <ensembl-gtf-file> -genePredExt ensGene_<version>.txt

Some additional processing is required for the file to work with tophat-fusion-post. A column of integers needs to be added at the start of the file; adding the number 1 to every row is sufficient. It may also be necessary to add chr in front of the chromosome name, which is column 2 of the gtfToGenePred output and becomes column 3 of the updated file. An awk statement like this does both (leave out the $2 assignment if your chromosome names already contain chr):

awk 'BEGIN {FS = OFS = "\t"} {$2 = "chr" $2; print 1, $0}' ensGene_<version>.txt > ensGene.txt

If the format of the chromosome names differs from the chromosome information given in the file fusions.out generated by the first stage of tophat-fusion, then tophat-fusion-post will ignore the ensGene file and the program will complete after a few minutes, yielding no results.
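Since a badly formatted ensGene.txt fails silently, a quick sanity check of the processed file may save time. The sketch below assumes the usual 15-column genePredExt layout plus the added integer column (16 tab-separated columns in total) and that chromosome names should start with chr; check_ensgene is a hypothetical helper, and it does not compare the names against fusions.out.

```shell
# Hypothetical helper: check that every row of a processed ensGene.txt has
# 16 tab-separated columns, an integer in column 1 and a "chr" name in column 3.
# Returns non-zero (and reports the line numbers on stderr) if any row fails.
check_ensgene() {
    awk 'BEGIN {FS = "\t"; bad = 0}
         NF != 16 || $1 !~ /^[0-9]+$/ || $3 !~ /^chr/ {
             printf "line %d looks wrong\n", NR > "/dev/stderr"
             bad = 1
         }
         END {exit bad}' "$1"
}

# Example use:
#   check_ensgene ensGene.txt && echo "ensGene.txt looks OK"
```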

Building the VAGrENT cache file

The final stage of the fusion pipeline compares and annotates the fusions called by STAR-Fusion, Tophat-Fusion and deFuse, which is done by running the script compare_overlapping_fusions.pl. The annotation part makes use of the CGP-IT VAGrENT algorithm, which in turn uses Ensembl annotation for the gene build of interest. This requires a cache file, which can either be downloaded from a public FTP site or built using a VAGrENT script. VAGrENT is a dependency of cgpRna, so it should have been installed prior to running setup.sh for cgpRna.

  1. Create a vagrent folder and gene-build sub-folder in this location:

/<ref-data-root>/<species>/<reference-genome-build>/vagrent/<gene-build> e.g. <ref-data-root>/human/GRCh38/vagrent/77

  2. Generate the vagrent cache file by running the script Admin_EnsemblReferenceFileGenerator.pl, which should be located in the /bin directory of the installation path you provided when installing VAGrENT via the setup.sh script. The command is:

perl /software/CGP/canpipe/test/bin/Admin_EnsemblReferenceFileGenerator.pl -o <outdir> -sp <species> -as <ref-build> -d <ensembl-core-db-version> -f <ensembl-ftp-directory-containing-the-cDNA-fasta-sequence-files> e.g.

perl /software/CGP/canpipe/test/bin/Admin_EnsemblReferenceFileGenerator.pl -o /ref/human/GRCh38/vagrent/77 -sp human -as GRCh38 -d homo_sapiens_core_77_38p -f ftp://ftp.ensembl.org/pub/release-77/fasta/homo_sapiens/cdna/

  3. Rename the files generated to remove the ref-gene-build prefix e.g.

mv /ref/human/GRCh38/vagrent/77/Homo_sapiens.GRCh38.77.vagrent.cache.gz /ref/human/GRCh38/vagrent/77/vagrent.cache.gz
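If several files were generated with the same prefix, the renaming can be done in a loop. This is a sketch only: strip_build_prefix is a hypothetical helper, and it assumes every file to be renamed starts with the prefix followed by a dot, as in the example above.

```shell
# Hypothetical helper: in <dir>, rename every file called <prefix>.<rest>
# to <rest>, e.g. Homo_sapiens.GRCh38.77.vagrent.cache.gz -> vagrent.cache.gz
strip_build_prefix() {
    dir=$1
    prefix=$2
    for f in "$dir/$prefix".*; do
        # Skip the pattern itself if nothing matched
        [ -e "$f" ] || continue
        name=${f##*/}
        new=${name#"$prefix".}
        mv "$f" "$dir/$new"
    done
}

# Example use:
#   strip_build_prefix /ref/human/GRCh38/vagrent/77 Homo_sapiens.GRCh38.77
```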