Skip to content

Create a "Diploid Genome" and "Diploid GTF GFF" file (method B)

Kiran N' Bishwa edited this page Jun 11, 2018 · 1 revision

Create a Diploid Genome and GTF,GFF (method-B) using:


Below, is a bash script using a for-loop implementation for all available samples.

#!/bin/bash

set -euf pipefail  # to break the pipeline if a workflow is disturbed in any of the code lines.

# redirect stdout/stderror to a file
exec &> CreateDiploidGenome.log

#### Function/Purpose:  
# ** Note: Now, we have phased VCF after phaseExtension and Stitching we can extract the Gene sequences using Ref Genome, GTF/GFF and VCF. 
# This gene sequence can be used for 
  # - aligning reads competitively to diploid transcriptome 
  # - used the sample wise genes, transcript sequence for population genetics analyses.

echo "Part A: Convert the VCF to gzipped format and create index"
# Convert VCF to gzipped VCF and create tabix index 
bgzip -f phased.MySpF1.vcf
tabix -f phased.MySpF1.vcf.gz

echo
echo "Part B: Assign the file and folder path"
## Assign the file and folder path
PhasedVcfGz=phased.MySpF1.vcf.gz
G2GTools=/home/priyanka/g2gtools-master/bin/g2gtools
RefGenome=lyrata_genome1To10.fa
RefGtf=lyrata_RawatV2.chr1To10.gtf
RefGff=lyrata.V2.chr1To10.gff

echo
echo "Part C: Starting G2Gtools"
echo "Creating folder for storing output from G2Gtools"
G2GOUT=g2gResults
rm -rf ${G2GOUT}; mkdir ${G2GOUT}


# Now, start exacting required data for each sample.
echo
echo "Starting for loop on given sample names"
echo

# Put, the name of all the required samples. 
for item in ms01e ms02g ms03g ms04h MA605 MA611 MA622 MA625 MA629 Ncm8 Sp3 Sp21 Sp76 Sp154 Sp164 SpNor33
do
    echo
    echo "working on sample ${item}"

    # make sample wise directory on the fly
    rm -rf ${G2GOUT}/${item}_g2gOut; mkdir ${G2GOUT}/${item}_g2gOut 

    # 01) now, create VCI file for each sample
    echo
    echo "Step 01: making VCI file for the given sample ${item}"
    python ${G2GTools} vcf2vci -i ${PhasedVcfGz} \
      -s ${item} -o ${G2GOUT}/${item}_g2gOut/${item}.vci --diploid

    echo
    echo "Step 02 - A: Creating Diploid Genome and patching SNPs"
    # 02 - A) Create Diploid Genome using reference genome and patch SNP using sample wise VCI file
    python ${G2GTools} patch -i ${RefGenome} \
      -c ${G2GOUT}/${item}_g2gOut/${item}.vci.gz \
      -o ${G2GOUT}/${item}_g2gOut/${item}.SNP.pathched.fa

    echo
    echo "Step 02 - B: Patching InDels to the SNP patched Genome"
    # 02 - B) Patch InDels to the SNP patched Diploid Genome
    python ${G2GTools} transform -i ${G2GOUT}/${item}_g2gOut/${item}.SNP.pathched.fa \
      -c ${G2GOUT}/${item}_g2gOut/${item}.vci.gz \
      -o ${G2GOUT}/${item}_g2gOut/${item}.Snp.Indels.patched.fa

    echo
    echo "Step 03: Creating sample GTF,GFF from sample ${item} VCI and reference GTF,GFF file"
    # 03) Convert reference GFF and GTF to sample wise GTF and GFF

    echo "Step 03 - A: Creating sample ${item} GTF file"
    # 03 - A) Ref Gtf to sample Gtf
    python ${G2GTools} convert -i ${RefGtf} \
      -c ${G2GOUT}/${item}_g2gOut/${item}.vci.gz \
      -o ${G2GOUT}/${item}_g2gOut/${item}.gtf

    echo "Step 03 - B: Creating sample ${item} GFF file"
    # 03 - B) Ref Gff to sample Gff
    python ${G2GTools} convert -i ${RefGff} \
      -c ${G2GOUT}/${item}_g2gOut/${item}.vci.gz \
      -o ${G2GOUT}/${item}_g2gOut/${item}.gff

    echo
    echo "Step 04: Creating SQLite database from personal ${item} GTF"
    # *Note: Need to add a method to prepare SQlite database from personal GFF.
    # 04) Convert sample Gtf to SQlite database file
    # Parse custom gene annotation into g2g database file
    python ${G2GTools} gtf2db \
      -i ${G2GOUT}/${item}_g2gOut/${item}.gtf \
      -o ${G2GOUT}/${item}_g2gOut/${item}.gtf.db


    echo
    echo "Step 05 : Extract genes, transcritps sequences"
    # 05) Extract genes, transcripts and exons
      # for now, we are only extracting genes and transcripts

    echo "Step 05 -A : Extracting gene sequences for this sample ${item}"
    # 05 - A) Genes
    python ${G2GTools} extract \
      -i ${G2GOUT}/${item}_g2gOut/${item}.Snp.Indels.patched.fa \
      -db ${G2GOUT}/${item}_g2gOut/${item}.gtf.db \
      --genes > ${G2GOUT}/${item}_g2gOut/${item}.genes.fa

    echo "Step 05 - B: Extracting transcript sequences for this sample ${item}"
    # 05 - B) Transcripts
    python ${G2GTools} extract \
      -i ${G2GOUT}/${item}_g2gOut/${item}.Snp.Indels.patched.fa \
      -db ${G2GOUT}/${item}_g2gOut/${item}.gtf.db \
      --transcripts > ${G2GOUT}/${item}_g2gOut/${item}.transcripts.fa

    echo
    echo "Completed Sample ${item}"
    echo "##### ****** #####"
    echo

done

echo "####    Completed all G2GTools process    ####"
echo

Next Step: Now, proceed to competitive alignment on diploid genome.

Competitive alignment can be do on either Reference Transcriptome, Reference Genome or Both.

  • Alignment to Diploid Transcriptome can be done using Bowtie
  • Alignment to Diploid Genome can be done using using any reference genome base RNAseq alignment tool.
  • Alignment to both Genome and Transcriptome can be done using rnaSTAR.