Installation

bcbio installed with PATH and PYTHONPATH set.
crg and crt cloned to ~/crg and ~/crt and added to PATH.

Usage

Create a bed file for small and structural variants prioritization
- request a list of ensembl_ids for genes
- if a gene list comes from Phetotips:
  Rscript ~/bioscripts/genes.R phenotips_hpo2gene_coordinates phenotips_hpo.tsv. Stringr should be >=1.4.
- or use genes.R in a custom case
- Some genes might be missing (don't have ENS IDs in a phenotips tsv file, they are reported by script, you can try ~/cre/data/missing_genes_grch37.bed or GeneCards/Ensembl resources to find them).
- sort and merge with bedtools\
```
bedtools sort -i unsorted.bed > sorted.bed
bedtools merge -i sorted.bed > project.bed
```
- result is project.bed
Align reads vs GRCh37 reference with decoy
- Create a project(=case=family) dir:
  mkdir -p project/input
- Copy/symlink input file(s) to project/input: project_sample.bam, or project_sample_1.fq.gz and project_sample_2.fq.gz
- Create bcbio project:
  crg.prepare_bcbio_run.sh project align_decoy
- Run bcbio project:
  qsub ~/cre/bcbio.pbs -v project=project
  For multiple projects create list of projects in projects.txt and run
  qsub -t 1-N ~/cre/bcbio.array.pbs
  where N = number of projects.
- To speed up the process, run one project per sample.
Remove decoy reads:
qsub ~/cre/cre.bam.remove_decoy_reads.sh -v bam=$bam.
Keep original bam with decoy reads to store all data.
Some SV callers (manta) are sensitive to reads mapped to decoy even with one mate.
Call small variants
- Create a project dir:
  mkdir -p project/input
- Symlink bam file(s) from step1 to project/input: project_sample.bam Small variant calling is not sensitive to the presense of decoy reads.
- Create bcbio project:
  crg.prepare_bcbio_run.sh project no_align
- Run bcbio:
  qsub ~/cre/bcbio.pbs -v project=project
- Clean up bcbio project:
  qsub ~/cre/cre.sh -v family=<project>,cleanup=1,make_report=0,type=wgs
Create excel reports for small variants.
- coding report:
  qsub ~/cre/cre.sh -v family=project
- noncoding variants for gene panels:
  - subset variants:
    bedtools intersect --header -a project-ensemble.vcf.gz -b panel.bed > project.panel.vcf.gz
  - reannotate variants in panels and create gemini.db:
    qsub ~/cre/cre.vcf2cre.sh -v original_vcf=project.panel.vcf.gz,project=project
  - build report:
    qsub ~/cre/cre.sh -f family=project,type=wgs
- noncoding variants for gene panels with flank
  - modify bed file, add 100k bp to each gene start and end:
    cat panel.bed | awk -F "\t" '{print $1"\t"$2-100000"\t"$3+100000'
  - proceed as for noncoding small variant report
- de-novo variants for trios
Call structural variants (in parallel with step 3)
- MetaSV calls spades - a genome assembler, for every SV, making bcbio run computationally intensive. To speed up use sv_regions.bed and call samples individually. They are combined downstream during report generation.
- Create project dir:
  mkdir -p project/input
- Symlink a bam file. from step 2 to project/input: project_sample.bam.
- Copy project.bed to project.input
- Create bcbio project:
  crg.prepare_bcbio_run.sh project sv project/input/sv_regions.bed\
- Run bcbio:
  qsub ~/cre/bcbio.pbs -v project=project
Create excel reports for structural variants (Report columns)
- Navigate to project/sv
- Report on SV's occuring in each sample:
  - Run: crg.sv.prioritize.sh sample panel.bed on the *-metasv.vcf.gz file in each sample's folder.
  - Optional crg.sv.prioritize.sh will produce a sample.tsv file. A contact in TCAG can annotate this file to add an additional column to the sample's SV report, DGV. Otherwise, this column will show up as NA. Run crg.sv.prioritize.sh sample panel.bed tcag_annotated_file.tsv to produce this report.
- Report on SV's across multiple samples:
  - Gather each report from the previous step in to a single directory
  - Run: crg.intersect_sv_reports.sh project to produce a single report summarizing structural variants across all samples

AnnotSV

AnnotSV must be set up as apart of the local environment to generate family level reports. Users should set FeaturesOverlap and SVtoAnnOverlap to 50 in the configFile. Because these scripts group SV's which have a 50% recipricol overlap, annotation should follow a similar rule.

Report columns:

CHR
POS
GT
SVTYPE
SVLEN
END
SOURCES: which programs called the event
NUM_SVTOOLS: how many programs supported the event
GENES: genes overlapping the event (mostly one gene)
ANN: raw annotation from VEP
SVscores: SVscore-github, SVscore-article
- SVSCOREMAX
- SVSCORESUM
- SVSCORETOP5
- SVSCORETOP10
- SVSCOREMEAN
DGV: frequency in DGV, 8000 WGS

Family report

crg.intersect_sv_reports.py generates a report summarizing structural variants across several samples. It groups structural variants of similar size and position in to a single "reference" structural variant. Grouping is useful when analyzing families as most structural variants should be similar and conserved across samples.

The script produces a CSV file which can be analyzed using spreadsheet software.

Family report requirements:

~/gene_data directory containing the following files:

HGMD=${HOME}/gene_data/HGMD_2018/hgmd_pro.db
EXON_BED=${HOME}/gene_data/protein_coding_genes.exons.fixed.sorted.bed
HPO=${HOME}/gene_data/HPO_2018/${FAMILY_ID}_HPO.txt
EXAC=${HOME}/gene_data/ExAC/fordist_cleaned_nonpsych_z_pli_rec_null_data.txt
OMIM=${HOME}/gene_data/OMIM_2018-11-01/genemap2.txt

AnnotSV

AnnotSV must be set up as apart of the local environment to generate a family report. Users should set FeaturesOverlap and SVtoAnnOverlap to 50 in the configFile. Because these scripts group SV's which have a 50% recipricol overlap, annotation should follow a similar rule.

DGV and DDD columns are annotated by AnnotSV.

Family report columns:

In-depth column descriptions

Includes all of the columns above, except SOURCES, NUM_SVTOOLS, SVTYPE and ANN, in addition to:

N_SAMPLES: number of samples a reference interval overlaps with
LONGEST_SVTYPE: SVTYPE taken from the longest overlapping SV
EXONS_SPANNED: number of exons a SV affects
DECIPHER_LINK: hyperlink to DECIPHER website for the reference interval
DGV_GAIN_IDs
DGV_GAIN_n_samples_with_SV
DGV_GAIN_n_samples_tested
DGV_GAIN_Frequency
DGV_LOSS_IDs
DGV_LOSS_n_samples_with_SV
DGV_LOSS_n_samples_tested
DGV_LOSS_Frequency
DDD_SV
DDD_DUP_n_samples_with_SV
DDD_DUP_Frequency
DDD_DEL_n_samples_with_SV
DDD_DEL_Frequency
OMIM MIM#
OMIM INHERITANCE
OMIM DESCRIPTION
SYNZ
MISZ
PLI
GENES_IN_HGMD
HGMD_DISEASE
HGMD_TAG
HGMD_DESCRIPTION
HGMD_COMMENT
HGMD_JOURNAL_INFO
HGMD_GROSS_INSERTION: gross (>20bp) insertion events in this gene that have been observed in HGMD
HGMD_GROSS_DUPLICATION: gross duplication events in this gene that have been observed in HGMD
HGMD_GROSS_DELETION: gross deletion events in this gene that have been observed in HGMD
HGMD_COMPLEX_VARIATION: complex variations (combination of indels, translocations, SNP, fusions, inversions) in this gene that have been observed in HGMD
SAMPLE: does this sample have an overlapping SV in it? (0,1)
SAMPLE_details: what are the SV's in this sample which overlap with the reference?
SAMPLE_GENOTYPE

Result dir structure:

project(family)_ID:

bcbio-align: config and final dirs from bcbio align-decoy run
bcbio-small-variants: bcbio configs and vcfs from bcbio small variant, output of cre for coding report
bcbio-sv: SV output from bcbio
genes: HPO, gene list, bed file
panel: non-coding report for gene panel cre dir
panel-flank100k: non-coding report for gene panel +100k flank cre dir
reports: csv report we send
tcag: tcag analysis
bam and bai files (without ready) - in the top directory to for easy access, bams from align-decoy step!

Use case: compared SV calls from TCAG (ERDS) to MetaSV

bcftools view -i 'ALT="<DEL>"' 159_CH0315.pass.vcf.gz | bcftools query -f '%CHROM\t%POS\t%INFO/END\n' -o 159.metasv.del.bed
cat 159_CH0315.erds+_db20171204_20180815_3213_annotated.tsv |  awk '$5 ~/DEL/{print $2"\t"$3"\t"$4}'  > 159.tcag.del.bed
bedtools intersect -a 159.tcag.dup.bed -b 159.metasv.dup.bed -f 0.5 -wo -r | wc -l

Name		Name	Last commit message	Last commit date
Latest commit History 139 Commits
config		config
LICENSE		LICENSE
README.md		README.md
crg.cnvkit.sh		crg.cnvkit.sh
crg.intersect_sv_reports.py		crg.intersect_sv_reports.py
crg.intersect_sv_reports.sh		crg.intersect_sv_reports.sh
crg.manta.parse.sh		crg.manta.parse.sh
crg.prepare_bcbio_run.sh		crg.prepare_bcbio_run.sh
crg.sv.parse.py		crg.sv.parse.py
crg.sv.prioritize.sh		crg.sv.prioritize.sh
crg.svscore.sh		crg.svscore.sh
crg.validation.SV.HuRef.md		crg.validation.SV.HuRef.md
crg.variants.filter.py		crg.variants.filter.py
crg.vcf2cre.sh		crg.vcf2cre.sh
crg.vcf2tsv.py		crg.vcf2tsv.py
crg.vcfanno.conf		crg.vcfanno.conf
crg.vcfanno.lua		crg.vcfanno.lua
crg.vcfanno.sh		crg.vcfanno.sh
crt.tcag.subset.py		crt.tcag.subset.py
structuralvariant.py		structuralvariant.py
xhmm.step1.sh		xhmm.step1.sh
xhmm.step2.sh		xhmm.step2.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

Usage

AnnotSV

Report columns:

Family report

Family report requirements:

~/gene_data directory containing the following files:

AnnotSV

Family report columns:

Result dir structure:

Use case: compared SV calls from TCAG (ERDS) to MetaSV

About

Releases

Packages

Languages

License

nvteja/crg

Folders and files

Latest commit

History

Repository files navigation

Installation

Usage

AnnotSV

Report columns:

Family report

Family report requirements:

~/gene_data directory containing the following files:

AnnotSV

Family report columns:

Result dir structure:

Use case: compared SV calls from TCAG (ERDS) to MetaSV

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages