CNPI

Copy Number Private Investigator (CNPI) is a copy number analysis toolkit developed by the Tychele N. Turner, Ph.D., Lab at Washington University School of Medicine in St. Louis.

Lead Developer: Jack Ustanik (jacku[at]wustl.edu)

Description

This program (CNPI) reads gzipped output files generated by the QuicK-mer2 program (quickmer2.gz files), which contain copy number estimates for specific windows, and matches these windows to the start and stop locations in a BED file of regions of interest. The example directory contains RefSeq_Curated.bed, a BED file with the coordinates of RefSeq genes in the genome. This file was generated with the Table Browser in the UCSC Genome Browser and includes the region name, chromosome, and start and stop positions. The relevant information in the quickmer2.gz file is the chromosome, the start and stop location of each window, and the copy number estimate for each window. Please note that both files must use the same genome build. The example uses build 38 of the human genome.

Background

In the human genome, DNA copy number is typically two, because one copy of each chromosome is inherited from each parent. Most regions of the genome therefore have a copy number of two; the exception is males, who have a copy number of one for chromosomes X and Y. This is not always the case, because every genome contains regions whose copy number deviates from the expected value. Copy numbers still typically fall around one or two, but duplications and deletions, known as copy number variants (CNVs), shift them for individual chromosomes, genes, or regions. CNVs can be inherited from a parent or arise de novo in the child. Our goal is to identify abnormal regions of a person’s genome from copy number data and reference regions provided by the user of this program. To support this, CNPI can produce genotypes across regions, karyotypes, summary statistics, and images. The hope is to determine which parent each CNV comes from and which alterations lead to phenotypic consequences.

Goals of CNPI

Scan both the regions-of-interest file (the RefSeq file in the example) and the .gz file (QuicK-mer2 output) and compute copy number for each region. Windows are generally smaller than the regions, so many windows usually fall within each region.
Compute statistics for each region: copy number average, weighted average (weighted by the length of the windows included in the region), standard deviation, coefficient of variation, region size, and total number of windows in the region.
Karyotype the individual based on the average of all copy number windows across each chromosome.
Identify chromosomal sex from the X and Y chromosome copy numbers.
Regions with an abnormally high or low copy number should be investigated further, as they could be indicative of phenotypic consequences.
Regions with abnormally high copy numbers should be recorded so that they can be examined more closely.
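
The per-region statistics listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not CNPI's C++ implementation: the function name `region_stats` is hypothetical, windows are `(start, stop, copy_number)` tuples on the same chromosome as the region, each window's weight is taken to be the length of its overlap with the region, the standard deviation is the sample (n - 1) form, and the coefficient of variation is expressed as a percentage.

```python
import math


def region_stats(region_start, region_stop, windows):
    """Summarise the copy number windows that overlap one region.

    windows: iterable of (start, stop, copy_number) on the same chromosome.
    """
    values, weights = [], []
    for start, stop, cn in windows:
        overlap = min(stop, region_stop) - max(start, region_start)
        if overlap > 0:  # keep only windows that intersect the region
            values.append(cn)
            weights.append(overlap)  # assumption: weight = overlap length

    n = len(values)
    if n == 0:
        return None
    mean = sum(values) / n
    weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    # Sample SD is undefined for a single window, so report NaN in that case.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1)) if n > 1 else math.nan
    cv = 100.0 * sd / mean if n > 1 and mean != 0 else math.nan
    return {
        "CN_Average": mean,
        "Weighted_Avg": weighted,
        "CN_SD": sd,
        "CN_CV": cv,
        "Region_Size": region_stop - region_start,
        "Total_Windows": n,
    }


windows = [(60739, 68902, 2.506133), (68902, 82642, 2.202954)]
stats = region_stats(65418, 71585, windows)
print(round(stats["CN_Average"], 3), round(stats["CN_SD"], 3), stats["Total_Windows"])
```

For the OR4F5 region (chr1:65418-71585) and the two example windows that overlap it, the sketch gives an average of 2.355 and a standard deviation of 0.214, in line with the example Genotype.txt output shown later. The overlap-weighted average it computes does not match the 2.203 in that output, so CNPI's weighting presumably differs from the assumption made here.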

Example File input

Example Data

Reference.bed Example

chromosome	start	stop	Region Name
chr1	11873	14409	DDX11L1
chr1	14361	29370	WASH7P
chr1	17368	17436	MIR6859-1
chr1	29773	35418	MIR1302-2HG
chr1	30365	30503	MIR1302-2
chr1	34610	36081	FAM138A

Any tab-delimited file will work as long as its first four columns are as shown above.

Quickmer.bed Example

chromosome	start	stop	Copy Number Estimate
chr1	0	54484	2.081097
chr1	54484	60739	2.930447
chr1	60739	68902	2.506133
chr1	68902	82642	2.202954
chr1	82642	88348	2.219991
chr1	88348	108310	2.107214
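
As a minimal sketch of how a file in this layout can be read (this is not CNPI's own parser, which is written in C++), the Python below parses the four tab-delimited columns from a plain or gzipped file. The names `read_windows` and `Window` are illustrative, and a header row, if present, is skipped on the assumption that its start column is not an integer.

```python
import gzip
from typing import Iterator, NamedTuple


class Window(NamedTuple):
    chrom: str
    start: int
    stop: int
    copy_number: float


def read_windows(path: str) -> Iterator[Window]:
    """Yield one Window per data line of a .bed or .bed.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4 or not fields[1].isdigit():
                continue  # skip blank lines and any header row
            yield Window(fields[0], int(fields[1]), int(fields[2]), float(fields[3]))
```

Because the function is a generator, a large window file can be streamed one line at a time rather than loaded into memory at once.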

Overall Notes

Genetic Variation (Gen_Variation) options are REFERENCE, DUPLICATION, DELETION, and TRISOMY.
Genetic Variation (Gen_Variation) is based on an expected copy number of two for chromosomes 1-22 in all individuals and for the X chromosome in females, and of one for the X and Y chromosomes in males. For a duplicated or deleted chromosome, the program bases the Gen_Variation on a chromosome copy number of 1 or 3.
The following commands sort and filter .bed and .bed.gz files, whether they are reference or genome files.
- Sorting ensures the data is in order for maximum speed. Filtering removes unnecessary records from the files.
To sort and filter .bed files:

grep -E '^(chr[1-9][0-9]*|chrX|chrY)\b' File.bed | sort -k1,1V -k2,2n > sorted_filtered.bed

To sort and filter .bed.gz files:

zgrep -E '^(chr[1-9][0-9]*|chrX|chrY)\b' File.bed.gz | sort -k1,1V -k2,2n | gzip > sorted_filtered.bed.gz

Chromosomal sex is based on the combination of X and Y chromosomes present: if only X is present the sex is female, and if a Y is present the sex is male.
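
A rough Python sketch of this rule, together with the chromosome-level averaging used for karyotyping, is shown below. It is not taken from the CNPI source: the function names `karyotype` and `chromosomal_sex`, the rounding of each chromosome's mean copy number to the nearest integer, and the construction of a 46,XY-style label are assumptions made for illustration.

```python
from collections import defaultdict


def karyotype(windows):
    """Return (label, per-chromosome mean CN) from (chrom, start, stop, cn) rows."""
    totals = defaultdict(lambda: [0.0, 0])  # chrom -> [sum of CN, window count]
    for chrom, _start, _stop, cn in windows:
        totals[chrom][0] += cn
        totals[chrom][1] += 1
    means = {chrom: total / count for chrom, (total, count) in totals.items()}

    # Assumption: the copy count of a chromosome is its mean CN rounded to an integer.
    copies = {chrom: round(mean) for chrom, mean in means.items()}
    x_count = copies.get("chrX", 0)
    y_count = copies.get("chrY", 0)
    label = f"{sum(copies.values())},{'X' * x_count}{'Y' * y_count}"
    return label, means


def chromosomal_sex(label):
    """Apply the rule above: any Y means male, X only means female."""
    sex_chromosomes = label.split(",")[1]
    return "male" if "Y" in sex_chromosomes else "female"
```

With 22 autosomes averaging close to 2 and chrX and chrY each averaging close to 1, as in the example Karyotype.txt, this sketch would produce the label 46,XY.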

Usage

This program reads .gz files containing copy number information for specific windows and matches these windows to the region locations in a reference file. In the example, the reference file is RefSeq_Curated.bed, which holds RefSeq (NCBI) gene annotations; its useful fields are the region name, chromosome, and start and stop positions. The relevant information in the .gz file is the chromosome, the start and stop location of each window, and the copy number for each window.

The following information will be recorded for each reference region:

Transcript	Chromosome	Start	Stop	CN_Average	Weighted_Avg	CN_SD	CN_CV	Region_Size	Total_Windows	Gen_Variation

Compile and Run Commands:

g++ -std=c++11 CNPI.cpp -o CNPI -lz  
./CNPI -d {region.bed} -g {quickmer2.gz}

As a single command:

g++ -std=c++11 differentLengths.cpp -o CNPI -lz && time ./CNPI -d {sorted_annotated_region.bed} -g {quickmer2.gz file} -n {number_of_reference_regions} -o {output_file_names}

Options

-d or -bed_gz_path: .bed or .bed.gz file with regions to match copy number windows against - Required!  
-g or -gz_path: .bed or .bed.gz file containing copy number windows (QuicK-mer2 output) - Required!  
-n or -bed_path_rows: number of rows (regions) in the reference file - Optional  
-c or -cn_rows: number of rows in the QuicK-mer2 file - Optional  
-o or -output_name: preferred name of output file  
-r or -record: returns a record from the reference file to the terminal - Optional  
-s or -average_check: runs the Count Abnormalities function. Given a number, all regions whose copy number differs from the expected copy number by more than that number are saved to a file  
-b or -sd_abnormalities: runs the Standard Deviation Abnormalities function. Given a number, all regions with a standard deviation greater than that number are saved to a file  
-e or -chromosome: displays a given chromosome in the terminal  
-p or -start_stop: displays a given range within a chromosome in the terminal. Requires -e or -chromosome as well
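
The two abnormality filters can be pictured with the Python sketch below. It only illustrates the behaviour described for -s and -b and is not the C++ code: the function names, the dictionary keys (borrowed from the output header), and the `expected` argument are assumptions, and rows whose standard deviation is `nan` are never selected by the second filter.

```python
import math


def average_abnormalities(rows, threshold, expected=2.0):
    """Rows whose CN_Average differs from the expected copy number by more than threshold."""
    return [r for r in rows if abs(r["CN_Average"] - expected) > threshold]


def sd_abnormalities(rows, threshold):
    """Rows whose CN_SD exceeds threshold; NaN values are skipped."""
    return [r for r in rows if not math.isnan(r["CN_SD"]) and r["CN_SD"] > threshold]


rows = [
    {"Region": "DDX11L1", "CN_Average": 2.081, "CN_SD": math.nan},
    {"Region": "OR4F5", "CN_Average": 2.355, "CN_SD": 0.214},
]
print([r["Region"] for r in average_abnormalities(rows, 0.3)])
print([r["Region"] for r in sd_abnormalities(rows, 0.2)])
```

Both calls select only OR4F5 here. For regions on chrX or chrY in a male sample, the expected copy number would be 1 rather than 2, so `expected` would need to be set per chromosome.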

Example Genotype.txt output

Chr	Start	Stop	Region	CN_Average	Weighted_Avg	CN_SD	CN_CV	Region_Size	Total_Windows	Gen_Variation
chr1	11873	14409	DDX11L1	2.081	2.081	nan	nan	2536	1	REFERENCE
chr1	14361	29370	WASH7P	2.081	2.081	nan	nan	15009	1	REFERENCE
chr1	17368	17436	MIR6859-1	2.081	2.081	nan	nan	68	1	REFERENCE
chr1	29773	35418	MIR1302-2HG	2.081	2.081	nan	nan	5645	1	REFERENCE
chr1	30365	30503	MIR1302-2	2.081	2.081	nan	nan	138	1	REFERENCE
chr1	34610	36081	FAM138A	2.081	2.081	nan	nan	1471	1	REFERENCE
chr1	65418	71585	OR4F5	2.355	2.203	0.214	9.087	6167	2	REFERENCE

Example Karyotype.txt

46,XY

Chr	End_Pos	CN_Avg	Tot_Windows	SD
chr1	248944960	2.007	179591	0.255
chr2	242183392	1.995	195597	0.242
chr3	198173712	1.984	162500	0.328
chr4	190191424	1.948	155341	0.311
chr5	181356432	1.984	145554	0.297
chr6	170739136	1.987	139223	0.288
chr7	159333744	1.981	122770	0.281
chr8	145076864	1.977	118263	0.277
chr9	138233904	1.984	90671	0.274
chr10	133787152	1.997	106366	0.272
chr11	135076032	1.985	107079	0.268
chr12	133263864	1.995	106197	0.265
chr13	114351408	1.946	80972	0.266
chr14	106883664	1.999	71915	0.263
chr15	101980848	2.027	63245	0.266
chr16	90215248	2.031	59384	0.415
chr17	83245312	2.051	59497	0.413
chr18	80261320	1.967	63126	0.405
chr19	58607060	2.012	39356	0.402
chr20	64333792	2.012	50089	0.400
chr21	46681576	2.021	27792	0.402
chr22	50799832	2.043	26463	0.403
chrX	156029984	1.000	114705	1.042
chrY	56884848	1.028	10049	1.016

Process

How CNPI Code Runs

Python Plotting

Pairs with the C++ code to provide custom visualization of copy number and genotype output.

Required

Karyotype file, which tells the program whether the sample is 46,XY, 46,XX, etc.
quickmer.gz file against which window statistics will be computed

Goals

Uses Python to create figures from copy number data
The plotting program can output all values outside a given threshold
It can also save regions with a given number of windows that are consecutively outside the threshold. Up to 3 fails are allowed within each chunk
Writes this information to a file
Creates images for the ranges that are specified

Files need to be the same length when plotting a trio. Use the sort and filter commands to remove unnecessary lines.

Usage

Run as:

python3 CNPI_plotting.py -f sorted_filtered.bed.gz -r Karyotype.txt

Possible Commands

-f FILE: The file of the child or patient to plot  
-p file1 file2: Files containing information for the parents of the child. With two parents and a child we can plot a trio; with one parent, a duo  
-r REFERENCE: Background information on the chromosomes and sex of the child (the karyotype file)  
-w MINWINDOW: The minimum number of consecutive windows that must fall outside the 1.3 to 2.7 range. Values outside this range indicate an alteration from normal copy number  
-i WINDOWBUFFER: The copy number may oscillate around the 1.3 or 2.7 threshold. If it moves back into the normal range and then out again, this buffer is the number of consecutive windows that may fall within the normal range before the copy number falls out again  
-se SELECTCHRM: The chromosome to visualize, if you want to plot a particular one  
-start STARTPOS: The start position, if you want to plot a specific range of a chromosome  
-stop STOPPOS: The stop position, if you want to plot a specific range of a chromosome  
-gstat GSTAT_TXT: For including transcript regions in the plots


usage: CNPI_plotting.py -f FILE -r REFERENCE [-w MINWINDOW] [-i WINDOWBUFFER] [-se SELECTCHRM] [-start STARTPOS] [-stop STOPPOS] [-gstat GSTAT_TXT] [-p [file1 [file2]]] [-h]
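
To make the interaction of MINWINDOW and WINDOWBUFFER concrete, here is a small Python sketch of one way such a scan could work. It is an interpretation of the option descriptions above, not the code in CNPI_plotting.py: the function name `find_abnormal_runs`, its argument names, and the exact rules (a run ends once more than `window_buffer` consecutive in-range windows are seen, and is kept only if it contains at least `min_window` out-of-range windows) are assumptions.

```python
def find_abnormal_runs(values, min_window, window_buffer, low=1.3, high=2.7):
    """Return (first_index, last_index) pairs of runs outside [low, high].

    values: copy number per window, in genomic order for one chromosome.
    """
    runs = []
    start = last_out = None  # start of current run, last out-of-range index
    out_count = 0            # out-of-range windows in the current run
    in_streak = 0            # consecutive in-range windows since last_out

    def close_run():
        # Keep the run only if it has enough out-of-range windows.
        if start is not None and out_count >= min_window:
            runs.append((start, last_out))

    for i, cn in enumerate(values):
        if cn < low or cn > high:
            if start is None:
                start, out_count = i, 0
            last_out, in_streak = i, 0
            out_count += 1
        elif start is not None:
            in_streak += 1
            if in_streak > window_buffer:  # buffer exhausted: the run is over
                close_run()
                start = last_out = None
                out_count = in_streak = 0
    close_run()
    return runs


values = [2.0, 3.1, 3.2, 2.5, 3.3, 3.0, 2.0, 2.1, 2.0, 1.0]
print(find_abnormal_runs(values, min_window=3, window_buffer=1))
```

With a buffer of 1, the single in-range window at index 3 does not break the run, so windows 1 through 5 are reported as one event. With a buffer of 0 the same data splits into two runs of two windows each, neither of which reaches a minimum of 3.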

Dockers

Linux

docker run -v "$(pwd):/app:ro" -it jackust/cn_docker:CNPI_Linux_V1.0

Mac

docker run -v "$(pwd):/app:ro" -it jackust/cn_docker:CNPI_Mac_V1.0

Example Pictures

Plotting of a Region Outside of Normal (1.3 - 2.7) Threshold

Dashed lines at the bottom mark the locations of QuicK-mer2 windows
Colored boxes within the graph represent genes, corresponding to the genes in the table on the right
Red dashed lines mark the bounds of the normal copy number range (1.3 - 2.7)
The blue line in the middle marks a copy number of 2
The plotted blue line running through the graph shows the copy number across the chromosome

Duplication Event Along Chromosome 16

Plotting Based off an Inputted Range

Red dashed lines mark the bounds of the normal copy number range (1.3 - 2.7)
The blue line in the middle marks a copy number of 2
The plotted blue line running through the graph shows the copy number across the inputted range

Copy Number Across Chromosome 6

Abnormal Copy Number Across Chromosome 12
