-
Notifications
You must be signed in to change notification settings - Fork 2
/
tutorial.0.quickstart.txt
166 lines (113 loc) · 6.67 KB
/
tutorial.0.quickstart.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
0 1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
################################################################################
CELLULOID, an R package, is a tool to evaluate ploidy, cellularity, clonal and
sub-clonal populations from tumour whole genome sequencing data, and to
appropriately derive absolute copy numbers profiles.
Read INSTALL.txt for instructions on how to install celluloid.
################################################################################
INPUT FILES
Input files used in the tutorial are included in the Data directory. Unzip
before use.
CELLULOID expects chromosome numbering to include the prefix "chr", as in
chr1, chr2, ... Chromosomes X and Y can be labelled chrX chrY or chr23 chr24.
The main input files are .wig files that are needed by the HMMcopy package.
See http://compbio.bccrc.ca/software/hmmcopy/ for instructions on how to turn
traditional bam sequence files into wig files, including wig files for gc- and
mappability-related data. Here the files normal.wig, tumour.wig, gc.wig and
map.wig contain more than the standard chromosomes. However, only the autosomal
chromosomes are used in deriving ploidy (autosomal ploidy) and cellularity.
The file AR.txt contains the columns:
CHR POS REF_COUNT VAR_COUNT
chr1 754730 31 17
chr1 754813 10 23
chr1 754840 30 10
chr1 754873 16 20
chr1 755955 26 14
...
where each line corresponds to a heterozygous position in the normal tissue.
REF_COUNT represents the number of reads supporting the reference
allele in the tumour, and VAR_COUNT represents read counts supporting the
other allele. The header line is required.
This file can be created from vcf files using the vcf2het.csh script provided:
vcf2het.csh filename.vcf tumor_column_name output min_depth
Where filename.vcf is the name of the vcf file that contains tumor and normal
genotypes; tumor_column_name if the name of the column (either column 10 or 11)
in filename.vcf that contains tumor data; output is the name of the output file;
and min_depth in an integer value indicating that only heterozygous positions
supported by at least that many reads in the normal will be extracted. Only
lines with FILTER value equal to PASS will be extracted.
Example:
If somatic_calls.vcf contains the following header
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Blood PancAdenocarcinoma
and only heterozygous positions supported by at least 30 reads are to be used,
then the call
vcf2het.csh somatic_calls.vcf PancAdenocarcinoma ARoutput.txt 30
will create the file ARoutput.txt. Since the column name PancAdenocarcinoma was
found in column 11, data for the normal tissue are assumed to be found in
column 10.
################################################################################
RUNNING THE EXAMPLE
As an overview of how to use the package, run the pipeline.r script, which
performs an analysis assuming only one clone is present. To run the script, from
the main celluloid directory, call from the command line (make sure to gunzip
the Data files first):
Rscript pipeline.r Data/tumour.wig Data/normal.wig Data/gc.wig Data/map.wig Data/AR.txt samplename
This script prepares the data and run the analysis. All data objects will be
saved as .rda files in a directory called RdaPipeline. The file
contour_solutions_samplename_1.pdf will also be created.
Details of the data preparation steps are found in
tutorial.2.load_and_show.txt
The analysis uses the segment-based objective function, further described in
tutorial.3.select_and_search.txt
The value of this objective function is an ad hoc measurement of the weighted
fraction of the tumour genome that is captured by the model, defined by values
of ploidy, cellularity and subclonal proportions. Another objective function,
suited for manual interventions, in also described in the tutorial.3 file.
The search for a solution is done by a call of the function coverParamSpace.
Local maxima of the objective function are found by using the optim() function
in R with 50 randomly chosen starting points.
Local solutions are found in the graphical output. See file
Figures/contour_solutions_samplename_1.pdf
The first panel represents the values of the objective function as a function of
the parameter n (the percentage of normal cells in the sequenced sample). This
is to help the user judge if the parameter space was well covered.
Then local solutions are shown in a series of contour plots. The x-axis
represents the (scaled) read counts in tumour segments, and the y-axis
represents values for the allelic ratios (proportion of reads supporting the
reference allele) evaluated at germline heterozygous positions in those
segments. White dots in the graphs represent actual segments.
The red dots and lines represent the integer copy-number states and allelic
ratios that are predicted given the estimates of tumour ploidy and cellularity.
The value of the objective function at the local solution is displayed on the
top left corner.
Details about the figures can be found in
tutorial.4.display_solutions.txt
In pipeline.r, coverParamSpace makes use of a dimension reduction trick (by
setting a value to the argument Sn), because cellularity and ploidy are
interconnected. A first step of this trick consists of estimating "LOH curves",
which may fail in practive in some datasets, in which case manual interventions
may be required. The LOH curves represent the lower and upper values of the AR
expected in regions of LOH, given the copy number state of the tumour in those
regions. These LOH curves are the two symmetrical black curved lines on the
output graphs. See the "OTHER OPTIONS: Sn" section in
tutorial.5.other_options.txt
The user must decide on a solution based on experience, and copy-number
segments can then be plotted:
library(celluloid)
files<-system("ls RdaPipeline/*rda",intern=T); for(f in files){load(f)}
prepCN(12,1,NULL)
ePP<-ePeakPos( S=0.624, t=c(0.073, 0.923), cn=cn )
tcs<- scaleReadCounts( tc , ePP )
segments<-scaleSegments(t.ar.seg , ePP )
segments<-annotateSegments(segments, ePP)
plotSegment( tcs,segments, ar , file="segments_page%1d",device="png",
width=960,height=1320, cex.axis=2, cex.main=2, cex.lab=2,
type="cairo" , tlwd=8 )
# XY chromosomes can't be annotated due to lack of ar
segmentsXY<-scaleSegments(t.seg , ePP )
plotSegment( tcs,segmentsXY, ar=NULL , file="segments_pageXY",device="png",
width=960,height=1320, cex.axis=2, cex.main=2, cex.lab=2,
type="cairo" , chr=c("chrX","chrY") , tlwd=8)
To fit two or more subclones, the user can refer to
tutorial.3.select_and_search