-
Notifications
You must be signed in to change notification settings - Fork 2
/
tutorial.1.background_and_methods.txt
232 lines (180 loc) · 9.03 KB
/
tutorial.1.background_and_methods.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
0 1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
###############################################################################
The sequenced sample consist of a mixture of normal cells and tumour cells
that come from, perhaps, different subclones. Two subclones are different
with respect to at least one segment that has a different number of copies
between the two.
After alignment, reads are counted in 1 kbp bins (say) using functions from the
R package HMMcopy. These counts are then adjusted for the GC content of each bin
using LOESS (local) regression and scaled to the mean (scaled gc-corrected read
count [SRC]). Segmentation of the data in both tumor and normal tissue (say,
from matched non malignant tissue or from blood) is performed using penalized
least squares, as implemented in the R package copynumber. To each segment is
assigned the mean SRC value, calculated from the 1kb bins within the segment.
SRC is proportional to the mean number of chromosomes (copies), averaged over
all sequenced cells.
Germline heterozygous positions are extracted in the autosome, except in regions
of the genome where duplication or deletion events are observed in the normal
tissues. The number of reads supporting each allele (the reference allele – the
one observed on the reference human assembly – and the alternate allele) is
recorded from the tumor data and the allelic ratio (AR; the proportion of reads
supporting the reference allele) calculated. Each heterozygous position is also
paired with the SRC value of the segment it belongs to, evaluated from the tumor
data, to form pairs of values (SRC, AR). These pairs of points can be
represented in a three-dimensional graph as a contour (elevation) plot (see
Figure1.png). This figure is a visual representation of the autosomal-wide copy
number profile of the tumor. Each peak (or pair of peaks since the graph is
relected around AR=0.5) corresponds to a specific copy-number state that
summarizes both the average copy number (on the x-axis, once appropriately
scaled) and the ratio of relative abundance of maternal and paternal copies (on
the y-axis, once contamination from normal tissues – or tumor cellularity – is
accounted for).
Let S be the SRC that is to be expected in the sequenced sample in a segment
that has exactly two copies in all cells (normal and tumour cells). Let t
describes the percentage of the sequenced cells that are normal and derived from
the different subclones: t=c(t[1],t[2],..,t[nsubclones+1]); sum(t)=1. t[1] is
the percentage of normal cells in the sequenced sample. We may also use n to
represent t[1].
We define the autosomal ploidy of the sequenced sample (the mixture of tumor and
normal cells) as the relative abundance of DNA compared to that of a normal
haploid autosomal genomes. This sample ploidy is calculated to be 2/S.
For given S and t, and given copy number in all subclones (normal cells are
assumed to have 2 copies), the peak locations (expected SRC and
expected AR) can be predicted from mathematical expressions.
For this we need to work with a set of allowed copy-number configurations. In
the simplest case of a model with only one clone (where all segments have the
same number of copies in all tumour cells), the set of allowed copy-number
configurations is only determined by the maximum number of copies a segment can
have. In celluloid, this set is defined with a call of prepCN:
E.g., the call of prepCN(maxc=12) creates the data.frame:
cn
N T1
1 2 0
2 2 1
3 2 2
4 2 3
5 2 4
6 2 5
7 2 6
8 2 7
9 2 8
10 2 9
11 2 10
12 2 11
13 2 12
The first line represents a configuration where normal cells (column N) have
two copies and tumour cells (column T1) have 0 copies in a segment. The next
line represents a configuration where normal cells have two copies and tumour
cells 1 copy, etc. A segment is only allowed to have one of these
configurations. It is for these configurations that expected peak locations
are derived.
If two or more subclones are allowed, a segment can have different number of
copies in different tumour cells. Even though any combination of values could be
allowed, the copy-number configuration can be simplified by assuming that the
number of copies a segment has can not differ between two subclones by more than
some pre-specified number, such as in
prepCN(maxc=6, nsubcl=2, maxsubcldiff=1 )
cn
N T1 T2
1 2 0 0
2 2 1 0
8 2 0 1
9 2 1 1
10 2 2 1
16 2 1 2
17 2 2 2
18 2 3 2
24 2 2 3
25 2 3 3
26 2 4 3
32 2 3 4
33 2 4 4
34 2 5 4
40 2 4 5
41 2 5 5
42 2 6 5
48 2 5 6
49 2 6 6
This describe the number of copies a segment can have in normal cells (column
N), in tumour cells T1 and tumour cells T2. We specified with maxsubcldiff=1
that the number of copies of a segment can not differ between T1 and T2 by more
than 1.
These copy number configurations can be further broken down into number of (say)
maternal and paternal copies, so that the AR can be calculated once the
fractions in t are specified. This is what the prepAR calculates, a function
that is not intended to be called by the user. The function returns a list of
length equal to the number of lines in cn. For illustration, in a sample
that has 25% normal cells, 50% of cells in subclone 1 and 25% of cells in
subclone 2:
t<-c(.25,.50,.25)
ar<-prepAR( t )
then
ar[[4]]
0.2 0.8
represent the two allelic ratios expected in a segment that has 2 copies in
normal cells, and 1 copy in both T1 and T2 subclones (represented by the 4th
line in cn, or cn[4,]).
In the above call of prepAR, an assumption was made that if one parental copy is
present in equal or more copies than the other in one subclone, then it also has
to be present in equal or more copies in all subclone (i.e. the allelic ratio is
in the same direction in all subclones). This assumption can be relaxed with:
ar<-prepAR( t , preserveMatPat=F )
ar[[4]]
0.2 0.4 0.6 0.8
These allelic ratios further describe situations where, e.g., the maternal copy
was lost in subclone T1 while the paternal copy was lost in subclone T2
(independent clones), whereas with preserveMatPat=T (the default), the same
parental copy is assumed to have been lost in all subclones (nested subclones).
On the x-axis, segments with copy number configuration cn[4,] would be
expected to have a mean SRC value of
sum( cn[4,]*t )*S/2
Together, these quantities provide us with a set of expected positions of the
points (SRC,AR) (the red dots in Figure4.png and Figure5.png). We aim to find
the parameters S and t for which the observed segments or peaks are best
captured by the model.
Once S and t are found, the ploidy of the tumour is calculated to be
ploidy<-( 2/S - 2*t[1])/(1-t[1])
###############################################################################
In this section we illustrate how the segment-based objective function is
calculated. It is best to first read tutorial.3.search_solutions.txt.
For illustration purposes, let's plot the segments onto a contour plot, and add
to the graph the one-clone solution above.
par(mfrow=c(2,1))
prepCN(12)
cntr<-showTumourProfile(copyAr, maxPoints=50000 , flatten=.25 , nlev=20,
xlim=c(0,2) , nx=200, ny=50 )
# only focusing and plotting on large segments
sel<-t.ar.seg$size>10000000
subsegments<-t.ar.seg[sel,]
points(x<-subsegments$mean, y<-subsegments$p, pch=21, col="white", lwd=3, cex=2)
points( x, 1-y, pch=21 , col="white", lwd=3 , cex=2 )
ePP<-plotModelPeaks( par=ls1[[1]]$par ,cn=cn, epcol="red",epcex=1,eplwd=3 )
We define an objective function that takes values between 0 and 1, taking the
value 1 if 100% of the genome in subsegments is closely captured by the set of
expected peaks (the red dots) and 0 if all the of the genome is distant from the
model.
The distance between each segment and its closest expected peak is calculated.
If that distance is 0, that segment is assigned a weight of 1. The larger the
distance, the smaller the weight, down to a weight of 0 if that distance is
more than half the distance between two consecutive expected peak positions
(projected on the x-axis).
These distances (column $dist) and weights (column $we) are calculated
when the annotateSegments function is called
subsegments<-annotateSegments(subsegments, ePP)
To illustrate, the segments points above can be plotted with size proportional
to their weight:
# only plotting large segments
image(cntr, col=terrain.colors(20))
contour(cntr, nlev=20, add=T)
sel<-t.ar.seg$size>10000000
points( x<-subsegments$mean, y<-subsegments$p, pch=21 , col="white", lwd=2 ,
cex=2*subsegments$we )
points( x, 1-y, pch=21 , col="white", lwd=2 ,cex=2*subsegments$we )
plotModelPeaks( par=ls1[[1]]$par ,cn=cn, epcol="red",epcex=1,eplwd=3 )
The objective function that would need to be maximized is defined as:
sum(subsegments$size*subsegments$we)/sum(subsegments$size)
and would here take the value:
[1] 0.6672967
The weights can decrease either linearly or quadratically (the default) with the
distance.