-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathtutorial.3.search_solutions.txt
409 lines (297 loc) · 14.1 KB
/
tutorial.3.search_solutions.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
0 1 2 3 4 5 6 7 8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
###############################################################################
## if the user has saved the Rda files, they can be loaded:
## files<-system("ls Rda/*.rda", intern=T); for( f in files){load(f)}
###############################################################################
The present file describe ways to obtain solutions for the ploidy, cellularity
and subclonal proportions. Two objective functions (functions that need to be
minimized with respect to the parameters) are described, one that is segment
-based and one that is peak-based. The pipeline.r file uses the former.
When a fully automatic search of solutions fails to provide a satisfying
answer, perharps because of extended noise or subclones, celluloid offers the
user the possibility of using manual interventions for greater control. These
are further described when presenting the peak-based objective function.
###############################################################################
1) Segment-based objective function
To find ploidy and cellularity estimates, we work with a set of segments, such
such as those in t.ar.seg. The two columns "mean" and "p" from t.ar.seg are
used for the fit.
We first search for a ploidy/cellularity solution by assuming a one clone model
(all tumor cells are derived from a single clone). The following function does
the search. To reduce the amount of noise, the user might want to first focus on
large segments:
sel<-t.ar.seg$size>10000000
subsegments<-t.ar.seg[sel,]
## set.seed(11111)
sol1<-coverParamSpace( segments=subsegments, optimFct=2, lowerF=c(0),
upperF=c(1),Sfrom=.25, Sto=2 , maxc=12 , control=list( maxit=1200 ) )
## save(sol1, file="Rda/sol1.rda")
The function tries to maximize an objective function (described below) over the
parameter space.
When optimFct=2, the search relies on a simulated annealing algorithm (GenSA).
The function will search for a global maximum.
The lowerF and upperF arguments are the lower and upper limits for the t
parameter vector, ignoring the last entry (ie, lower and upper bounds for
t[1:nsubclone]; here we assumed a single clone, so the t parameter is only t[1],
the percentage of normal cells).
The Sfrom and Sto arguments are the lower and upper bounds for S (the scaled
read counts expected in segments that have two copies in all cells; the
autosomal ploidy of the sequenced sample is 2/S). Thus, in the above call, the
ploidy of the sequenced sample is allowed to go from 2/Sto=1 to 2/Sfrom=8. The
percentage of normal cells in the sequenced sample is allowed to vary from 0%
(lowerF=c(0)) to 100% (upperF=c(1)).
The control argument is passed to the optimization function used. See ?GenSA
for details (if optimFct=2; see ?optim if optimFct=1).
The argument maxc represents the maximum number of copies of a segment that can
be found in any cell. Set it so that it's greater than the largest ploidy in the
search.
The maxc argument is passed to the function prepCN, a function called from
within coverParamSpace that creates a global data.frame with the name "cn" that
describes the allowed copy number configurations between the normal cells and
the tumour cells.
E.g., the call of prepCN(maxc=12) creates the data.frame:
cn
N T1
1 2 0
2 2 1
3 2 2
4 2 3
5 2 4
6 2 5
7 2 6
8 2 7
9 2 8
10 2 9
11 2 10
12 2 11
13 2 12
The first line represents a configuration where normal cells (column N) have
two copies and tumour cells (column T1) have 0 copies in a segment. The next
line represents a configuration where normal cells have two copies and tumour
cells 1 copy, etc. In the search, a segment is only allowed to have one of
these configurations.
The coverParamSpace function creates a global matrix called paramSpace that
contains the value of the objective function (first column) and the value of
c(S,t) in the other columns, at each iteration. It can be appended to between
different runs using the flag addToParamSpace=T. Plotting the value of the
objective function and the different parameters can help determine if the
parameter space was well covered and if other local minima are worth refining:
plot( as.data.frame(paramSpace), pch='.' )
or in details, for chosen parameters, eg,
plot( paramSpace[,2], paramSpace[,1], pch=19,cex=.5, xlab="S", ylab="value" )
See FigureN2.png.
The coverParamSpace function returns a list of lists containing among other
things the optimal parameters and the value of the objective function. It has
length equal to the number of starts (controled by nrep, here 1 by default).
I.e.,
sol1[[1]]$value
[1] 0.6555216
is the optimal value of the objective function and
sol1[[1]]$par
[1] 0.63470758 0.03951727
contains the optimal parameter vector c(S, t[1],..,t[nsubcl] ):
S<-sol1[[1]]$par[1]
t<-sol1[[1]]$par[ 2:length(sol1[[1]]$par ) ]
t<-c( t, 1-sum(t) )
S
[1] 0.6347076
t
[1] 0.03951727 0.96048273
The sequenced sample thus contain 4.0% normal cells and 96.0% tumor cells.
The ploidy of the tumor can be calculated as:
ploidy<-( 2/S - 2*t[1])/(1-t[1])
ploidy
[1] 3.198415
The list also includes paramSpace: sol1[[1]]$paramSpace.
A visual assessment of the solution can be obtained with
showTumourProfile(copyAr, maxPoints=50000 , flatten=.25 , nlev=20,
xlim=c(0,2) , nx=200, ny=50 )
plotModelPeaks( par=sol1[[1]]$par, cn=cn, epcol="red",epcex=1,eplwd=3 )
See FigureN3.png. Red vertical lines represent the integer copy number values in
tumors corresponding to the scaled read counts. The red dots represent the
values of allelic ratios for each copy number configuration found in cn. More
details and options can be found in
tutorial.4.display_solutions.txt
##
Inspection of FigureN2.png reveals that local maxima near S=0.3 might not have
been well covered by the simulated annealing search. The user can either use a
larger number of iterations, use grid search (see tutorial.5.other_options.txt)
or refine the search around that S value:
# set.seed(22222)
sol1.refined<-coverParamSpace( segments=subsegments, optimFct=2, lowerF=c(0),
upperF=c(1), Sfrom=.2, Sto=.4 , maxc=12 , control=list( maxit=400 ),
addToParamSpace=T )
# save( sol1.refined, file="Rda/sol1.refined.rda")
plot( paramSpace[,2], paramSpace[,1], pch=19,cex=.5, xlab="S", ylab="value" )
As a function of S, the global maximum of the objective function is well
defined.
##
Alternatively, instead of refining, the user can grab the parameters associated
with each point in
plot( paramSpace[,2], paramSpace[,1], pch=19,cex=.5, xlab="S", ylab="value" )
by using
id<-identify( paramSpace[,2], paramSpace[,1] , n=1 )
(click a point on the graph to identify it, say the local maximum near S=1.2).
Then the parameters corresponding to that point are
paramSpace[id,]
[1] 0.6078658 1.2472557 0.5052600 0.4947400
This solution can then be plotted to see if further refinement is needed:
showTumourProfile(copyAr, maxPoints=50000 , flatten=.25 , nlev=20,
xlim=c(0,2) , nx=200, ny=50 )
plotModelPeaks( S=1.2472557, t=c( 0.5052600 ,0.4947400 ),
cn=cn, epcol="red",epcex=1,eplwd=3 )
#####
Notice that the solution in sol1 did not capure the apparent peaks located
between 2 and 3 copies, and between 3 and 4 copies. The user can verify that
the next best solution in sol1.refined[[1]]$par is unsatisfactory.
The search can be refined around the solution found above to allow for two
subclones, in order to see if these extra peaks can be captured
## set.seed(33333)
sol2<-coverParamSpace( segments=subsegments, optimFct=2, lowerF=c(0.0395,0),
upperF=c(0.0396,.96),Sfrom=.6347, Sto=0.6348 , maxc=6 , maxsubcldiff=1,
control=list( maxit=100 ) )
showTumourProfile(copyAr, maxPoints=50000 , flatten=.25 , nlev=20,
xlim=c(0,2) , nx=200, ny=50 )
plotModelPeaks( par=sol2[[1]]$par, cn=cn, epcol="red",epcex=1,eplwd=3 )
The user can check if the parameter space (only involving the new parameter
representing the % of subclone 1) was well covered:
plot( paramSpace[,4], paramSpace[,1], pch=19, xlab="% Subclone 1",
ylab="value" )
## save( sol2, file="Rda/sol2.rda")
The value maxc=6 was chosen based on the one clone solution.
If >= 1, maxsubcldiff represents the upper bound for the difference in copy
number between any two subclones. If < 1, it represents the upper bound for the
ratio between the minimum and maximum copy number seen across all subclones
(note that when maxsubcldiff<1, a copy number 0 is treated as being 1 in the
calculation of this ratio). Maxc and maxsubcldiff are used in prepCN, a function
called from within coverParamSpace that creates a global data.frame with the
name "cn" that describes the allowed copy number configurations between all
subclones.
Eg, prepCN(maxc=6 , nsubcl=2 , maxsubcldiff=.5 ) creates the cn data.frame:
cn
N T1 T2
1 2 0 0
2 2 1 0
3 2 2 0
8 2 0 1
9 2 1 1
10 2 2 1
15 2 0 2
16 2 1 2
17 2 2 2
18 2 3 2
19 2 4 2
24 2 2 3
25 2 3 3
26 2 4 3
27 2 5 3
28 2 6 3
31 2 2 4
32 2 3 4
33 2 4 4
34 2 5 4
35 2 6 4
39 2 3 5
40 2 4 5
41 2 5 5
42 2 6 5
46 2 3 6
47 2 4 6
48 2 5 6
49 2 6 6
The first line represents a configuration where normal cells have two copies
and both subclones have 0 copies in a segment. The next line represents a
configuration where normal cells have two copies, the first subclone has 1 copy
and the second subclone has 0 copies, etc. The above cn data.frame describes
situations where the copy number in a subclone is only allowed to be at most
double the copy number of any other subclone.
The above call to coverParamSpace used maxsubcldiff=1, i.e. the assumption was
made that segments can not differ by more than 1 copy between the two subclones.
The function returned
sol2[[1]]$par
[1] 0.63470000 0.03953571 0.57168159
S<-sol2[[1]]$par[1]
t<-sol2[[1]]$par[ 2:length(sol2[[1]]$par ) ]
t<-c( t, 1-sum(t) )
S
[1] 0.6347
t
[1] 0.03953571 0.57168159 0.38878270
Allowing two tumor populations or subclones, the percentage of normal cells is
4.0%, the percentage cells in the first subclone is 57.2% and the percentage of
cells in the second subclone is 38.9%.
Using maxsubcldiff = 0.5 leads to more configurations in cn, and would lead to
additional solutions that would need to be manually curated. We tend to favor
parcimony first, then expanding cn if a satisfactory solution can't be found.
The user can verify this:
sol2.2<-coverParamSpace( segments=subsegments, optimFct=2, lowerF=c(0.0395,0),
upperF=c(0.0396,.96),Sfrom=.6347, Sto=0.6348 , maxc=6 , maxsubcldiff=0.5,
control=list( maxit=100 ) )
plot( paramSpace[,4], paramSpace[,1], pch=19, xlab="% Subclone 1",
ylab="value" )
#####
Instead of first finding a one-clone solution and then refining to find a
two-subclone solution (a stepwise approach that we recommend), the search could
have been done by directly by calling a two-subclone model:
## set.seed(44444)
sol3<-coverParamSpace( segments=subsegments, optimFct=2, lowerF=c(0,0),
upperF=c(1,1),Sfrom=.25, Sto=2 , maxc=6 , maxsubcldiff=1,
control=list( maxit=1000 ) )
however, manual curation is more difficult and challenging, and many more
iterations would be needed to garantee good coverage of the parameter space.
See tutorial.5.other_options.txt to search using optim() with grid-defined
starting points.
###############################################################################
###############################################################################
2) Peak-based objective function.
We define a "peak" to be a local maximum in a tumour profile contour plot,
either an observed peak (as seen in the contour plot) or an expected peak
(calculated in ways described in tutorial.1.background).
The objective function is described below. The advantage of using a peak-based
analysis as opposed to the set of segments is that the contour plot smooths out
a lot of noise. Also, the analysis is not driven by large segments: small,
informative ones can contribute as much than large ones if they are captured in
a peak.
A graphic window with a tumour profile must be displayed on screen:
cntr<-showTumourProfile(copyAr, maxPoints=50000 , flatten=.25 , nlev=20,
xlim=c(0,2) , nx=200, ny=50 )
axis(1)
The following function is used to select "peaks" that will enter the analysis.
With getLocal=F the user is prompted to manually select peaks with left
clicks. The selection ends with a right click.
sp <- selectPeaks( cntr, copyAr , getLocal=F )
Alternatively, peaks can be automatically selected with getLocal=T. The
function will make nrand (default 100) calls to optim() with
random starting points to find the local maxima.
sp <- selectPeaks( cntr, copyAr , getLocal=T ,
percentMax=.33, manual=F, nrand=200 , filtersymm=T )
where percentMax raises the "sea level" so that smaller peaks and noise are
ignored. Here it is set to be at 33% of max( cntr$z ) (which may have been
"flattened", depending on the call of showTumourProfle). If manual=T
then the user is asked to choose additional points.
Note that the graph is symmetrical. Try to only select one of the two
symmetrical points, but if not the function will do it for you.
The function returns a data.frame containing the coordinates that we
manually selected:
sp
x y
1 0.3301212 0.01
2 0.6290624 0.01
3 0.8208360 0.38
4 0.9392844 0.33
5 1.1197772 0.50
6 1.2382256 0.50
7 1.5428072 0.43
See Figure1.png.
## save( sp, file="Rda/sp.rda")
We aim to find the set of parameters for which the observed peaks are best
captured by expected peaks. The objective function to minimize is the total
(Euclidean) distance between the observed peaks and their closest expected
peaks.
The coverParamSpace function does the search, this time by using the
selectedPeaks argument (instead of the segments arguments for the segment-based
search). The only other difference is that the objective function needs to
to be minimized.
sol4<- coverParamSpace( selectedPeaks=sp , optimFct=2, lowerF=c(0), upperF=c(1),
Sfrom=.25, Sto=2 , maxc=12 , control=list( maxit=1000 ) )